[liberationtech] Opinion on a paper?
Robert Munro
robert.munro at gmail.com
Sun Sep 9 19:20:44 PDT 2012
I second the criticism of the assumption of a 'perfect population
register'. This is a much broader problem, as the Netflix case showed.
For a good synopsis, see Pete Warden's take on the problem, which gives
some examples of how external data can be used to reverse anonymized
data, along with some suggestions for ways to operate with imperfect
anonymization:
http://strata.oreilly.com/2011/05/anonymize-data-limits.html
You certainly don't need to be high-profile, either, as the article
suggests. Last year I was working on disease outbreak tracking. There
was an actual case where a girl in East Africa had been reported as
testing positive for Ebola. Her village was named in reports, and this
was a region where victims of diseases are often vilified and
sometimes killed. She would likely have been the only person from her
village rushed to a hospital at that time (and more likely still the
only girl of her age-bracket), so it would have been simple for anyone
from her village to make the connection immediately. We decided not to
publish this information, even though many other health organizations
did. Her diagnosis was ultimately incorrect, which doesn't really
affect the anonymization issue, but it makes any identification and
vilification even more disturbing.
We were information managers and health professionals, not lawyers,
and the international aspect no doubt complicates things. I assume
that the health organizations who did publicize this acted within the
law. For us, that wasn't enough. If it had been reported in a health
journal 5 years later? That might have been ok. But as a real-time
report it was clearly unethical. I doubt the other organizations
published this out of malice - it was one piece of information among
many - but it highlights the problem.
Rob
On 9 September 2012 15:30, Joss Wright
<joss-liberationtech at pseudonymity.net> wrote:
> On Sun, Sep 09, 2012 at 07:19:22PM +0000, Paul Bernal (LAW) wrote:
>
>> I wondered if anyone had an opinion on it - I don't have the technical
>> knowledge to be able to evaluate it properly. The basic conclusion
>> seems to be that re-identification of 'anonymised' data is not nearly
>> as easy as we had previously thought (from the work of Latanya
>> Sweeney, Paul Ohm etc). Are these conclusions valid?
>>
>> My concern is that I can see this paper being used to justify all
>> kinds of potentially risky information being released - particularly
>> health data, which could get into the hands of insurance companies and
>> others who could use it to the detriment of individuals. On the other
>> hand, if the conclusions are really valid, then perhaps people like me
>> shouldn't be as concerned as we are.
>
> Hi Paul,
>
> I've gone over this paper quite quickly, partially because it's late
> here and I should be asleep; apologies for any bizarre turns of phrase,
> repetition (hesitation or deviation...), or bad-tempered
> comments. :)
>
> I'll also certainly defer to the hardcore reidentification experts if
> they turn up.
>
> (This email has become slightly longer than I intended. To sum up:
> "Lots of problems. False assumptions. Cherry-picked examples. Ignores or
> wholly misunderstands subsequent decade of research. Somewhat
> misrepresents statistics. Wishful-thinking recommendations. Correct in
> stating that we don't need to delete all data everywhere in order to
> avoid reidentification, but that's about it.")
>
> My initial response is that the paper is partially correct, in that the
> Sweeney example was a dramatic, anecdotal demonstration of
> reidentification and shouldn't be taken as representative of data in
> general. On the other hand, the paper goes wildly off in the other
> direction, and claims that the specifics of the Sweeney example somehow
> demonstrate that reidentification in general is barely feasible and can
> easily be handled with a few simple rules of thumb.
>
> Overall, I would say that there are a number of serious flaws in the
> arguments of the author.
>
> Firstly, the paper is predicated almost entirely on what the author
> refers to as `the myth of the perfect population register' -- that
> almost no realistic database covers an entire population, and so any
> apparently unique record could in fact also match someone outside of the
> database. This is certainly true, but is used by the author to justify
> an assumption that does not hold, in my opinion.
>
> This assumption, the largest conceptual flaw in the paper, is that a
> reidentification has to be unique and perfect to be of any value. The
> author claims, based on the `perfect population register', that because
> a reidentified record (relating to, say, an individual's health
> information) could potentially match someone who wasn't in the
> database, there is no guarantee that the record is accurate, and thus
> the reidentification is useless. This is not true -- even partial or
> probabilistic reidentifications reduce the set of possibilities and
> reveal information about an individual. That information can then be
> combined with further data sources to achieve either full
> reidentification, if that is the goal, or simply the revelation of
> sensitive personal information.
>
> As an example: Sweeney used William Weld's unique characteristics in
> the voter database to reidentify his anonymous health data. As some
> hypothetical `Person X' who was not in the voter database could have
> matched those apparently unique characteristics, the anonymous health
> data could have belonged to Person X rather than William Weld. As the
> author notes, this is overcome in the Sweeney case by making use of
> public information to confirm that the data was that of William Weld --
> the author seems to believe that any such auxiliary information for
> other individuals could not reasonably exist, despite the existence of
> Google and Facebook.
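>
> (A toy sketch of that linkage, with entirely made-up records and field
> names rather than anything from the actual voter or hospital data, just
> to show the mechanics: join the `anonymised' release against a public
> register on the quasi-identifiers and see how small the candidate set
> becomes.)
>
>     # Hypothetical records, invented purely for illustration.
>     anonymised_health = [
>         {"zip": "02138", "dob": "1945-01-01", "sex": "M",
>          "diagnosis": "(sensitive)"},
>         {"zip": "02139", "dob": "1962-06-15", "sex": "F",
>          "diagnosis": "(sensitive)"},
>     ]
>     voter_register = [
>         {"name": "Person A", "zip": "02138", "dob": "1945-01-01", "sex": "M"},
>         {"name": "Person B", "zip": "02139", "dob": "1980-03-02", "sex": "F"},
>     ]
>
>     QUASI_IDENTIFIERS = ("zip", "dob", "sex")
>
>     def candidates(record, register):
>         # Everyone in the register matching the record's quasi-identifiers.
>         return [p for p in register
>                 if all(p[k] == record[k] for k in QUASI_IDENTIFIERS)]
>
>     for record in anonymised_health:
>         matches = candidates(record, voter_register)
>         if len(matches) == 1:
>             # Not a certain reidentification (someone outside the register
>             # could also match), but a strong lead to confirm against
>             # further auxiliary sources, as Sweeney did.
>             print(matches[0]["name"], "->", record["diagnosis"])
>
> The point is not that a single match is proof; it is that it turns an
> `anonymous' record into a short, checkable list of names.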
>
> The author takes from this that any partial or probabilistic
> reidentification is therefore worthless, and claims that it was only the
> widely publicized `auxiliary information' about William Weld's health
> status that made such reidentification possible.
>
> What the author fails to address is that exactly this kind of
> auxiliary information is being made available with greater and greater
> frequency by the release of poorly-anonymised databases. As such,
> whilst the initial reidentification cannot be made with perfect
> accuracy, subsequent pieces of auxiliary information can be used to
> verify, research and identify an individual. (Of course, an attacker
> may simply be seeking a given piece of sensitive information, so a
> full `reidentification' may not even be the relevant goal when
> considering the risks of such databases.)
>
> The author states in the abstract that `... most re-identification
> attempts face a strong challenge in being able to create a complete and
> accurate population register', and claims that this strong assumption
> underlies most other reidentification work. (Using the entirely
> objective phrase `somewhat furtive "insider" trade secret'.) In fact,
> this strong assumption is entirely too strong, and is given as an
> assumption only by the author themselves. I would point to the seminal
> Shmatikov and Narayanan work on the Netflix Prize for a deeper analysis
> that shatters exactly this kind of assumption. This claim by the author
> is something of a straw-man argument, and one on which the entire paper
> is based.
>
> A second flaw comes in switching several times, according to the
> argument needed, as to whether the attacker is interested in identifying
> a targeted individual (`We need William Weld's data'), or whether any
> individual will do (`We need someone's data, but don't care who it
> is.'). These raise very different problems, and different sets of
> statistics, and need to be clearly separated in analysis.
>
> A third flaw, related to the first and epitomised by the section
> starting with the final paragraph of page 6, is that an attacker would
> need to somehow build their perfect database before reidentifying an
> individual. The author states that the attacker would have to check all
> other individuals outside of the original database to complete the
> reidentification. In fact, they could simply seek alternative forms of
> auxiliary information to make their reidentification more and more
> certain. I do find it bizarre that the author makes this claim, as the
> more intelligent approach of using auxiliary information is precisely
> that employed by Sweeney in the case of William Weld.
>
> The author does address the problem of probabilistic reidentification
> in the latter stages of the paper (top of page 9), but dismisses it
> entirely, and unreasonably, out of hand. I could write a whole essay on
> this particular argument, but I'll simply note that even with a 35%
> chance of error, you have a very good starting point from which to find
> extra auxiliary information and reduce your error to whatever level you
> decide is acceptable. (This should not be ignored, however, as the
> author's insistence that reidentification must be 100% certain is
> probably the deepest flaw here.)
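>
> (Back-of-the-envelope, with invented numbers, of how quickly that 35%
> error can be driven down: suppose the initial match is right with
> probability 0.65, and you then check one independent auxiliary
> attribute that the true person always matches but that only one wrong
> candidate in ten would match.)
>
>     # Invented numbers, purely to illustrate the Bayesian update.
>     prior = 0.65             # initial confidence in the match (35% error)
>     p_match_if_right = 1.0   # the true person always matches the attribute
>     p_match_if_wrong = 0.10  # a wrong candidate matches it 10% of the time
>
>     posterior = (prior * p_match_if_right) / (
>         prior * p_match_if_right + (1 - prior) * p_match_if_wrong)
>     print(round(posterior, 3))  # ~0.949 after one attribute; repeat for more
>
> Two or three such checks and the `35% error' is gone.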
>
> A more worrying problem comes in the surprising lack of coverage of any
> of the subsequent, and equally highly publicized, reidentification
> attacks, or any of the developments in anonymisation since k-anonymity.
> Even if we brush aside the vast amount of work on differential privacy,
> which is extremely popular in anonymity research today, the author has
> not addressed concepts such as l-diversity or t-closeness, which would
> seem necessary for a reasonable study.
>
> (As a quick example, consider this application of an l-diversity
> problem: We cannot identify William Weld uniquely in the health
> database, but we can isolate him as one of four people. All of those
> four have been prescribed antidepressants in the last six months, and
> three are being treated for an STD. No perfect reidentification, but
> certainly a sensitive data leak for the poor governor.)
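>
> (Again, a toy sketch with invented values rather than anything from
> real health data, just to make the homogeneity problem concrete: the
> release below is 4-anonymous, yet still leaks the sensitive attributes.)
>
>     # One hypothetical equivalence class from a k-anonymised release:
>     # four records sharing the same generalised quasi-identifiers.
>     equivalence_class = [
>         {"zip": "021**", "age": "60-70",
>          "antidepressants": True, "std_treatment": True},
>         {"zip": "021**", "age": "60-70",
>          "antidepressants": True, "std_treatment": True},
>         {"zip": "021**", "age": "60-70",
>          "antidepressants": True, "std_treatment": True},
>         {"zip": "021**", "age": "60-70",
>          "antidepressants": True, "std_treatment": False},
>     ]
>
>     # k-anonymity (k=4) holds: we cannot say which row is the governor.
>     # But the sensitive values are nearly homogeneous, so we learn them
>     # anyway, with certainty or with high probability.
>     n = len(equivalence_class)
>     p_antidep = sum(r["antidepressants"] for r in equivalence_class) / n
>     p_std = sum(r["std_treatment"] for r in equivalence_class) / n
>     print(f"P(antidepressants) = {p_antidep:.2f}")
>     print(f"P(STD treatment)   = {p_std:.2f}")
>     # -> 1.00 and 0.75: a sensitive disclosure, with no unique
>     # reidentification.
>
> This is exactly the gap that l-diversity and t-closeness were designed
> to address.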
>
> The total lack of coverage of, for example, Shmatikov and Narayanan's
> reidentification of the Netflix Prize dataset, and of the (wonderful)
> analysis and methodology used there, shows a worrying lack of
> familiarity with the state of the art, and certainly calls into
> question the conclusions drawn from the author's analysis.
>
> I do find the total focus on the Sweeney example, and the picking apart
> of the details, a very concerning example of the kind of thinking that
> often surrounds anonymisation: that by fixing the specific problem that
> you identify with a specific example, you can fix the wider problem.
> This is a `patching up the holes' approach, rather than an attempt to
> systemically fix a problem; this has rarely been shown to be an effective
> strategy, particularly in computer security. ("This was caused by a
> combination of gender, birthdate and zip code? Quick, make those
> sensitive pieces of data!")
>
> The recommendations at the end of the paper are simply unrealistic.
> Point by point:
>
> 1) Make it illegal to reidentify data -- this approach has been
> criticised at length in the literature, as the author acknowledges and
> dismisses, but I would focus particularly here on how difficult it is to
> detect reidentification attempts. This will stop only the most ethical
> of attackers.
>
> 2) Require anyone linking in new data to maintain anonymity --
> recognizes the problem of auxiliary information, but somehow ignores it
> at the same time.
>
> 3) Give data `anonymous' status, but allow that status to be withdrawn
> -- I assume that all the copies of the dataset will automatically
> self-destruct once this status is withdrawn.
>
> 4) Specify that recipients must comply with restrictions -- if you can
> state this then you have already solved most of the world's problems.
> More seriously, this (and other recommendations here) seems to conflate
> anonymised data that is shared with trusted researchers, which /is/ less
> of a problem, with anonymised data that is released to the public. If
> you are restricting access, there are a lot of extra approaches that you
> can employ.
>
> This is extremely important to understand, as publicly released
> datasets continually combine to provide more and more auxiliary data.
> This is why it is critically important that data for public release is
> properly anonymised, as there is no realistic way to pull that data
> back once it is in the public domain. All information is auxiliary
> information for the next attack.
>
> 5) Require that data holders are secure -- again, this is a fine wish,
> but gives nothing practical.
>
> 6) Data use agreements that pass on to further recipients -- trust is
> not transitive, and this suffers from most of the same wishful-thinking
> problems as the other recommendations here.
>
> All of these recommendations are based on an assumption of trust, good
> faith and playing by the rules. In short, entirely the opposite of
> conventional security-based thinking. While we shouldn't throw away
> everything to meet some puritanical ideal of security, we shouldn't
> ignore an entire field of study because we don't like its conclusions.
>
> I don't entirely dismiss the need for a regulatory approach to this. In
> fact, several of these recommendations are reasonable if combined with
> other, stronger, guarantees. There should be penalties for misuse of
> data, or poor anonymisation, but they should be backed up at the
> technical level by effective techniques that can safeguard information.
>
> More importantly, none of these recommendations provide any kind of
> practical or constructive approach to best practice for anonymising
> data, or any guidance on how to weigh up the risks and effects of data
> release. This seems to follow from the overall tone of the paper, which
> is that these risks are not a concern.
>
> The paper's final conclusion is that the Sweeney example was not
> representative, and with that I agree; I also wholly disagree with
> almost all of the paper's analysis and other conclusions. From the
> choice use of
> language regarding, particularly, the `somewhat furtive "insider" trade
> secret', the author clearly believes that researchers into
> reidentification are massively and knowingly overplaying the chances of
> reidentification. I resent that.
>
> The one point on which I do agree is that there needs to be a balance
> between the benefits of access to large-scale databases, and the risks
> of reidentification. Where that point of balance should be is, I think,
> something on which I would strongly disagree with the author; although
> perhaps not as much as one might think.
>
> I do fully appreciate that the author comes from the perspective of
> wanting to use data for the greater good, and that some claims of the
> risks of database release are overly cautious. This paper, though,
> massively overstates the difficulties, and massively understates the
> risks.
>
> We should have a better understanding of the actual risks of
> reidentification, and weigh this against the benefits from access to
> aggregate personal data. The way to do this, however, is in a
> broad-based study of the real-world risks, research into the means for
> reidentification and anonymisation, and a systemic approach to the
> protection of personal data; not by hand-waving away the risks by
> picking apart one unrepresentative example and ignoring the subsequent
> decade of active research into the area.
>
> Happy to answer any other questions, on- or off-list.
>
> Joss
--
Idibon
www.idibon.com
www.robertmunro.com