[liberationtech] Opinion on a paper?

Joss Wright joss-liberationtech at pseudonymity.net
Sun Sep 9 15:30:33 PDT 2012


On Sun, Sep 09, 2012 at 07:19:22PM +0000, Paul Bernal (LAW) wrote:
 
> I wondered if anyone had an opinion on it - I don't have the technical
> knowledge to be able to evaluate it properly. The basic conclusion
> seems to be that re-identification of 'anonymised' data is not nearly
> as easy as we had previously thought (from the work of Latanya
> Sweeney, Paul Ohm etc). Are these conclusions valid?
> 
> My concern is that I can see this paper being used to justify all
> kinds of potentially risky information being released - particularly
> health data, which could get into the hands of insurance companies and
> others who could use it to the detriment of individuals. On the other
> hand, if the conclusions are really valid, then perhaps people like me
> shouldn't be as concerned as we are.

Hi Paul,

I've gone over this paper quite quickly, partially because it's late
here and I should be asleep; apologies for any bizarre turns of phrase,
repetition (hesitation or deviation...), or bad-tempered
comments. :)

I'll also certainly defer to the hardcore reidentification experts if
they turn up.

(This email has become slightly longer than I intended. To sum up:
"Lots of problems. False assumptions. Cherry-picked examples. Ignores or
wholly misunderstands subsequent decade of research. Somewhat
misrepresents statistics.  Wishful-thinking recommendations. Correct in
stating that we don't need to delete all data everywhere in order to
avoid reidentification, but that's about it.")

My initial response is that the paper is partially correct, in that the
Sweeney example was a dramatic, anecdotal demonstration of
reidentification and shouldn't be taken as representative of data in
general. On the other hand, the paper goes wildly off in the other
direction, and claims that the specifics of the Sweeney example somehow
demonstrate that reidentification in general is barely feasible and can
easily be handled with a few simple rules of thumb.

Overall, I would say that there are a number of serious flaws in the
author's argument.

Firstly, the paper is predicated almost entirely on what the author
refers to as `the myth of the perfect population register' -- that
almost no realistic database covers an entire population, and so any
apparently unique record could in fact also match someone outside of the
database. This is certainly true, but is used by the author to justify
an assumption that does not hold, in my opinion.

This assumption, the largest conceptual flaw in the paper, is that a
reidentification has to be unique and perfect to be of any value. The
author claims, based on the `perfect population register', that because
a reidentified record -- relating to, say, an individual's health
information -- could potentially belong to someone who wasn't in the
database, there is no guarantee that the record is accurate, and thus
the reidentification is useless. This is not true -- even such
partial or probabilistic reidentifications reduce the set of
possibilities, and reveal information regarding an individual. This can
be used and combined with further data sources to achieve either
reidentification, if that is the goal, or simply the revelation of
sensitive personal information.

As an example: Sweeney used William Weld's unique characteristics in
the voter database to reidentify his anonymous health data. Since some
hypothetical `Person X' who was not in the voter database could have
matched those apparently unique characteristics, the anonymous health
data could have belonged to Person X rather than to William Weld. As the
author notes, this was overcome in the Sweeney case by using public
information to confirm that the data was indeed Weld's --
the author seems to believe that any such auxiliary information for
other individuals could not reasonably exist, despite the existence of
Google and Facebook.
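
To make the mechanics concrete, here is a minimal sketch of the kind of
quasi-identifier linkage Sweeney performed, written in Python with
made-up field names and toy records rather than the actual datasets. The
point is only that the join itself is trivial once both datasets are to
hand:

    # Sweeney-style linkage attack, sketched with hypothetical toy data.
    # The 'anonymised' health records keep the quasi-identifiers
    # (zip, birthdate, sex); the public voter roll supplies the names.

    health_records = [
        {"zip": "02138", "birthdate": "1945-07-31", "sex": "M",
         "diagnosis": "..."},
        {"zip": "02139", "birthdate": "1962-01-15", "sex": "F",
         "diagnosis": "..."},
    ]

    voter_roll = [
        {"name": "W. Weld", "zip": "02138", "birthdate": "1945-07-31",
         "sex": "M"},
        {"name": "A. Nother", "zip": "02139", "birthdate": "1980-03-02",
         "sex": "F"},
    ]

    QUASI_IDENTIFIERS = ("zip", "birthdate", "sex")

    def quasi_id(record):
        # The attribute combination used to link the two datasets.
        return tuple(record[k] for k in QUASI_IDENTIFIERS)

    # Index the voter roll by quasi-identifier, then look up each health
    # record against it.
    voters_by_qid = {}
    for voter in voter_roll:
        voters_by_qid.setdefault(quasi_id(voter), []).append(voter["name"])

    for rec in health_records:
        candidates = voters_by_qid.get(quasi_id(rec), [])
        if len(candidates) == 1:
            # A unique match within the voter roll. Not proof -- the
            # record could belong to someone outside the roll -- but a
            # strong candidate to confirm against further auxiliary
            # information.
            print(candidates[0], "is the only registered voter matching",
                  quasi_id(rec))

The output is exactly the `strong candidate' discussed above, not a
certainty; the next step is to confirm it against other sources.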

The author takes from this that any partial or probabilistic
reidentification is therefore worthless, and claims that it was only the
widely publicized `auxiliary information' about William Weld's health
status that made such reidentification possible.

What the author fails to address is that exactly this kind of auxiliary
information is being made available with ever greater frequency by the
release of poorly-anonymised databases. As such, whilst the initial
reidentification cannot be made
with perfect accuracy, subsequent pieces of auxiliary information can be
used to verify, research and identify an individual. (Of course, an
attacker may simply be seeking to gain a given piece of sensitive
information, so a true `reidentification' may not be a useful goal in
considering the risks of such databases.)

The author states in the abstract that `... most re-identification
attempts face a strong challenge in being able to create a complete and
accurate population register', and claims that this strong assumption
underlies most other reidentification work. (Using the entirely
objective phrase `somewhat furtive "insider" trade secret'.) In fact,
this assumption is far too strong, and is made only by the author
themselves. I would point to the seminal Shmatikov and Narayanan work on
the Netflix Prize for a deeper analysis that shatters exactly this kind
of assumption. This claim by the author is something of a strawman
argument, and one on which the entire paper rests.

A second flaw is that the paper switches several times, according to
the argument needed, between an attacker who wants to identify a
targeted individual (`We need William Weld's data') and one for whom any
individual will do (`We need someone's data, but don't care whose it
is.'). These two scenarios raise very different problems, and different
sets of statistics, and need to be clearly separated in the analysis.
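
As a toy illustration of why the two framings need different statistics
(all quasi-identifier tuples below are invented), one can look at the
sizes of the quasi-identifier equivalence classes in a release:

    # 'Find this person' and 'find anyone' are different questions,
    # answered by different statistics over the same release.
    from collections import Counter

    # Hypothetical released records, reduced to quasi-identifier tuples.
    released_qids = [
        ("02138", "1945", "M"),
        ("02138", "1945", "M"),
        ("02139", "1962", "F"),   # unique in this release
        ("02140", "1980", "M"),   # unique in this release
    ]

    class_sizes = Counter(released_qids)

    # Targeted attack: is this particular person's combination unique?
    target = ("02139", "1962", "F")
    print("target is unique in the release:", class_sizes[target] == 1)

    # Untargeted attack: how many records can be singled out at all,
    # regardless of whose they are?
    unique_records = sum(1 for n in class_sizes.values() if n == 1)
    print("records unique in the release:", unique_records)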

A third flaw, related to the first and epitomised by the section
starting with the final paragraph of page 6, is the assumption that an
attacker would need to somehow build their perfect database before
reidentifying an individual. The author states that the attacker would
have to check all
other individuals outside of the original database to complete the
reidentification. In fact, they could simply seek alternative forms of
auxiliary information to make their reidentification more and more
certain. I do find it bizarre that the author makes this claim, as the
more intelligent approach of using auxiliary information is precisely
that employed by Sweeney in the case of William Weld.

The author does address the problem of probabilistic reidentification
in the latter stages of the paper (top of page 9) but dismisses it,
unreasonably, out of hand. I could write a whole essay on this
particular argument, but I'll simply note that even with a 35% chance of
error you have a very good starting point from which to find extra
auxiliary information and reduce your error to whatever level you decide
is acceptable. (This point should not be brushed aside: the author's
insistence that reidentification must be 100% certain is probably the
deepest flaw here.)
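
To put rough numbers on that -- the figures are invented, and I assume
the auxiliary attributes are independent -- a quick Bayes calculation
shows how fast a 35%-uncertain match firms up:

    # Back-of-the-envelope Bayes calculation with made-up numbers: how
    # quickly a 35%-uncertain match firms up as independent auxiliary
    # attributes are found to agree with the candidate.

    def posterior_error(prior_error, attr_match_prob, n_attrs):
        # prior_error:     chance the initial linkage is wrong (0.35)
        # attr_match_prob: chance a *wrong* candidate matches one
        #                  auxiliary attribute by coincidence (0.2)
        # n_attrs:         number of auxiliary attributes found to agree
        p_wrong = prior_error * attr_match_prob ** n_attrs
        p_right = 1 - prior_error   # the true person always matches
        return p_wrong / (p_wrong + p_right)

    for n in range(4):
        print("%d auxiliary attributes -> %.1f%% chance of error"
              % (n, 100 * posterior_error(0.35, 0.2, n)))
    # 0 -> 35.0%, 1 -> 9.7%, 2 -> 2.1%, 3 -> 0.4%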

A more worrying problem comes in the surprising lack of coverage of any
of the subsequent, and equally highly publicized, reidentification
attacks, or any of the developments in anonymisation since k-anonymity.
Even if we brush aside the vast amount of work on differential privacy,
which is extremely popular in anonymity research today, the author has
not addressed concepts such as l-diversity or t-closeness, which would
seem necessary for a reasonable study.

(As a quick example, consider this application of an l-diversity
problem: We cannot identify William Weld uniquely in the health
database, but we can isolate him as one of four people. All of those
four have been prescribed antidepressants in the last six months, and
three are being treated for an STD. No perfect reidentification, but
certainly a sensitive data leak for the poor governor.)
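
A sketch of that attribute-disclosure problem -- the one l-diversity is
designed to measure -- using hypothetical records standing in for the
group of four:

    # Attribute disclosure within an equivalence class: even without
    # picking one person out of the group, the distribution of sensitive
    # values leaks information. Records are hypothetical.
    from collections import Counter

    equivalence_class = [  # the four people Weld cannot be told apart from
        {"antidepressants": True, "std_treatment": True},
        {"antidepressants": True, "std_treatment": True},
        {"antidepressants": True, "std_treatment": True},
        {"antidepressants": True, "std_treatment": False},
    ]

    for attribute in ("antidepressants", "std_treatment"):
        counts = Counter(rec[attribute] for rec in equivalence_class)
        distinct = len(counts)                  # the 'l' in l-diversity
        worst = max(counts.values()) / len(equivalence_class)
        print("%s: %d distinct value(s); guessing the majority value is"
              " right %.0f%% of the time" % (attribute, distinct, 100 * worst))

An l of 1, as for the antidepressants here, means every member of the
group shares the sensitive value: no further narrowing-down is needed at
all.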

The total lack of coverage of, for example, Shmatikov and Narayanan's
reidentification of the Netflix Prize dataset, and of the (wonderful)
analysis and methodology used there, shows a worrying lack of
familiarity with the state of the art, and certainly calls into question
the conclusions drawn from the author's analysis.

I do find the total focus on the Sweeney example, and the picking apart
of the details, a very concerning example of the kind of thinking that
often surrounds anonymisation: that by fixing the specific problem that
you identify with a specific example, you can fix the wider problem.
This is a `patching up the holes' approach, rather than an attempt to
fix the problem systemically; this has rarely been shown to be an
effective strategy, particularly in computer security. ("This was caused by a
combination of gender, birthdate and zip code? Quick, make those
sensitive pieces of data!")

The recommendations at the end of the paper are simply unrealistic.
Point by point:

1) Make it illegal to reidentify data -- this approach has been
criticised at length in the literature, as the author acknowledges and
dismisses, but I would focus particularly here on how difficult it is to
detect reidentification attempts. This will stop only the most ethical
of attackers.

2) Require anyone linking in new data to maintain anonymity --
recognizes the problem of auxiliary information, but somehow ignores it
at the same time.

3) Give data `anonymous' status, but allow that status to be withdrawn
-- I assume that all the copies of the dataset will automatically
self-destruct once this status is withdrawn.

4) Specify that recipients must comply with restrictions -- if you can
state this then you have already solved most of the world's problems.
More seriously, this (and other recommendations here) seems to conflate
anonymised data that is shared with trusted researchers, which /is/ less
of a problem, with anonymised data that is released to the public. If
you are restricting access, there are a lot of extra approaches that you
can employ.

This is extremely important to understand, as publicly released datasets
continually combine to provide more and more auxiliary data. This is
why it is critically important that data for public release is
anonymised, as there is no realistic way to pull that data back once it
is in the public domain. All information is auxiliary information for
the next attack.

5) Require that data holders are secure -- again, this is a fine wish,
but gives nothing practical.

6) Data use agreements that pass on to further recipients -- trust is
not transitive, and this suffers from most of the same wishful-thinking
problems as the other recommendations here.

All of these recommendations are based on an assumption of trust, good
faith and playing by the rules. In short, entirely the opposite of
conventional security-based thinking. While we shouldn't throw away
everything to meet some puritanical ideal of security, we shouldn't
ignore an entire field of study because we don't like its conclusions.

I don't entirely dismiss the need for a regulatory approach to this. In
fact, several of these recommendations are reasonable if combined with
other, stronger, guarantees. There should be penalties for misuse of
data, or poor anonymisation, but they should be backed up at the
technical level by effective techniques that can safeguard information.

More importantly, none of these recommendations provides any kind of
practical or constructive approach to best practice for anonymising
data, or to weighing up the risks and effects of data release. This
follows the overall tone of the paper, which is that these risks are not
a concern.

The final conclusion of the paper is that the Sweeney example was not
representative, and with that much I agree; I nonetheless disagree with
almost all of the rest of the paper's analysis and conclusions. From the
choice language used -- particularly the `somewhat furtive "insider"
trade secret' -- the author clearly believes that researchers into
reidentification are massively and knowingly overplaying the chances of
reidentification. I resent that.

The one point on which I do agree is that there needs to be a balance
between the benefits of access to large-scale databases, and the risks
of reidentification. Where that point of balance should be is, I think,
something on which I would strongly disagree with the author; although
perhaps not as much as one might think.

I do fully appreciate that the author comes from the perspective of
wanting to use data for the greater good, and that some claims of the
risks of database release are overly cautious. This paper, though,
massively overstates the difficulties, and massively understates the
risks.

We should have a better understanding of the actual risks of
reidentification, and weigh this against the benefits from access to
aggregate personal data. The way to do this, however, is in a
broad-based study of the real-world risks, research into the means for
reidentification and anonymisation, and a systemic approach to the
protection of personal data; not by hand-waving away the risks by
picking apart one unrepresentative example and ignoring the subsequent
decade of active research into the area.

Happy to answer any other questions, on- or off-list.

Joss


