[liberationtech] need advice on using hashes for preserving PII's utility for disambiguation while protecting sensitive info

Thu Feb 6 12:49:14 PST 2014

We've been kicking around an idea at Sunlight that aims to use
cryptographic techniques to resolve some of the concerns around the
publication of personally identifiable information (PII) in government
disclosures. I could use some smart people to tell me what's dumb about it.

We often face challenges related to disambiguating entities: is the John
Smith who gave political donation A the same John Smith that gave political
donation B? One obvious solution to this problem is to push to expand the
information that's collected and disclosed -- if we had John's driver's
license number (DLN), for instance, it'd be easy to disambiguate these
records. But that could introduce privacy concerns for John. One approach
to this problem (which I don't think government has tried) is employing a
one-way hash.

Obviously the input key space for DLNs and most other personal ID numbers
is so small that reversing a plain hash with a dictionary attack would be
trivial. You can add a salt, but only on a per-entity basis (not a
per-record basis) if you want to preserve the capacity to disambiguate.
That in turn calls for a lookup table in which the input keys are stored,
which kind of
defeats the point of using a hash (you might as well just assign random
output IDs for each input ID). I would worry about government's ability to
keep this lookup table secure, and I worry about the brittleness of such a
system.
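
To make the dictionary-attack concern concrete, here is a minimal Python
sketch. It assumes a hypothetical 5-digit, all-numeric ID format (real DLN
formats vary by state and are larger) and SHA-256 as the unsalted hash;
both are illustrative choices, not anything a disclosure system is known
to use.

```python
import hashlib


def sha256_hex(id_number):
    """Unsalted one-way hash of an ID number, as a hex string."""
    return hashlib.sha256(id_number.encode("utf-8")).hexdigest()


def build_dictionary(digits):
    """Hash every possible ID of the given length: maps hash -> ID."""
    table = {}
    for n in range(10 ** digits):
        candidate = f"{n:0{digits}d}"
        table[sha256_hex(candidate)] = candidate
    return table


# Hypothetical 5-digit ID space, kept small so the demo runs quickly.
table = build_dictionary(5)

# A "protected" value as it might appear in a published record.
published = sha256_hex("04217")

# The attacker reverses it with a single dictionary lookup.
recovered = table[published]
```

The same loop scales linearly with the size of the ID space, so a
realistic 8- or 9-character format would take longer to enumerate but
would still be within easy reach of commodity hardware.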

Alternately, you can use a single system-wide secret (or set of secrets) to
transform inputs into reliable outputs. I think this is less brittle and
maybe easier to preserve as a secret, but this system might be too easily
reversible given the ability to observe its outputs and know the universe
of possible inputs. I'm unsure of the cryptographic options that might be
appropriate here.
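
One candidate for the system-wide-secret approach is a keyed hash such as
HMAC. The sketch below is an assumption about how that could look, not a
description of any existing system: the key handling, the `pseudonym`
helper name, and the sample ID strings are all hypothetical.

```python
import hashlib
import hmac
import secrets

# Hypothetical system-wide secret. In practice it would be generated once
# and stored securely by the publishing agency, not regenerated per run.
SECRET_KEY = secrets.token_bytes(32)


def pseudonym(id_number, key=SECRET_KEY):
    """Map an ID number to a stable pseudonym using HMAC-SHA256."""
    return hmac.new(key, id_number.encode("utf-8"), hashlib.sha256).hexdigest()


# The same input always yields the same output, so records can be linked.
a = pseudonym("D1234567")
b = pseudonym("D1234567")

# A different input yields a different output.
c = pseudonym("D7654321")

# A different key yields unrelated pseudonyms for the same input.
d = pseudonym("D1234567", key=b"some-other-key")
```

With a construction like this, enumerating the input space only helps
someone who also holds the key; HMAC-SHA256 has no publicly known attack
that recovers the key from observed outputs. The trade-off is that the
key becomes a single point of failure: anyone who obtains it can run the
dictionary attack against every published record at once.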

For all I know, the lack of implementations using this kind of one-way
transformation isn't about government sluggishness but rather about its
feasibility. I'd be very curious to hear folks' ideas on this score, though.
My general hunch is that something must be possible -- even a few bits'
worth of disambiguating information would be hugely useful to us, and
presumably you're not leaking important amounts of information by, say,
sharing the last digit of a DLN. So there must be a spectrum of options.
But as is probably apparent, I don't think I've got a handle on how to
think about this problem rigorously.
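
One way to reason about the "few bits" end of that spectrum is to publish
only a short tag per ID and count how many IDs share each tag. The sketch
below assumes a truncated HMAC as the tag and a hypothetical population of
100,000 six-digit IDs; the key, the `short_tag` helper, and the 4-bit tag
length are illustrative choices only.

```python
import hashlib
import hmac
from collections import Counter


def short_tag(id_number, key, bits):
    """Return only the top `bits` bits of HMAC-SHA256(key, id_number)."""
    digest = hmac.new(key, id_number.encode("utf-8"), hashlib.sha256).digest()
    return int.from_bytes(digest, "big") >> (256 - bits)


key = b"demo-key-not-for-real-use"  # hypothetical; a real key must be secret

# Hypothetical population: every six-digit ID below 100000.
population = [f"{n:06d}" for n in range(100_000)]

# Count how many IDs fall into each of the 2**4 = 16 possible tags.
buckets = Counter(short_tag(id_number, key, bits=4) for id_number in population)

# The smallest bucket is the worst-case crowd an individual hides in.
smallest = min(buckets.values())
```

With 4 bits, each tag is shared by roughly 100,000 / 16 = 6,250 IDs, so a
tag alone identifies nobody. Two records with different tags must belong
to different people, while a matching tag is only weak evidence that they
are the same person. Adding bits strengthens the disambiguation and
shrinks the crowd in equal measure.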

Tom