[liberationtech] [sunlightlabs] need advice on using hashes for preserving PII's utility for disambiguation while protecting sensitive info
Josh Tauberer
tauberer at govtrack.us
Thu Feb 6 13:27:21 PST 2014
On 02/06/2014 03:49 PM, Tom Lee wrote:
> Obviously the input key space for DLNs and most other personal ID
> numbers is so small that reversing this with a dictionary attack would
> be trivial. You can add a salt, but only on a per-entity basis (not a
> per-record basis) if you want to preserve the capacity to
> disambiguate. That in turn calls for a lookup table in which the
> input keys are stored, which kind of defeats the point of using a hash
> (you might as well just assign random output IDs for each input ID). I
> would worry about government's ability to keep this lookup table
> secure, and I worry about the brittleness of such a system.
And yet a lookup table mapping inputs to random outputs might be the
best worst option.
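To make the lookup-table idea concrete, here is a minimal sketch, not a vetted design. The class name PseudonymTable, the method public_id, and the sample ID strings are all made up for illustration. It hands out a random token the first time it sees a sensitive ID and returns the same token afterward, so records can still be disambiguated without the public ID being derived from the input in any way.

```python
import secrets


class PseudonymTable:
    """Hypothetical in-memory lookup table: sensitive ID -> random public ID."""

    def __init__(self):
        # This dict is the sensitive part; it must stay with the data source.
        self._by_input = {}

    def public_id(self, sensitive_id: str) -> str:
        # Assign a fresh 128-bit random token the first time an ID is seen,
        # and reuse it on every later lookup.
        if sensitive_id not in self._by_input:
            self._by_input[sensitive_id] = secrets.token_hex(16)
        return self._by_input[sensitive_id]


table = PseudonymTable()
a = table.public_id("D1234567")        # made-up ID, not a real DLN format
b = table.public_id("D7654321")
a_again = table.public_id("D1234567")  # same input, same public ID
```

The catch is the one Tom raised: the table itself has to be stored and kept secure, because anyone who gets it can reverse every public ID.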
Even if the right cryptographic method (hash, encryption, etc.) can be
found and is mathematically sound, I'd have /very/ low confidence that
it would be implemented correctly. Maybe one office does it right; the
next office says "hey, that's a great idea" but forgets that hashing a
four-digit PIN doesn't provide any obscurity, because there are only
10,000 possible inputs to try, etc. (That's not a jab at
government. Crypto is so hard.)
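As a rough illustration of why the four-digit case fails, the sketch below recovers a PIN from its unsalted SHA-256 digest by trying every candidate. The function names and the sample PIN are made up for this example, and SHA-256 stands in for whatever hash an office might pick. A fixed salt that an attacker knows would not change the picture.

```python
import hashlib


def sha256_hex(value: str) -> str:
    """Hex SHA-256 digest of a string (unsalted, as in the naive scheme)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def recover_pin(target_digest: str, digits: int = 4):
    """Try every zero-padded PIN of the given length; return the match or None."""
    for n in range(10 ** digits):
        candidate = str(n).zfill(digits)
        if sha256_hex(candidate) == target_digest:
            return candidate
    return None


published = sha256_hex("4821")      # what the office would release
recovered = recover_pin(published)  # exhaustive search over 10,000 candidates
```

The same loop works for any ID whose format is known and whose key space is small enough to enumerate, which is the dictionary attack described in the quoted message.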
I'd ask, for a particular case, what data does the data source already
have? If they /already/ have DLNs in their database, there's no added
privacy concern in creating a random mapping to unique identifiers for
public consumption. (Besides the mosaic effect, but that aside.)
Assuming the data source can make the distinction at all internally,
they must have /something/ already in their database.
HTH,
- Josh Tauberer (@JoshData)
http://razor.occams.info