[liberationtech] [sunlightlabs] need advice on using hashes for preserving PII's utility for disambiguation while protecting sensitive info

Thu Feb 6 13:27:21 PST 2014

On 02/06/2014 03:49 PM, Tom Lee wrote:
> Obviously the input key space for DLNs and most other personal ID 
> numbers is so small that reversing this with a dictionary attack would 
> be trivial. You can add a salt, but only on a per-entity basis (not a 
> per-record basis) if you want to preserve the capacity to 
> disambiguate. That in turn calls for a lookup table in which the 
> input keys are stored, which kind of defeats the point of using a hash 
> (you might as well just assign random output IDs for each input ID). I 
> would worry about government's ability to keep this lookup table 
> secure, and I worry about the brittleness of such a system.

And yet a lookup table mapping inputs to random outputs might be the 
best worst option.

Even if the right cryptographic method (hash, encryption, etc.) can be 
found and is mathematically sound, I'd have /very/ low confidence that 
it would be implemented correctly. Maybe one office does it right, and 
the next office says "hey, that's a great idea" but forgets that hashing 
a four-digit PIN doesn't provide any obscurity, since there are only 
10,000 possible inputs to try. (That's not a jab at government. Crypto 
is so hard.)
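
To make the PIN point concrete, here is a minimal sketch. The PIN value 
and the function names are made up for illustration, and SHA-256 is only 
an assumed choice of hash; any unsalted hash has the same problem.

```python
import hashlib


def sha256_hex(value: str) -> str:
    """Return the hex SHA-256 digest of a string."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def reverse_pin_hash(target_hex: str):
    """Recover a four-digit PIN from its unsalted hash by trying all 10,000."""
    for n in range(10_000):
        candidate = f"{n:04d}"  # zero-padded: "0000" through "9999"
        if sha256_hex(candidate) == target_hex:
            return candidate
    return None  # the input was not a four-digit PIN


# Hypothetical "anonymized" value as an office might publish it.
published = sha256_hex("4821")
recovered = reverse_pin_hash(published)
print(recovered)
```

The same loop works against DLNs or other ID numbers; the key space is 
larger, but still small enough to enumerate.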

I'd ask, for a particular case, what data does the data source already 
have? If they /already/ have DLNs in their database, there's no added 
privacy concern in creating a random mapping to unique identifiers for 
public consumption. (Besides the mosaic effect, but that aside.) 
Assuming the data source can make the distinction at all internally, 
they must have /something/ already in their database.
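
A rough sketch of the random-mapping idea, assuming an in-memory table 
(a real data source would need to persist it and keep it private). The 
class name, method name, and sample DLNs are hypothetical.

```python
import secrets


class PseudonymTable:
    """Maps each sensitive ID to a random public ID, assigned once."""

    def __init__(self):
        # sensitive ID -> public ID. This table is the secret to protect.
        self._by_input = {}

    def public_id(self, sensitive_id: str) -> str:
        # The output is random, not derived from the input, so it cannot
        # be reversed by enumerating possible inputs.
        if sensitive_id not in self._by_input:
            self._by_input[sensitive_id] = secrets.token_hex(16)
        return self._by_input[sensitive_id]


table = PseudonymTable()
a1 = table.public_id("D1234567")  # made-up DLN
a2 = table.public_id("D1234567")  # same person, second record
b1 = table.public_id("D7654321")  # a different made-up DLN
print(a1 == a2, a1 == b1)
```

Records for the same DLN get the same public ID, so disambiguation still 
works, and nothing is exposed unless the table itself leaks.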

HTH,

- Josh Tauberer (@JoshData)

http://razor.occams.info
