The Debate over De-Identified Data: When Anonymity Isn’t Assured

Not necessarily about writing or publishing, but an interesting 21st-century issue.

From Legal Tech News

As more algorithm-coded technology comes to market, the debate over how individuals’ de-identified data is being used continues to grow.

A class action lawsuit filed in a Chicago federal court last month highlights the use of sensitive de-identified data for commercial means. Plaintiffs represented by law firm Edelson allege the University of Chicago Medical Center gave Google the electronic health records (EHR) of nearly all of its patients from 2009 to 2016, with which Google would create products. The EHR, which is a digital version of a patient’s paper chart, includes a patient’s height, weight, vital signs and medical procedure and illness history.

While the hospital asserted it did de-identify data, Edelson claims the hospital included date and time stamps and “copious” free-text medical notes that, combined with Google’s other massive troves of data, could easily identify patients, in noncompliance with the Health Insurance Portability and Accountability Act (HIPAA).

. . . .

“I think the biggest concern is the quantity of information Google has about individuals and its ability to reidentify information, and this gray area of if HIPAA permits it if it was fully de-identified,” said Fox Rothschild partner Elizabeth Litten.

Litten noted that transferring such data to Google, which has a host of information collected from other services, makes labeling data “de-identified” risky in that instance. “I would want to be very careful with who I share my de-identified data with, [or] share information with someone that doesn’t have access to a lot of information. Or [ensure] in the near future the data isn’t accessed by a bigger company and made identifiable in the future,” she explained.

If the data can be reidentified, it may also fall under the scope of the European Union’s General Data Protection Regulation (GDPR) or California’s upcoming data privacy law, noted Cogent Law Group associate Miles Vaughn.

Link to the rest at Legal Tech News

De-identified data is presently an important component in the development of artificial intelligence systems.

As PG understands it, a large mass of data concerning almost anything, but certainly including data about human behavior, is dumped into a powerful computer which is tasked with discerning patterns and relationships within the data.

The more data about individuals that goes into the AI hopper, the more can be learned about groups of individuals, relationships between individuals and individual behavior patterns. Much of what is learned may not be generally known or discoverable by more traditional methods of data analysis.

As a crude example based upon the brief description in the OP, consider an artificially intelligent system that had access to both the medical records described in the OP and the usage records for individuals using Ventra cards (contactless digital payment cards that are electronically scanned) on the Chicago Transit Authority. Such a system could conceivably identify a specific individual associated with an anonymous medical record by correlating Ventra card use at a nearby transit stop with the time stamps on the digital medical record entries.
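For readers who think in code, the crude example above can be sketched in a few lines of Python. Everything here is invented for illustration: the records, the card IDs, the 30-minute window and the names `visits`, `taps` and `candidate_cards` are all assumptions, and nothing reflects the actual data in the lawsuit or any real Ventra record format.

```python
from collections import Counter
from datetime import datetime, timedelta

# Made-up "de-identified" medical record entries: (pseudonymous ID, time stamp).
visits = [
    ("patient_A", datetime(2016, 3, 1, 9, 5)),
    ("patient_A", datetime(2016, 3, 15, 14, 10)),
    ("patient_A", datetime(2016, 4, 2, 11, 0)),
    ("patient_B", datetime(2016, 3, 1, 9, 20)),
]

# Made-up transit card taps at the stop nearest the hospital: (card ID, time).
taps = [
    ("card_111", datetime(2016, 3, 1, 8, 50)),
    ("card_222", datetime(2016, 3, 1, 8, 55)),
    ("card_111", datetime(2016, 3, 15, 13, 58)),
    ("card_111", datetime(2016, 4, 2, 10, 47)),
    ("card_333", datetime(2016, 4, 2, 10, 40)),
]


def candidate_cards(patient_id, visits, taps, window=timedelta(minutes=30)):
    """Count, per card, how many of the patient's visits were preceded
    by a tap of that card within `window` (an assumed travel time)."""
    counts = Counter()
    for pid, visit_time in visits:
        if pid != patient_id:
            continue
        # Cards tapped shortly before this visit; a set so that each
        # card is counted at most once per visit.
        nearby = {
            card
            for card, tap_time in taps
            if timedelta(0) <= visit_time - tap_time <= window
        }
        counts.update(nearby)
    return counts


if __name__ == "__main__":
    for pid in ("patient_A", "patient_B"):
        print(pid, candidate_cards(pid, visits, taps).most_common())
```

With this toy data, one card precedes all three of patient_A's visits while the others match only once, so the repeat visits single it out. patient_B has a single visit and two equally plausible cards, which is why more time stamps make re-identification easier.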

11 thoughts on “The Debate over De-Identified Data: When Anonymity Isn’t Assured”

  1. ‘De-identified’ means they can’t tell who you are.

    Re-identified means that, by putting together all the pieces of the data, they now know everything available in that data and can assign it to an INDIVIDUAL.

    It’s no longer anonymized data. And it can happen dizzyingly quickly with the amount and kinds of data available. And the companies which sell that data don’t care. They can say ‘we did our job,’ KNOWING it will not stay anonymous when correlated with other data.

    • Never mind those companies having links on most pages to see where you go even when you don’t have their site open (like that little row of icons just above your comment).

      And it’s depressing how many sites you can’t even sign in to or change pages on if you’ve blocked certain Google scripts/services.

      • Actually, it is starting to be a plus for FireOS that it doesn’t use Google services. I have an older 8.9in FireHD and a bunch of apps gripe they can’t find google services…and run just fine anyway.

        • I’m talking about websites that won’t fully load unless you allow places like and to run their little scripts (running NoScript).

            • That DAZ 3D Studio I use is what I used as an example. No problems using Chrome as Google won’t block itself out of anything, but Firefox with NoScript running can’t even log the user in.

  2. Of course, none of this is actually new. These are the same methods that Russian secret police used a century and a half ago to find anti-tsarist agitators. They’re just more effective and efficient* now, and not limited to anti-tsarist agitators as the targets.

    To me, that’s what is most disturbing. “Data analytics” is military “traffic analysis” done by private parties… and so on. It’s just more-attractive marketing-speak for those who have power deciding that the privacy interests of the nameless masses are less important than, well, more power.

    * This is a snide attack on the law-and-economics movement’s worship of “efficiency” as an unmitigated good thing.

    • Traffic analysis? One of the most fascinating stories out of WWII was how the British cracked the German Enigma code. One key element was knowing that lots of their encoded messages started with “Heil Hitler.”

      • Although that’s an interesting note, it’s not traffic analysis. I can recommend an unclassified book (The Codebreakers by David Kahn) that discusses these sorts of things.

  3. Man, this really makes me angry. U of C is one of the top-tier systems in the Chicago healthcare market. And this is a major, major breach of HIPAA, for which they should pay an enormous fine. During my 45-year career in healthcare, now over, we could get fired if we so much as glanced at our own medical records. This breach is many layers worse.

  4. Twenty or thirty years ago scrubbing the identity off data was easy. My group made deals with customers all the time in which the customers handed over the data in their help desk systems to us after we showed them how to scrub out the PII (Personally Identifiable Information). Then we would use the scrubbed data to test new features and figure out how we could improve, much like Google uses data as training sets for AI products.

    Scrubbing was easy: irreversibly hash names, addresses, phone numbers, etc. and you were done. For us, that all changed about ten years ago when a graduate research group borrowed some of our PII-scrubbed data to see what they could do with it. We gave them binary dumps to play with. The dumps were pretty ugly: no schema, no structure, just terabytes of zeroes and ones. I didn’t think much about it until they came back with a reconstructed list of user names and addresses that they had sucked out of the dumps using pattern recognition algorithms cross-referenced with a few publicly available databases.

    I immediately had a long talk with our Safe Harbor lawyers and we quietly deleted terabytes of data we thought was squeaky clean of PII.

    We had underestimated the capacity of brute computing power to find and use correlations that are far from obvious.

    That was at least 10 years ago. There are orders of magnitude more computing capacity available today and the algorithms are now far more sophisticated. Also, the cows are out of the barn: the amount of public data for correlation has skyrocketed with the rise of cheap storage.

    I venture that removal of PII is now impossible. Period. Whether we like it or not, our lives are now much more public than we think. We have no alternative to learning to live with it.
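A side note on the "irreversibly hash" step described in comment 4: the sketch below is not the pattern-recognition attack that commenter's research group used. It shows a simpler, well-known weakness, namely that an unsalted hash of a value drawn from a small set of possibilities can be undone by hashing every possibility and comparing. The function names `scrub` and `recover`, the phone number and the 312-555 number block are all made up for illustration.

```python
import hashlib


def scrub(value: str) -> str:
    """'Scrub' a field the naive way: an unsalted SHA-256 digest."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def recover(target_hash: str):
    """Try every number in one hypothetical block (312-555-0000 through
    312-555-9999) and return the one whose digest matches, if any."""
    for n in range(10_000):
        guess = f"312-555-{n:04d}"
        if scrub(guess) == target_hash:
            return guess
    return None


if __name__ == "__main__":
    scrubbed = scrub("312-555-0142")  # a fictional phone number
    print(scrubbed)
    print(recover(scrubbed))
```

The hash function itself is not broken here. The weakness is that phone numbers, ZIP codes and common names come from lists short enough to try exhaustively, so the digest works as a lookup key.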
