Recently a paper in Science revealed yet another thing that genomicists already know. Namely, that “de-identified” genetic data can still be used to identify individuals.
How? By trawling publicly available databases, in this case a combination of the 1000 Genomes Project, databases maintained by genetic genealogists* and other freely available data sites. With internet access, a few key pieces of information and some resourceful thinking, computer expert/human geneticist Yaniv Erlich of the Whitehead Institute (M.I.T.) identified specific men by their genomic data. And it was easy, at least for someone with his mix of computing and genetics expertise.
My first response was a sigh. Then wait for it . . . and yes, minutes later the headline topped both the Health and Science sections on the New York Times’ website: “Web Hunt for DNA Sequences Leaves Privacy Compromised.” And “If your genome is public, so are you, researchers find” appeared in the L.A. Times. And “Donated genetic data ‘privacy risk’” in BBC News Health. And “Your Genetic Secrets May Not Be as Safe as You Think” on US News & World Report’s HealthDay site.
You get the idea.
Interestingly, I don’t really have a problem with the coverage contained in the articles. Erlich is an open data advocate who handled his findings with great care, and he is quoted repeatedly as saying he hopes his paper doesn’t chill data sharing but instead improves data management practices. As he and his collaborators conclude in the original paper:
. . . in our view, the appropriate response to genetic privacy challenges is not for the public to stop donating samples or for data sharing to stop. These would be devastating reactions that could substantially hamper scientific progress. Rather, we believe that establishing clear policies for data sharing, educating participants about the benefits and risks of genetic studies, and the legislation of proper usage of genetic information are pivotal ingredients to support the genomic endeavor.
Unfortunately, this point was typically left until last in the media reports as well. The overall up-front message was clear, and it was negative—providing genetic data is not “safe” and involves “risk.”
I had published a blog post advocating for broader genome education mere minutes before the Science embargo lifted, and here’s another reason why it’s so important. This paper furthers the vague sense that genome data is somehow scary and threatening, and its use conjures an odd mix of erroneous perceptions that it’s simultaneously medically irrelevant and somewhat dangerous. The inherent paradox would be amusing if it didn’t have the potential to, as the paper authors say, provoke “devastating reactions” based on fear of the unknown.
For the apparently small group that knows about the nuances of genome data, privacy is understood to be an ideal that, at the moment, is not practical. The Personal Genome Project’s consent form made the practical part of this very clear to me: in providing data that is, after all, the ultimate individual identifier in existence, complete de-identification is kind of laughable. But my personal response to this is a shrug. I will be public with my data, but if I were not, it would take some doing to identify me from my genomic data alone, especially for anyone who is not an M.I.T. researcher with a rather rare mix of skills. And even if you are, what are you going to do? The truth is that there’s little definitive even an expert could find in my genome now that would be of use to anyone except me and perhaps my doctor.
The fear reaction reminds me of the early days of e-commerce, when so many people believed that if they bought anything online with a credit card, the number would get hacked. But they’d hand that same credit card to a $5-an-hour waiter, sign an original receipt that had a carbon, and think nothing of it. Obviously, the risk was far greater in the latter action, but it was familiar and therefore felt safer. Now, of course, Amazon’s sales totals prove that those fears have been quelled. One can hope the same thing happens for genomic data sharing.
I know that people with genetic diseases may feel differently, for very valid reasons. And as we get more sophisticated in our analysis, more robust security methods will become ever more imperative to protect patients from abuse by, say, employers, insurers and others who might wish to make them pay for “defects.” For that reason, Dr. Erlich’s research is an important contribution, and it will lead to discussion and refinement of policies and education. At the same time, right now we need to share genomic data, then share it more, if we wish to unlock its true potential to benefit humankind.
*After reading the original paper and many, many reports on it, I’ll confess I don’t know exactly what these genealogical database sites actually are, other than that they were generally characterized as “recreational.” It’s also unclear to me whether the individuals identified in the study donated data specifically to the genealogical databases in addition to the 1000 Genomes Project, or if data sharing led to the data being available for cross-reference.