The genetic data of more than 1,000 people from around the world seemed stripped of anything that might identify them individually. All that was posted online was research data, the ages of the individuals and the region where each of them lived.
But when a researcher randomly selected the DNA sequences of five people in the database, he not only figured out who they were but also identified their entire families, although the relatives had no part in the study. His foray ended up breaching the privacy of nearly 50 people.
And all it took was triangulation, using the genetic data, a genealogy website and Google searches. The methods for extracting relevant genetic data from the raw sequence files were specialized enough to be beyond the scope of most laypeople, but no one expected it would be so easy to zoom in on individuals.
"We are in what I call an awareness moment," said Eric D. Green, director of the National Human Genome Research Institute at the National Institutes of Health.
The researcher did not publish the names he found. But the exercise revealed a growing tension between the advancement of medical research, which often requires making genetic information public so scientists can use it, and protecting the privacy of study subjects.
The paper, published Thursday in the journal Science, follows other reports that identified people whose genetic data were online. But none had started with such limited information: just the long string of DNA letters, an age and, because the study focused on only U.S. subjects, a state.
"I've been worried about this for a long time," said Barbara Koenig, a researcher at the University of California, San Francisco, who studies issues involving genetic data. The new paper is amazing, she said.
The project was the inspiration of Yaniv Erlich, a human genetics researcher at the Whitehead Institute, which is affiliated with MIT. He stresses that he is a strong advocate of data sharing and would hate to see genomic data locked up. But when his lab developed a new technique, he realized he had the tools to probe a DNA database.
The tool allowed him to find a type of DNA pattern that looks like stutters among billions of chemical letters in human DNA. Those little stutters, short tandem repeats, are inherited.
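To make the "stutter" idea concrete, here is a minimal sketch of counting a short tandem repeat in a string of DNA letters. It is not the lab's tool; the function name, the motif and the toy sequence are assumptions chosen only for illustration.

```python
import re


def longest_tandem_run(sequence: str, motif: str) -> int:
    """Return the largest number of back-to-back copies of `motif` in `sequence`."""
    # The lookahead tries every start position, so a longer run that begins
    # inside a shorter one is not missed.
    pattern = re.compile(f"(?=((?:{re.escape(motif)})+))")
    runs = [len(m.group(1)) // len(motif) for m in pattern.finditer(sequence)]
    return max(runs, default=0)


# Toy sequence (made up): five copies of GATA, a spacer, then two more copies.
toy_sequence = "ACGT" + "GATA" * 5 + "TTC" + "GATA" * 2 + "AC"
print(longest_tandem_run(toy_sequence, "GATA"))  # prints 5
```

The repeat count at a given spot, here 5, is the kind of number that is passed from father to son and that the rest of the search relies on.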
Genealogy websites use repeats on the Y chromosome, the one unique to men, to identify men by their surnames, an indicator of ancestry. Any man can submit the short tandem repeats on his Y chromosome and find the surname of men with the same DNA pattern. The sites enable men to find their ancestors and relatives.
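A surname lookup of this kind can be sketched as a comparison of repeat counts at a handful of Y-chromosome markers. The records, surnames and repeat counts below are invented, and the matching rule (an optional allowance for a few mismatched markers) is an assumption, not a description of how any particular genealogy site works.

```python
from collections import Counter

# Hypothetical genealogy records: a surname plus repeat counts at Y-STR markers.
RECORDS = [
    {"surname": "Example", "markers": {"DYS19": 14, "DYS390": 24, "DYS391": 11}},
    {"surname": "Example", "markers": {"DYS19": 14, "DYS390": 24, "DYS391": 10}},
    {"surname": "Sample", "markers": {"DYS19": 15, "DYS390": 23, "DYS391": 10}},
]


def candidate_surnames(query, records, max_mismatches=0):
    """Rank surnames by how many records match the query profile."""
    hits = Counter()
    for record in records:
        # Compare only the markers present in both profiles.
        shared = query.keys() & record["markers"].keys()
        if not shared:
            continue
        mismatches = sum(query[m] != record["markers"][m] for m in shared)
        if mismatches <= max_mismatches:
            hits[record["surname"]] += 1
    return hits.most_common()


query_profile = {"DYS19": 14, "DYS390": 24, "DYS391": 11}
print(candidate_surnames(query_profile, RECORDS))
print(candidate_surnames(query_profile, RECORDS, max_mismatches=1))
```

Allowing a mismatch or two reflects the fact that repeat counts can change slightly between generations, so distant relatives may not match exactly.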
So, Erlich asked, could he take a man's entire DNA sequence, pick out the short tandem repeats on his Y chromosome, search a genealogy site, discover the man's surname and then fully identify the man?
He tested it with the genome of Craig Venter, a DNA sequencing pioneer who posted his own DNA sequence on the Web. He knew Venter's age and the state where he lives. Two men popped up in the database. One was Craig Venter.
"Out of 300 million people in the United States, we got it down to two people," Erlich said.
He and his colleagues calculated they would be able to identify, from just their DNA sequences, the last names of about 12 percent of middle-class and wealthier white men, the population that tends to submit DNA data to genealogical sites. By combining the men's last names with their ages and the states where they lived, the researchers should be able to narrow their search to a few likely individuals.
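The narrowing step amounts to a simple filter over whatever public records are searchable. The sketch below assumes a hypothetical list of records with a surname, birth year and state; every name and value in it is made up, and the one-year tolerance on birth year is an assumption to account for an age being known only to the nearest year.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PublicRecord:
    name: str
    surname: str
    birth_year: int
    state: str


def narrow(records, surname, age, state, as_of_year, tolerance=1):
    """Keep records that match the surname, the state and the approximate birth year."""
    approx_birth_year = as_of_year - age
    return [
        r for r in records
        if r.surname == surname
        and r.state == state
        and abs(r.birth_year - approx_birth_year) <= tolerance
    ]


# Invented records for illustration only.
records = [
    PublicRecord("Person A", "Example", 1950, "UT"),
    PublicRecord("Person B", "Example", 1951, "UT"),
    PublicRecord("Person C", "Example", 1980, "UT"),
    PublicRecord("Person D", "Example", 1950, "CA"),
    PublicRecord("Person E", "Sample", 1950, "UT"),
]

matches = narrow(records, surname="Example", age=62, state="UT", as_of_year=2012)
print([r.name for r in matches])  # prints ['Person A', 'Person B']
```

Each attribute on its own is shared by many people; it is the combination of all three that leaves only a few candidates.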
On the Web and publicly available are DNA sequences from subjects in an international collaboration, the 1000 Genomes Project. People's ages were included and all the Americans lived in Utah.
Erlich began with one man. He got the Y chromosome's short tandem repeats and went to genealogy databases and searched for men with those same repeats. He got surnames of the paternal and maternal grandfather. Then he did a Google search for those people and found an obituary. That gave him the family tree.
"Oh, my God, we really did this," Erlich said.
Amy L. McGuire, a lawyer and ethicist at Baylor College of Medicine, said that to have the illusion you can fully protect privacy or make data anonymous is no longer a sustainable position.