New AI Tool Searches Genetic Haystacks to Find Disease-Causing Variants

EurekAlert! photo by Marcelo Santana
The PrimateAI-3D algorithm is trained on genomes from 233 primate species, including the Humboldt’s squirrel monkey, like the ones seen here in Mamirauá, Brazil.

Scientists have developed a way to sift through millions of differences in a person’s genetic blueprint to detect those that threaten our health, and have tested the new tool on a biomedical database of more than 450,000 people in the United Kingdom, according to a series of papers published Thursday in the journal Science.

The research marks a critical step toward harnessing the full power of the genome for medicine, and it demonstrates a new way artificial intelligence can be applied to problems in human health, experts said.

One problem that has frustrated doctors for years stems from the fact that although we are 99.6 percent similar at the level of DNA, each of us has an average of 4 million variants, sections of the genetic code where we differ from one another.

“It has been extremely difficult to determine which ones cause disease and which ones don’t,” said Kyle Farh, vice president of artificial intelligence at the San Diego-based biotechnology company Illumina.

Farh and an international team of almost 100 researchers created an algorithm designed to help medicine clear up some of the uncertainty. “We are aiming to eliminate variants of unknown significance, which is the main barrier to unlocking the value of genomic medicine,” Farh said.

Just as ChatGPT can learn how to predict human speech by having engineers feed it a wealth of text, the new algorithm has been trained to make medical predictions based on reading genomes.

The scientists built the algorithm, called PrimateAI-3D, using the genetic blueprints of 233 different primate species. This base brings into sharp relief the variants that can be tolerated by primates, including humans, and those that prove deadly. Scientists look for places where the sequence is the same from one primate to another, a clear sign that any change is disastrous.
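The idea of looking for places where the sequence is identical across primates can be illustrated with a toy example. The sketch below is not the PrimateAI-3D method, which the article does not describe in detail; it only shows the general notion of a conserved position. The function name `conserved_positions` and the three short sequences are invented for illustration and are not real genomic data.

```python
def conserved_positions(aligned):
    """Return 0-based column indices where every aligned sequence has the same letter."""
    if not aligned:
        return []
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must already be aligned to equal length")
    # A column is conserved when the set of letters seen in it has exactly one member.
    return [i for i in range(length) if len({seq[i] for seq in aligned}) == 1]


# Hypothetical, made-up sequences (not real genes), one per species.
toy_alignment = {
    "human": "ATGGCCAAGT",
    "chimpanzee": "ATGGCTAAGT",
    "squirrel_monkey": "ATGGCAAAGT",
}

conserved = conserved_positions(list(toy_alignment.values()))
variable = [i for i in range(10) if i not in conserved]
print("conserved columns:", conserved)
print("variable columns:", variable)
```

In this toy alignment only column 5 differs between species. By the article's reasoning, a human variant at one of the other columns would be the more suspicious kind, while a change at column 5 appears to be tolerated.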

“It’s a brilliant idea. As soon as I read the paper, I sent it to my team and said, ‘We’ve got to get on this,’” said Stephen Kingsmore, president and CEO of Rady Children’s Institute for Genomic Medicine, a facility based in San Diego that decodes the genomes of 1,000 families a year for 90 hospitals across the United States.

Kingsmore said that in about one-quarter of cases, doctors sequence a patient’s genome only to find a variant with an unknown impact on health.

“We’re doing them a great disservice,” he said. “Parents kind of throw up their hands and say, ‘Does the child have a disease or not?’ and we can only say, ‘Maybe.’”

Until now, hospitals examining genetic variants in their patients have often consulted a large archive called ClinVar. The new PrimateAI-3D algorithm scans about 70 million genetic variants, a selection that is more than 1,000 times as large as ClinVar, Farh said.

The 3D in the name refers to the three-dimensional structure of proteins, a key factor in distinguishing which mutations will wreak havoc. Many diseases are caused by mutations that harm a protein or cause the body to make too much or too little of it.

It remains unclear how much of a difference the algorithm will make in the course of day-to-day medicine, “but they do show it outperforms anything we have currently,” said Bruce Gelb, director of the Mindich Child Health and Development Institute at Icahn School of Medicine at Mount Sinai.

Gelb, who was not part of the study team, said he had seen a previous version of the algorithm described in Nature Genetics in 2018. The earlier version was based on just six species of nonhuman primates, as opposed to the 233 primate species in the new version. “That’s a very large increase, and gives it much more statistical power to find things,” Gelb said.

Matthew Lebo, who directs the Laboratory for Molecular Medicine at Mass General Brigham, said that PrimateAI-3D won’t eliminate the problem of finding variants of unknown significance, but it will help doctors to prioritize the variants they are investigating for a specific disease.

The new tool should also help pharmaceutical companies in their search for new drugs. Clinical trials often fail because the gene scientists are targeting is “incorrect, and not relevant to disease,” Farh said. “Using AI and genomics to select the right targets should significantly reduce the rate of late-stage clinical trial failures.”

Illumina said it will make the new tool broadly available in future releases of its software products.

By testing the new algorithm on hundreds of thousands of patient genomes in UK Biobank, “we found that 97 percent of the general population carries a rare variant” that has some kind of significant effect on health, Farh said. Although the algorithm cannot account for the influence of diet and environmental factors, he explained, “we can basically predict people’s levels of cholesterol and glucose, and hence their risks for cardiovascular disease or diabetes, from the genome by predicting the effects of these variants.”

Kingsmore said that genome science “has been forcing medicine into artificial intelligence” for years because of the sheer size of our genetic blueprint. A genome is a long code written in four letters: A, T, G, C. Each letter stands for one of the four chemical bases from which our DNA is built: adenine, thymine, guanine and cytosine. One full genome is like a ladder containing roughly 3 billion steps, with a pair of letters at each one.
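The ladder image can be made concrete with a few lines of code. The sketch below assumes the standard base pairing (A with T, G with C) and a simple encoding of two bits per letter, which is enough to tell four letters apart. The names `rungs` and `min_bytes` are invented for illustration, and the byte count is only a lower bound for storing one strand's letters, not the size of the raw output from a sequencing machine.

```python
# Standard base pairing: each letter on one side of the ladder fixes the letter opposite it.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def rungs(strand):
    """Return the ladder steps for one strand as (letter, partner) pairs."""
    return [(base, PAIR[base]) for base in strand]


def min_bytes(n_rungs, bits_per_letter=2):
    """Smallest whole number of bytes needed to store one letter per rung."""
    # Round up so a partial final byte is still counted.
    return (n_rungs * bits_per_letter + 7) // 8


print(rungs("GATC"))
print(min_bytes(3_000_000_000), "bytes for roughly 3 billion rungs")
```

Because each rung's second letter is determined by the first, only one letter per step needs to be stored. At two bits per letter, 3 billion steps come to about 750 million bytes.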

The National Institutes of Health estimates that genome sequencing is now generating up to 40 billion gigabytes of data each year, the equivalent of roughly 10 million full genomes.

“The reason artificial intelligence is such a good fit,” Kingsmore said, “is that the medical workforce is so ill-prepared” to pull answers from such an ocean of data.