DNA carries the information needed to sustain life, and working out how that information is organized has long been a major scientific challenge. GROVER, a language model developed at BIOTEC, reads DNA like text and could advance genomics and personalized medicine.
DNA holds the essential information required to sustain life. Deciphering how this information is stored and organized has been one of the greatest scientific challenges of the past century. Now, with GROVER, a new large language model trained on human DNA, researchers can attempt to decode the intricate information concealed within our genome. Developed by a team at the Biotechnology Center (BIOTEC) of Dresden University of Technology, GROVER treats human DNA as text, learning its rules and context to extract functional information about DNA sequences. Described in a paper published in Nature Machine Intelligence, the tool has the potential to revolutionize genomics and accelerate personalized medicine.
Since the discovery of the double helix, scientists have sought to understand the information encoded in DNA. Seventy years later, it is clear that this information is multilayered. Only 1-2% of the genome consists of genes, the sequences that code for proteins.
“DNA has many functions beyond coding for proteins. Some sequences regulate genes, others serve structural purposes, and most sequences serve multiple functions at once. Currently, we don’t understand the meaning of most of the DNA. When it comes to understanding the non-coding regions of the DNA, it seems that we have only started to scratch the surface. This is where AI and large language models can help,” says Dr. Anna Poetsch, research group leader at the BIOTEC.
DNA as a Language
Large language models, like GPT, have transformed our understanding of language. Trained exclusively on text, these models developed the ability to use language in many contexts.
“DNA is the code of life. Why not treat it like a language?” says Dr. Poetsch. The Poetsch team trained a large language model on a reference human genome. The resulting tool, named GROVER, or “Genome Rules Obtained via Extracted Representations”, can be used to extract biological meaning from DNA.
“GROVER learned the rules of DNA. In terms of language, we are talking about grammar, syntax, and semantics. For DNA this means learning the rules governing the sequences, the order of the nucleotides and sequences, and the meaning of the sequences. Like GPT models learning human languages, GROVER has basically learned how to ‘speak’ DNA,” explains Dr. Melissa Sanabria, the researcher behind the project.
The team showed that GROVER can not only accurately predict the DNA sequence that follows a given stretch but can also be used to extract contextual information that has biological meaning, e.g., to identify gene promoters or protein binding sites on DNA. GROVER also learns processes that are generally considered to be “epigenetic”, i.e., regulatory processes that happen on top of the DNA rather than being encoded in it.
“It is fascinating that by training GROVER with only the DNA sequence, without any annotations of functions, we are actually able to extract information on biological function. To us, it shows that the function, including some of the epigenetic information, is also encoded in the sequence,” says Dr. Sanabria.
The DNA Dictionary
“DNA resembles language. It has four letters that build sequences and the sequences carry a meaning. However, unlike a language, DNA has no defined words,” says Dr. Poetsch. DNA is written in four letters (A, T, G, and C), but it has no predefined words of varying length that combine to build genes or other meaningful sequences.
To train GROVER, the team first had to create a DNA dictionary. To do so, they borrowed a trick from compression algorithms. “This step is crucial and sets our DNA language model apart from the previous attempts,” says Dr. Poetsch.
“We analyzed the whole genome and looked for combinations of letters that occur most often. We started with two letters and went over the DNA, again and again, to build it up to the most common multi-letter combinations. In this way, in about 600 cycles, we have fragmented the DNA into ‘words’ that let GROVER perform the best when it comes to predicting the next sequence,” explains Dr. Sanabria.
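The procedure Dr. Sanabria describes resembles byte-pair encoding, a technique that originated in data compression: start from single letters, find the most frequent adjacent pair, merge it into a new “word”, and repeat. The Python sketch below illustrates that general idea on a toy sequence. It is not the team's code; the function names (`most_frequent_pair`, `merge_pair`, `build_vocabulary`), the toy input, and the cycle count are assumptions chosen for illustration.

```python
from collections import Counter


def most_frequent_pair(tokens):
    """Return the most common adjacent token pair, or None if no pair exists."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return None
    return pairs.most_common(1)[0][0]


def merge_pair(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def build_vocabulary(sequence, cycles):
    """Run `cycles` merge rounds (hypothetical helper, for illustration only)."""
    tokens = list(sequence)            # start with single letters
    vocabulary = sorted(set(tokens))   # e.g. A, C, G, T
    for _ in range(cycles):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break                      # nothing left to merge
        tokens = merge_pair(tokens, pair)
        word = pair[0] + pair[1]
        if word not in vocabulary:
            vocabulary.append(word)
    return vocabulary, tokens


if __name__ == "__main__":
    vocab, tokens = build_vocabulary("ATATATGC", cycles=2)
    print(vocab)   # ['A', 'C', 'G', 'T', 'AT', 'ATAT']
    print(tokens)  # ['ATAT', 'AT', 'G', 'C']
```

In this toy run, two cycles turn the eight-letter sequence into four “words”. The article reports that about 600 such cycles over the whole genome gave GROVER the vocabulary with which it predicted the next sequence best.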
The Promise of AI in Genomics
GROVER promises to unlock the different layers of genetic code. DNA holds key information on what makes us human, our disease predispositions, and our responses to treatments.
“We believe that understanding the rules of DNA through a language model is going to help us uncover the depths of biological meaning hidden in the DNA, advancing both genomics and personalized medicine,” says Dr. Poetsch.
Reference: “DNA language model GROVER learns sequence context in the human genome” by Melissa Sanabria, Jonas Hirsch, Pierre M. Joubert and Anna R. Poetsch, 23 July 2024, Nature Machine Intelligence.
DOI: 10.1038/s42256-024-00872-0
