By harnessing advanced AI, MethylGPT decodes DNA methylation with unprecedented accuracy, offering new paths for age prediction, disease diagnosis, and personalized health interventions.
In a recent study posted to the bioRxiv preprint server, researchers developed a transformer-based foundation model, MethylGPT, for the DNA methylome.
DNA methylation is a type of epigenetic modification that regulates gene expression via methyl-binding proteins and changes in chromatin accessibility. It also helps maintain genomic stability through transposable element repression. DNA methylation has features of an ideal biomarker, and studies have revealed distinct methylation signatures across pathological states, allowing for molecular diagnostics.
Nevertheless, several analytic challenges impede the implementation of diagnostics based on DNA methylation. Current approaches rely on simple statistical and linear models, which have limited capacity to capture complex, non-linear relationships. They also fail to account for context-specific effects such as higher-order interactions and regulatory networks. Therefore, a unified analytical framework that can model complex, non-linear patterns across tissue and cell types is needed.
Recent advances in foundation models and transformer architectures have revolutionized analyses of complex biological sequences. Foundation models have also been introduced for various omics layers, such as AlphaFold3 and ESM-3 for proteomics and Evo and Enformer for genomics. The achievements of the foundation models suggest that DNA methylation analyses could be transformed with a similar approach.
The study and findings
In the present study, researchers developed MethylGPT, a transformer-based foundation model for the DNA methylome. First, they acquired data on 226,555 human DNA methylation profiles spanning multiple tissue types from the EWAS Data Hub and Clockbase. Following deduplication and quality control, 154,063 samples were retained for pretraining. The model focused on 49,156 CpG sites, selected for their known associations with various traits in order to maximize biological relevance.
The model was pre-trained using two complementary loss functions: masked language modeling (MLM) loss and profile reconstruction loss, enabling it to accurately predict methylation at masked CpG sites. The model achieved a mean squared error (MSE) of 0.014 and a Pearson correlation of 0.929 between predicted and actual methylation levels, indicating high predictive accuracy. Researchers also evaluated whether the model could capture biologically relevant features of DNA methylation. To this end, they analyzed the learned representations of CpG sites in the embedding space.
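The two reported metrics, MSE and Pearson correlation at masked CpG sites, can be illustrated with a minimal sketch. This is not the authors' code: the methylation values are synthetic, the 30% masking ratio is an assumption chosen for illustration, and the "predictions" are simply the true values plus noise, standing in for a trained model.

```python
import math
import random


def masked_mse(true_vals, pred_vals, mask):
    """Mean squared error computed over masked positions only."""
    errs = [(t - p) ** 2 for t, p, m in zip(true_vals, pred_vals, mask) if m]
    return sum(errs) / len(errs)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


rng = random.Random(0)

# Synthetic methylation (beta) values in [0, 1]; NOT real data.
beta = [rng.random() for _ in range(1000)]

# Assumed masking ratio of 30% (hypothetical, for illustration only).
mask = [rng.random() < 0.3 for _ in beta]

# Stand-in for model output: truth plus small noise, clipped to [0, 1].
pred = [min(1.0, max(0.0, b + rng.gauss(0, 0.05))) for b in beta]

# Evaluate only where values were hidden from the (hypothetical) model.
masked_true = [b for b, m in zip(beta, mask) if m]
masked_pred = [p for p, m in zip(pred, mask) if m]

print("MSE at masked sites:", masked_mse(beta, pred, mask))
print("Pearson r at masked sites:", pearson(masked_true, masked_pred))
```

Scoring only the masked positions matters because the model sees the unmasked values as input, so including them would overstate accuracy.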
They found that CpG sites clustered based on their genomic contexts, suggesting that the model learned the regulatory features of the methylome. In addition, there was a clear separation between autosomes and sex chromosomes, indicating that MethylGPT also captured higher-order chromosomal features. Next, the team analyzed zero-shot embedding spaces. This showed a clear biological organization, clustering by sex, tissue type, and genomic context.
Major tissue types formed well-defined clusters, indicating that the model learned methylation patterns specific to tissues without explicit supervision. Notably, these clusters were not driven by batch effects, which often confound results in complex datasets. In addition, female and male samples demonstrated consistent separation, reflecting sex-specific differences. Next, the researchers assessed the ability of MethylGPT to predict chronological age from methylation patterns. To this end, they used a dataset of over 11,400 samples from diverse tissue types.
Fine-tuning for age prediction led to robust age-dependent clustering. Notably, intrinsic age-related organization was evident even before fine-tuning. Moreover, MethylGPT outperformed existing age prediction methods (e.g., Horvath’s clock and ElasticNet), with a median absolute error of 4.45 years. MethylGPT was also remarkably resilient to missing data. It exhibited stable performance with up to 70% missing data, outperforming multi-layer perceptron and ElasticNet approaches.
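As a rough illustration of how such an evaluation could be set up, the sketch below computes a median absolute error and simulates missing data by blanking out a fraction of a profile. The ages, predictions, and profile values are made up, and `drop_values` is a hypothetical helper, not part of MethylGPT.

```python
import random
from statistics import median


def median_absolute_error(true_ages, predicted_ages):
    """Median of the absolute differences between true and predicted ages."""
    return median(abs(t - p) for t, p in zip(true_ages, predicted_ages))


def drop_values(profile, fraction, rng):
    """Return a copy of `profile` with `fraction` of entries set to None."""
    n_drop = int(round(fraction * len(profile)))
    drop_idx = set(rng.sample(range(len(profile)), n_drop))
    return [None if i in drop_idx else v for i, v in enumerate(profile)]


# Made-up ages (years) and predictions, for illustration only.
true_ages = [30, 45, 60, 72, 18]
predicted_ages = [33, 41, 66, 70, 19]
print(median_absolute_error(true_ages, predicted_ages))  # -> 3

# Simulate 70% missing CpG values in a synthetic 10-site profile.
rng = random.Random(1)
profile = [0.12, 0.80, 0.45, 0.33, 0.91, 0.07, 0.58, 0.64, 0.29, 0.76]
sparse_profile = drop_values(profile, 0.7, rng)
print(sum(v is None for v in sparse_profile))  # -> 7
```

Repeating the error calculation on predictions made from increasingly sparse profiles would give a degradation curve of the kind used to compare models under missing data.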
Analysis of methylation profiles during induced pluripotent stem cell (iPSC) reprogramming showed a clear rejuvenation trajectory; samples progressively transitioned to a younger methylation state over the course of reprogramming. The model was also able to identify the point during reprogramming (day 20) when cells began showing clear signs of epigenetic age reversal. Finally, the model’s ability to predict disease risk was assessed. The pre-trained model was fine-tuned to predict the risk of 60 diseases and mortality. The model achieved an area under the curve of 0.74 and 0.72 on validation and test sets, respectively.
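For readers unfamiliar with the metric, the area under the ROC curve equals the probability that a randomly chosen case receives a higher risk score than a randomly chosen non-case. The sketch below computes it directly from that definition; the labels and scores are invented for illustration and have no connection to the study's data.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Invented labels (1 = disease occurred) and predicted risk scores.
labels = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.4, 0.6]
print(round(roc_auc(labels, scores), 3))  # -> 0.667
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives context for the reported values of 0.74 and 0.72.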
In addition, they used this disease risk prediction framework to evaluate the impact of eight interventions on predicted disease incidence. Interventions included smoking cessation, high-intensity training, and the Mediterranean diet, among others. The predicted effects were distinct for each intervention and varied across disease categories, highlighting the potential of MethylGPT in predicting intervention-specific outcomes and optimizing tailored intervention strategies.
Conclusions
The findings illustrate that transformer architectures can effectively model DNA methylation patterns while preserving biological relevance. The organization of CpG sites based on regulatory features and genomic context suggests that the model captured fundamental aspects of methylation biology without explicit supervision. MethylGPT also demonstrated superior performance in age prediction across different tissues. Moreover, its robust performance with up to 70% missing data underscores its potential utility in clinical and research applications.