# Disease-Specific Prediction of Missense Variant Pathogenicity with DNA Language Models and Graph Neural Networks

**Authors:** Mohamed Ghadie, Sameer Sardaar, Yannis Trakadis

PMC · DOI: 10.3390/bioengineering12101098 · Bioengineering · 2025-10-13

## TL;DR

This paper introduces a method for predicting whether a missense variant is pathogenic for a specific disease, combining DNA language model embeddings with a graph neural network trained on a biomedical knowledge graph.

## Contribution

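The approach treats disease-specific pathogenicity as edge prediction between variant and disease nodes in a knowledge graph. It combines BioBERT embeddings of node features, DNA language model embeddings of variant sequence, and a graph convolutional network, which allows disease-specific domain knowledge to inform each prediction.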

## Key findings

- Across the model versions compared, prediction-balanced accuracy reached as high as 85.6% for disease-specific classification of ClinVar missense variants.
- The same result came with a sensitivity of 90.5% and a negative predictive value (NPV) of 89.8%.
- The knowledge graph spans 11 types of interconnected biomedical entities at biomolecular and clinical levels, which lets a variant be interpreted in the context of a specific disease.
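To make the reported metrics concrete, the following minimal sketch computes sensitivity, NPV, and the standard balanced accuracy from binary labels. This summary does not define "prediction-balanced accuracy", so the textbook balanced accuracy (mean of sensitivity and specificity) is used here as an assumption and may differ from the paper's definition. The function name `binary_metrics` and the toy labels are hypothetical.

```python
def binary_metrics(y_true, y_pred):
    """Compute confusion-matrix metrics for binary labels (1 = pathogenic)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Return NaN rather than raising when a denominator is empty.
        return num / den if den else float("nan")

    sensitivity = ratio(tp, tp + fn)  # share of pathogenic variants caught
    specificity = ratio(tn, tn + fp)  # share of non-pathogenic variants cleared
    npv = ratio(tn, tn + fn)          # how trustworthy a negative call is
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "npv": npv,
        # Assumption: standard balanced accuracy, not necessarily the paper's metric.
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Toy labels for illustration only; these are not data from the paper.
toy_true = [1, 1, 1, 1, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 0, 0, 0, 1]
toy_metrics = binary_metrics(toy_true, toy_pred)
```

High sensitivity together with high NPV means that few pathogenic variants are missed and that a negative call is rarely wrong.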

## Abstract

Accurate prediction of the impact of genetic variants on human health is of paramount importance to clinical genetics and precision medicine. Recent machine learning (ML) studies have tried to predict variant pathogenicity with different levels of success. However, most missense variants identified on a clinical basis are still classified as variants of uncertain significance (VUS). Our approach allows for the interpretation of a variant for a specific disease and, thus, for the integration of disease-specific domain knowledge. We utilize a comprehensive knowledge graph, with 11 types of interconnected biomedical entities at diverse biomolecular and clinical levels, to classify missense variants from ClinVar. We use BioBERT to generate embeddings of biomedical features for each node in the graph, as well as DNA language models to embed variant features directly from genomic sequence. Next, we train a two-stage architecture consisting of a graph convolutional neural network to encode biological relationships. A neural network is then used as the classifier to predict disease-specific pathogenicity of variants, essentially predicting edges between variant and disease nodes. We compare performance across different versions of our model, obtaining prediction-balanced accuracies as high as 85.6% (sensitivity: 90.5%; NPV: 89.8%) and discuss how our work can inform future studies in this area.
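The two-stage idea in the abstract (a graph convolution that encodes relationships, followed by a classifier that scores a variant-disease edge) can be sketched in a few lines of plain Python. This is an illustrative forward pass under stated assumptions, not the authors' implementation: the symmetric adjacency normalization, the single layer, the logistic scorer, the four-node toy graph, and the random features standing in for BioBERT and DNA language model embeddings are all hypothetical, and the weights are untrained.

```python
import math
import random


def normalize_adjacency(n, edges):
    """Return D^-1/2 (A + I) D^-1/2 as a dense matrix (assumed normalization)."""
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for u, v in edges:
        a[u][v] = a[v][u] = 1.0  # undirected edge
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(x, y):
    """Dense matrix product on lists of lists."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*y)]
            for row in x]


def gcn_layer(a_hat, h, w):
    """Stage 1: one graph convolution, ReLU(A_hat @ H @ W)."""
    z = matmul(matmul(a_hat, h), w)
    return [[max(0.0, v) for v in row] for row in z]


def edge_score(h_variant, h_disease, w_out, bias=0.0):
    """Stage 2: logistic classifier on concatenated node embeddings."""
    feats = h_variant + h_disease
    logit = sum(f * w for f, w in zip(feats, w_out)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


rng = random.Random(0)
# Hypothetical toy graph: 0 = variant, 1 = gene, 2 = disease, 3 = phenotype.
edges = [(0, 1), (1, 2), (2, 3)]
n_nodes, in_dim, hid_dim = 4, 3, 2
# Random stand-ins for the pretrained node embeddings.
features = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(n_nodes)]
w1 = [[rng.uniform(-1, 1) for _ in range(hid_dim)] for _ in range(in_dim)]
w_out = [rng.uniform(-1, 1) for _ in range(2 * hid_dim)]

a_hat = normalize_adjacency(n_nodes, edges)
hidden = gcn_layer(a_hat, features, w1)
# Probability-like score that the variant (node 0) is pathogenic for the disease (node 2).
score = edge_score(hidden[0], hidden[2], w_out)
```

Scoring a (variant, disease) pair rather than a variant alone is what makes the prediction disease-specific: the same variant node can receive different scores against different disease nodes.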

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12562010/full.md

## Figures

1 figure with captions in the complete paper: https://tomesphere.com/paper/PMC12562010/full.md

## References

30 references — full list in the complete paper: https://tomesphere.com/paper/PMC12562010/full.md

---
Source: https://tomesphere.com/paper/PMC12562010