# Enhancing missense variant pathogenicity prediction with protein language models using VariPred

**Authors:** Weining Lin, Jude Wells, Zeyuan Wang, Christine Orengo, Andrew C. R. Martin

PMC · DOI: 10.1038/s41598-024-51489-7 · Scientific Reports · 2024-04-07

## TL;DR

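This paper introduces VariPred, a method that uses a pre-trained protein language model (ESM-1b) to predict whether missense variants are pathogenic. It requires only the protein sequence as input and matches or outperforms existing tools on six benchmarks.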

## Contribution

The novel framework VariPred uses pre-trained protein language models for variant pathogenicity prediction without structural features or alignments.

## Key findings

- VariPred performs as well as, or in most cases better than, 3Cnet, PolyPhen-2, REVEL, MetaLR, FATHMM and ESM variant.
- The model uses only protein sequence data, with no structural features, multiple sequence alignments, or data pre-processing.
- The comparison covers six variant impact prediction benchmarks.

## Abstract

Computational approaches for predicting the pathogenicity of genetic variants have advanced in recent years. These methods enable researchers to determine the possible clinical impact of rare and novel variants. Historically, these prediction methods used hand-crafted features based on structural, evolutionary, or physicochemical properties of the variant. In this study, we propose a novel framework that leverages the power of pre-trained protein language models to predict variant pathogenicity. We show that our approach, VariPred (Variant impact Predictor), outperforms current state-of-the-art methods by using an end-to-end model that only requires the protein sequence as input. Using one of the best-performing protein language models (ESM-1b), we establish a robust classifier that requires no calculation of structural features or multiple sequence alignments. We compare the performance of VariPred with other representative models including 3Cnet, PolyPhen-2, REVEL, MetaLR, FATHMM and ESM variant. VariPred performs as well as, or in most cases better than, these other predictors on six variant impact prediction benchmarks, despite requiring only sequence data and no pre-processing of the data.
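
Because the abstract says the model needs only the protein sequence as input, the input-preparation step can be illustrated in a few lines. The sketch below is not the authors' code. It assumes a variant written in three-letter protein notation (such as `p.Gly56Ser`) and a wild-type sequence string, and it builds the mutant sequence that a sequence-only predictor could consume alongside the wild type. The function names `parse_missense` and `apply_missense` are hypothetical, chosen here for illustration.

```python
import re

# Standard three-letter to one-letter amino acid codes (20 canonical residues).
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

# Matches simple missense notation only, e.g. "p.Gly56Ser".
_MISSENSE = re.compile(r"^p\.([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})$")


def parse_missense(hgvs_p):
    """Return (ref, position, alt) with one-letter codes and a 1-based position."""
    match = _MISSENSE.match(hgvs_p)
    if match is None:
        raise ValueError(f"not a simple missense variant: {hgvs_p!r}")
    ref3, pos, alt3 = match.groups()
    try:
        ref, alt = THREE_TO_ONE[ref3], THREE_TO_ONE[alt3]
    except KeyError as exc:
        raise ValueError(f"unknown residue code in {hgvs_p!r}") from exc
    return ref, int(pos), alt


def apply_missense(wild_type, hgvs_p):
    """Build the mutant sequence, checking that the reference residue matches."""
    ref, pos, alt = parse_missense(hgvs_p)
    if not 1 <= pos <= len(wild_type):
        raise ValueError(f"position {pos} is outside the sequence")
    if wild_type[pos - 1] != ref:
        raise ValueError(
            f"expected {ref} at position {pos}, found {wild_type[pos - 1]}"
        )
    return wild_type[:pos - 1] + alt + wild_type[pos:]


if __name__ == "__main__":
    # Toy sequence (an assumption for illustration): glycine at position 56.
    toy_wild_type = "A" * 55 + "G" + "A" * 10
    toy_mutant = apply_missense(toy_wild_type, "p.Gly56Ser")
    print(toy_wild_type[50:60], "->", toy_mutant[50:60])
```

Checking the reference residue before substitution catches mismatches between the variant annotation and the chosen sequence (for example, a different isoform) before anything is passed to a model.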

## Full-text entities

- **Chemicals:** amino acid (MESH:D000596), Proline (MESH:D011392), Glycine (MESH:D005998)
- **Species:** Homo sapiens (human, species) [taxon 9606]
- **Mutations:** p.Gly56Ser

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC10999449/full.md

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/PMC10999449/full.md

## References

36 references — full list in the complete paper: https://tomesphere.com/paper/PMC10999449/full.md

---
Source: https://tomesphere.com/paper/PMC10999449