Unsupervised language models for disease variant prediction
Allan Zhou, Nicholas C. Landolfi, Daniel C. O'Neill

TL;DR
This paper introduces VELM, an unsupervised method that combines pretrained protein language models with the evolutionary principle that variant likelihood is a proxy for fitness, predicting the pathogenicity of protein variants without gene-specific training.
Contribution
The study demonstrates that a single pretrained protein language model can accurately predict variant pathogenicity across genes without MSAs or fine-tuning, outperforming existing methods.
Findings
VELM achieves state-of-the-art performance on clinical variant datasets.
It operates in a zero-shot manner, without gene-specific training.
The approach simplifies variant scoring by leveraging broad sequence data.
Abstract
There is considerable interest in predicting the pathogenicity of protein variants in human genes. Due to the sparsity of high quality labels, recent approaches turn to unsupervised learning, using Multiple Sequence Alignments (MSAs) to train generative models of natural sequence variation within each gene. These generative models then predict variant likelihood as a proxy to evolutionary fitness. In this work we instead combine this evolutionary principle with pretrained protein language models (LMs), which have already shown promising results in predicting protein structure and function. Instead of training separate models per-gene, we find that a single protein LM trained on broad sequence datasets can score pathogenicity for any gene variant zero-shot, without MSAs or finetuning. We call this unsupervised approach VELM (Variant Effect via Language Models), and show…
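To make the idea of scoring a variant by its likelihood concrete, here is a minimal sketch. The abstract as excerpted here does not state VELM's exact scoring rule, so the log-likelihood ratio below is an assumption (a common heuristic for zero-shot variant scoring), not the authors' method. The names `PositionModel`, `toy_model`, and `variant_score` are hypothetical, and `toy_model` is a stand-in that only counts residues in the surrounding sequence; a real pretrained protein LM would replace it.

```python
import math
from typing import Callable, Dict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical interface (an assumption, not from the paper): given a sequence
# and a 0-based position, return a probability for each amino acid at that
# position, conditioned on the rest of the sequence.
PositionModel = Callable[[str, int], Dict[str, float]]


def toy_model(sequence: str, position: int) -> Dict[str, float]:
    """Stand-in for a pretrained protein LM. NOT a real model.

    Uses add-one smoothed residue counts from the context, ignoring the
    residue at `position` itself (loosely mimicking a masked prediction).
    """
    context = sequence[:position] + sequence[position + 1:]
    counts = {aa: 1.0 for aa in AMINO_ACIDS}
    for aa in context:
        if aa in counts:  # skip non-standard residue symbols
            counts[aa] += 1.0
    total = sum(counts.values())
    return {aa: count / total for aa, count in counts.items()}


def variant_score(model: PositionModel, sequence: str,
                  position: int, mutant: str) -> float:
    """Log-likelihood ratio of the mutant vs. the wild-type residue.

    Under the evolutionary principle described in the abstract, a lower
    score (mutant less likely than wild type) would suggest a variant is
    more likely to be deleterious.
    """
    wild_type = sequence[position]
    probs = model(sequence, position)
    return math.log(probs[mutant]) - math.log(probs[wild_type])


if __name__ == "__main__":
    # Made-up sequence for illustration only.
    seq = "MKKLLA"
    print(variant_score(toy_model, seq, 0, "K"))
```

Because the model is passed in as a plain callable, the same `variant_score` function could be applied to any gene without per-gene training, which mirrors the zero-shot setting the abstract describes.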
Taxonomy
Topics: Genomics and Rare Diseases · Biomedical Text Mining and Ontologies · Machine Learning in Bioinformatics
