ViroGym: Realistic Large-Scale Benchmarks for Evaluating Viral Proteins
Yichen Zhou, Jonathan Golob, Amir Karimi, Stefan Bauer, Patrick Schwab

TL;DR
ViroGym introduces a large-scale benchmark for evaluating protein language models on viral proteins, assessing their ability to predict mutations, antigenic diversity, and pandemic emergence.
Contribution
This work presents ViroGym, a comprehensive benchmark with diverse tasks and datasets for systematically evaluating protein language models (pLMs) on viral proteins.
Findings
The ProGen2 family consistently achieves the strongest performance across all three tasks.
DMS and neutralisation benchmarks predict real-world mutation emergence.
Complementary in vitro benchmarks capture evolutionary constraints.
Abstract
Protein language models (pLMs) have shown strong potential for zero-shot prediction of missense variant effects, yet systematic benchmarking on viral proteins remains limited, a critical gap given the need for proactive tools that can anticipate emerging mutations ahead of experimental validation. Here we introduce ViroGym, a comprehensive benchmark evaluating pLMs across three tasks: 79 deep mutational scanning (DMS) assays covering eukaryotic viruses with 552,065 mutated sequences across 7 phenotypic readouts, 21 influenza neutralisation tasks, and a real-world pandemic prediction task for SARS-CoV-2. We benchmark well-established pLMs on fitness landscapes, antigenic diversity, and pandemic forecasting, and find that the ProGen2 family consistently achieves the strongest performance across all three tasks. Crucially, DMS and neutralisation performance reliably identifies models that…
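As an illustration of how zero-shot pLM scores might be compared against a DMS assay, the sketch below ranks mutants by a model score and by measured fitness, then computes a Spearman rank correlation. The abstract excerpt does not state which metric ViroGym uses, so Spearman correlation is an assumption here, and the names `model_scores` and `dms_fitness` and their values are hypothetical toy data, not results from the paper.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(vx * vy)


# Hypothetical toy data: one zero-shot score and one measured
# fitness value per mutated sequence (not taken from the paper).
model_scores = [-0.2, -3.1, -1.4, -0.9, -2.5]
dms_fitness = [0.95, 0.10, 0.40, 0.70, 0.35]

rho = spearman(model_scores, dms_fitness)
print(f"Spearman rho on toy assay: {rho:.3f}")
```

A rank-based measure is a natural fit for this kind of comparison because model scores and assay readouts are on different scales, so only the ordering of mutants is compared.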
Taxonomy
Topics: Genomics and Rare Diseases · Vaccines and Immunoinformatics Approaches · Influenza Virus Research Studies
