Virology Capabilities Test (VCT): A Multimodal Virology Q&A Benchmark

Jasper Götting; Pedro Medeiros; Jon G Sanders; Nathaniel Li; Long Phan; Karam Elabd; Lennart Justen; Dan Hendrycks; Seth Donoughe

arXiv:2504.16137 · cs.CY · April 30, 2025 · 2 cites

Open Access · 1 Repo

TL;DR

The paper introduces VCT, a challenging multimodal benchmark for troubleshooting virology laboratory protocols. The best-performing LLM, OpenAI's o3, outscores 94% of expert virologists on it, which raises dual-use and governance concerns.

Contribution

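The paper presents VCT, a benchmark of 322 multimodal questions built from the input of dozens of PhD-level expert virologists. It evaluates how well LLMs troubleshoot virology laboratory protocols and highlights both the benefits and the misuse risks of that capability.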

Findings

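01. LLMs outperform most expert virologists on VCT: the most performant model, OpenAI's o3, reaches 43.8% accuracy and outperforms 94% of experts, even within their sub-areas of specialization.

02. VCT is highly challenging even for experts: virologists with internet access average 22.1% on questions in their own sub-areas of expertise.

03. Publicly available models show dual-use risks: expert-level troubleshooting is useful for beneficial research but can also be misused.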

Abstract

We present the Virology Capabilities Test (VCT), a large language model (LLM) benchmark that measures the capability to troubleshoot complex virology laboratory protocols. Constructed from the inputs of dozens of PhD-level expert virologists, VCT consists of $322$ multimodal questions covering fundamental, tacit, and visual knowledge that is essential for practical work in virology laboratories. VCT is difficult: expert virologists with access to the internet score an average of $22.1\%$ on questions specifically in their sub-areas of expertise. However, the most performant LLM, OpenAI's o3, reaches $43.8\%$ accuracy, outperforming $94\%$ of expert virologists even within their sub-areas of specialization. The ability to provide expert-level virology troubleshooting is inherently dual-use: it is useful for beneficial research, but it can also be misused. Therefore, the fact that…

Code & Models

Repositories

lennijusten/biology-benchmarks

Taxonomy

Topics: Respiratory viral infections research · Animal Disease Management and Epidemiology