LLMs Outperform Experts on Challenging Biology Benchmarks
Lennart Justen

TL;DR
This paper evaluates 27 large language models on eight challenging biology benchmarks, showing dramatic performance improvements that now rival or surpass expert-level knowledge in several areas.
Contribution
It provides a comprehensive, systematic assessment of recent LLMs on advanced biology tasks, highlighting both their rapid progress and the limitations of current benchmarks.
Findings
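Models outperform experts on virology benchmarks: OpenAI's o3 performs twice as well as expert virologists on the text-only subset of the Virology Capabilities Test.
Top model performance increased more than 4-fold on the text-only subset of the Virology Capabilities Test over the study period.
Benchmarks show saturation and data errors, indicating a need for better evaluation methods.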
Abstract
This study systematically evaluates 27 frontier Large Language Models on eight biology benchmarks spanning molecular biology, genetics, cloning, virology, and biosecurity. Models from major AI developers released between November 2022 and April 2025 were assessed through ten independent runs per benchmark. The findings reveal dramatic improvements in biological capabilities. Top model performance increased more than 4-fold on the challenging text-only subset of the Virology Capabilities Test over the study period, with OpenAI's o3 now performing twice as well as expert virologists. Several models now match or exceed expert-level performance on other challenging benchmarks, including the biology subsets of GPQA and WMDP and LAB-Bench CloningScenarios. Contrary to expectations, chain-of-thought did not substantially improve performance over zero-shot evaluation, while extended reasoning…
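The abstract reports scores aggregated over ten independent runs per benchmark and compares them as ratios ("4-fold", "twice as well"). The arithmetic behind such comparisons can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: the function names `summarize_runs` and `fold_change` are hypothetical, and every number is a made-up placeholder rather than a result from the study.

```python
from statistics import mean, stdev


def summarize_runs(run_accuracies):
    """Return (mean, sample standard deviation) of per-run accuracies."""
    if len(run_accuracies) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return mean(run_accuracies), stdev(run_accuracies)


def fold_change(later, earlier):
    """Ratio of a later score to an earlier (or baseline) score."""
    if earlier <= 0:
        raise ValueError("baseline score must be positive")
    return later / earlier


# Placeholder accuracies for illustration only; NOT results from the paper.
early_model_runs = [0.10] * 10
late_model_runs = [0.40, 0.42, 0.38, 0.40, 0.41, 0.39, 0.40, 0.40, 0.43, 0.37]
expert_baseline = 0.20  # hypothetical expert accuracy

early_mean, _ = summarize_runs(early_model_runs)
late_mean, late_sd = summarize_runs(late_model_runs)

print(f"late model: {late_mean:.3f} +/- {late_sd:.3f} over {len(late_model_runs)} runs")
print(f"fold change vs. early model: {fold_change(late_mean, early_mean):.2f}x")
print(f"ratio to expert baseline: {fold_change(late_mean, expert_baseline):.2f}x")
```

Reporting the spread alongside the mean shows whether a gap between two models, or between a model and the expert baseline, is larger than the run-to-run variation.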
Taxonomy
Topics: Computational and Text Analysis Methods · Topic Modeling · Artificial Intelligence in Healthcare and Education
