Measuring Scientific Capabilities of Language Models with a Systems Biology Dry Lab

Haonan Duan; Stephen Zhewen Lu; Caitlin Fiona Harrigan; Nishkrit Desai; Jiarui Lu; Michał Koziarski; Leonardo Cotta; Chris J. Maddison

arXiv:2507.02083·cs.AI·July 15, 2025

TL;DR

SciGym is a novel benchmark that evaluates language models' abilities to design and analyze biological experiments using simulated systems, revealing current limitations and potential for improvement in scientific reasoning.

Contribution

Introduces SciGym, a dry lab benchmark for assessing LLMs' experimental design and analysis skills in biology, overcoming wet-lab costs and enabling complex system testing.

Findings

1. More capable models perform better but still struggle with complex systems.

2. All models' performance drops significantly as system complexity increases.

3. Provides a new platform for advancing LLM scientific capabilities.

Abstract

Designing experiments and interpreting results are core scientific competencies, particularly in biology, where researchers perturb complex systems to uncover the underlying systems. Recent efforts to evaluate the scientific capabilities of large language models (LLMs) fail to test these competencies because wet-lab experimentation is prohibitively expensive in expertise, time, and equipment. We introduce SciGym, a first-in-class benchmark that assesses LLMs' iterative experiment design and analysis abilities in open-ended scientific discovery tasks. SciGym overcomes the challenge of wet-lab costs by running a dry lab of biological systems. These systems are represented as models encoded in Systems Biology Markup Language (SBML), which are efficient for generating simulated data, making them ideal testbeds for experimentation on realistically complex systems. We evaluated six frontier LLMs on 137 small systems, and…
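To make the dry-lab idea concrete, here is a minimal sketch of a single perturbation experiment on a simulated system. It is not SciGym code and does not read SBML: the two-species reaction chain, the rate constants `k1` and `k2`, and the `simulate` helper are all hypothetical, chosen only to show how changing an initial condition and re-running a simulation can stand in for a wet-lab experiment.

```python
# Toy "dry lab" sketch: a hypothetical two-species chain A -> B -> (degraded).
# Species, rate constants, and function names are illustrative assumptions,
# not taken from SciGym or from any real SBML model.

def simulate(a0, b0, k1=0.5, k2=0.2, t_end=10.0, dt=0.001):
    """Integrate dA/dt = -k1*A and dB/dt = k1*A - k2*B with forward Euler.

    Returns the concentrations (A, B) at time t_end.
    """
    a, b = a0, b0
    steps = int(round(t_end / dt))
    for _ in range(steps):
        da = -k1 * a            # A is consumed by the first reaction
        db = k1 * a - k2 * b    # B is produced from A and then degraded
        a += da * dt
        b += db * dt
    return a, b

# Baseline run, then an "experiment" that doubles the initial amount of A.
baseline = simulate(a0=1.0, b0=0.0)
perturbed = simulate(a0=2.0, b0=0.0)

print("baseline  A, B:", baseline)
print("perturbed A, B:", perturbed)
```

Because this toy system is linear, doubling the initial concentration of A doubles both final concentrations. That proportional response is the kind of observation an agent could use to infer a reaction's rate law from simulated data alone.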
