LABBench2: An Improved Benchmark for AI Systems Performing Biology Research
Jon M Laurent, Albert Bou, Michael Pieler, Conor Igoe, Alex Andonian, Siddharth Narayanan, James Braza, Alexandros Sanchez Vassopoulos, Jacob L Steenwyk, Blake Lash, Andrew D White, Samuel G Rodriques

TL;DR
LABBench2 is a benchmark of nearly 1,900 tasks designed to evaluate AI systems' real-world scientific research capabilities. It is more challenging than its predecessor, LAB-Bench, and leaves significant room for model improvement.
Contribution
The paper introduces LABBench2, an improved and more challenging benchmark for assessing the practical scientific research abilities of AI systems. It expands on prior work with more realistic tasks.
Findings
Current models show substantial performance gaps on LABBench2.
LABBench2's difficulty range indicates significant room for AI development.
Performance improvements over LAB-Bench are notable but still incomplete.
Abstract
Optimism for accelerating scientific discovery with AI continues to grow. Current applications of AI in scientific research range from training dedicated foundation models on scientific data to agentic autonomous hypothesis generation systems to AI-driven autonomous labs. Efforts to measure the progress of AI systems in scientific domains must correspondingly not only accelerate but also shift focus toward real-world capabilities: beyond rote knowledge, and even beyond reasoning alone, to measuring the ability to perform meaningful work. Prior work introduced the Language Agent Biology Benchmark (LAB-Bench) as an initial attempt at measuring these abilities. Here we introduce an evolution of that benchmark, LABBench2, for measuring real-world capabilities of AI systems performing useful scientific tasks. LABBench2 comprises nearly 1,900 tasks and is, for the most part, a…
