BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol Understanding and Reasoning
Yuyang Liu, Liuzhenghao Lv, Xiancheng Zhang, Jingya Wang, Li Yuan, Yonghong Tian

TL;DR
BioProBench introduces a large-scale dataset and benchmark for improving biological protocol understanding and reasoning in language models, addressing current limitations in procedural accuracy and safety in scientific AI.
Contribution
The paper presents BioProBench and BioProCorpus, providing a comprehensive dataset and benchmark for biological protocol reasoning, and introduces ProAgent to enhance LLM performance in this domain.
Findings
LLMs perform well on general comprehension but struggle with deep reasoning.
Performance drops significantly on tasks requiring quantitative precision.
ProAgent, grounded in BioProCorpus, advances the state of the art.
Abstract
The realization of autonomous scientific experimentation is currently limited by LLMs' struggle to grasp the strict procedural logic and accuracy required by biological protocols. To address this fundamental challenge, we present BioProBench, a comprehensive resource for procedural reasoning in biology. BioProBench is grounded in BioProCorpus, a foundational collection of 27,000 human-written protocols. From this corpus, we systematically constructed a dataset of over 550,000 task instances, offering both a large-scale training resource and a rigorous benchmark with novel metrics. Evaluating 10 mainstream LLMs, we find that while general comprehension is high, performance drops significantly on tasks demanding deep reasoning, quantitative precision, and safety awareness. To demonstrate the value of BioProCorpus in mitigating these issues, we developed…
