BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol Understanding and Reasoning

Yuyang Liu; Liuzhenghao Lv; Xiancheng Zhang; Jingya Wang; Li Yuan; Yonghong Tian

arXiv:2505.07889 · cs.CL · January 22, 2026

TL;DR

BioProBench introduces a large-scale dataset and benchmark for improving biological protocol understanding and reasoning in language models, addressing current limitations in procedural accuracy and safety in scientific AI.

Contribution

The paper presents BioProBench and BioProCorpus, providing a comprehensive dataset and benchmark for biological protocol reasoning, and introduces ProAgent to enhance LLM performance in this domain.

Findings

01. LLMs perform well on general comprehension but struggle with deep reasoning.

02. Performance drops significantly on tasks requiring quantitative precision.

03. ProAgent, based on BioProCorpus, advances the state-of-the-art.

Abstract

The realization of autonomous scientific experimentation is currently limited by LLMs' struggle to grasp the strict procedural logic and accuracy required by biological protocols. To address this fundamental challenge, we present BioProBench, a comprehensive resource for procedural reasoning in biology. BioProBench is grounded in BioProCorpus, a foundational collection of 27,000 human-written protocols. From this corpus, we systematically constructed a dataset of over 550,000 task instances, offering both a large-scale training resource and a rigorous benchmark with novel metrics. Evaluating 10 mainstream LLMs, we find that while general comprehension is high, performance drops significantly on tasks demanding deep reasoning, quantitative precision, and safety awareness. To demonstrate the value of BioProCorpus in mitigating these issues, we developed…

Code & Models

Repositories

yuyangsunshine/bioprotocolbench (official)
