BioDSA-1K: Benchmarking Data Science Agents for Biomedical Research

Zifeng Wang; Benjamin Danek; Jimeng Sun

arXiv:2505.16100·cs.AI·May 23, 2025

Open Access

TL;DR

BioDSA-1K is a benchmark of 1,029 biomedical hypothesis validation tasks, curated from over 300 published studies, that evaluates AI agents on realistic scientific reasoning, evidence interpretation, and analysis execution in biomedical research.

Contribution

This work introduces BioDSA-1K, the first large-scale benchmark for testing AI agents on authentic biomedical hypothesis validation tasks derived from published studies.

Findings

01. Benchmark covers 1,029 tasks with 1,177 analysis plans

02. Evaluates hypothesis accuracy, evidence alignment, reasoning correctness, and code executability

03. Includes non-verifiable hypotheses to reflect real-world scientific uncertainty

Abstract

Validating scientific hypotheses is a central challenge in biomedical research, and remains difficult for artificial intelligence (AI) agents due to the complexity of real-world data analysis and evidence interpretation. In this work, we present BioDSA-1K, a benchmark designed to evaluate AI agents on realistic, data-driven biomedical hypothesis validation tasks. BioDSA-1K consists of 1,029 hypothesis-centric tasks paired with 1,177 analysis plans, curated from over 300 published biomedical studies to reflect the structure and reasoning found in authentic research workflows. Each task includes a structured hypothesis derived from the original study's conclusions, expressed in the affirmative to reflect the language of scientific reporting, and one or more pieces of supporting evidence grounded in empirical data tables. While these hypotheses mirror published claims, they remain testable…
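The task structure described in the abstract can be pictured with a minimal sketch. The field names, the three-way label set, and the scoring function below are assumptions made for illustration; they are not the BioDSA-1K schema or its official metric, and the toy records are placeholders rather than benchmark data.

```python
from dataclasses import dataclass, field
from typing import List, Sequence

# Hypothetical label set. The source states only that some hypotheses
# are non-verifiable; the exact labels used by the benchmark are assumed.
LABELS = ("supported", "refuted", "non-verifiable")


@dataclass
class HypothesisTask:
    """Illustrative record for one hypothesis-centric task (assumed fields)."""
    hypothesis: str                      # affirmative statement to validate
    evidence: List[str]                  # supporting evidence tied to data tables
    label: str                           # ground-truth decision, one of LABELS
    analysis_plans: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label!r}")


def decision_accuracy(tasks: Sequence[HypothesisTask],
                      predictions: Sequence[str]) -> float:
    """Fraction of tasks where the agent's decision matches the label."""
    if len(tasks) != len(predictions):
        raise ValueError("tasks and predictions must have the same length")
    if not tasks:
        return 0.0
    correct = sum(t.label == p for t, p in zip(tasks, predictions))
    return correct / len(tasks)


# Toy placeholder records, not taken from the benchmark.
example_tasks = [
    HypothesisTask("Hypothesis A", ["evidence A1"], "supported", ["plan A"]),
    HypothesisTask("Hypothesis B", ["evidence B1"], "refuted", ["plan B"]),
    HypothesisTask("Hypothesis C", [], "non-verifiable"),
]
example_predictions = ["supported", "supported", "non-verifiable"]
example_accuracy = decision_accuracy(example_tasks, example_predictions)
```

Treating "non-verifiable" as a label in its own right, rather than dropping such tasks, lets an agent be credited for recognizing that the available data cannot settle a hypothesis. This sketch covers only the decision-accuracy axis; evidence alignment, reasoning correctness, and code executability would each need their own checks.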

Taxonomy

Topics: Biomedical Text Mining and Ontologies