BioDSA-1K: Benchmarking Data Science Agents for Biomedical Research
Zifeng Wang, Benjamin Danek, Jimeng Sun

TL;DR
BioDSA-1K is a comprehensive benchmark with over 1,000 biomedical hypothesis validation tasks designed to evaluate AI agents on realistic scientific reasoning, evidence interpretation, and analysis execution in biomedical research.
Contribution
This work introduces BioDSA-1K, the first large-scale benchmark for testing AI agents on authentic biomedical hypothesis validation tasks derived from published studies.
Findings
Benchmark covers 1,029 tasks with 1,177 analysis plans
Evaluates hypothesis accuracy, evidence alignment, reasoning correctness, and code executability
Includes non-verifiable hypotheses to reflect real-world scientific uncertainty
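As a rough illustration of how two of these metrics (hypothesis accuracy and code executability) might be tallied, the sketch below scores an agent's decisions against reference labels. The label names (`supported`, `refuted`, `not_verifiable`), the `AgentResult` record, and the `summarize` function are assumptions made for this example; they are not taken from the benchmark's own code.

```python
from dataclasses import dataclass
from typing import Dict, List

# Assumed label set: the source only states that some hypotheses are non-verifiable.
LABELS = {"supported", "refuted", "not_verifiable"}


@dataclass
class AgentResult:
    """One agent run on one task (hypothetical record layout)."""
    predicted: str   # label the agent chose
    reference: str   # label assigned by the benchmark
    code_ran: bool   # whether the agent's analysis code executed without error


def summarize(results: List[AgentResult]) -> Dict[str, float]:
    """Return the fraction of correct decisions and of successful code runs."""
    if not results:
        raise ValueError("no results to summarize")
    for r in results:
        if r.predicted not in LABELS or r.reference not in LABELS:
            raise ValueError(f"unknown label in {r!r}")
    n = len(results)
    return {
        "hypothesis_accuracy": sum(r.predicted == r.reference for r in results) / n,
        "code_executability": sum(r.code_ran for r in results) / n,
    }


# Invented runs, for illustration only.
runs = [
    AgentResult("supported", "supported", True),
    AgentResult("refuted", "supported", True),
    AgentResult("not_verifiable", "not_verifiable", False),
    AgentResult("supported", "supported", False),
]
summary = summarize(runs)
```

Treating `not_verifiable` as a third class rather than dropping those tasks means an agent is rewarded for recognizing when the data cannot settle a claim.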
Abstract
Validating scientific hypotheses is a central challenge in biomedical research, and remains difficult for artificial intelligence (AI) agents due to the complexity of real-world data analysis and evidence interpretation. In this work, we present BioDSA-1K, a benchmark designed to evaluate AI agents on realistic, data-driven biomedical hypothesis validation tasks. BioDSA-1K consists of 1,029 hypothesis-centric tasks paired with 1,177 analysis plans, curated from over 300 published biomedical studies to reflect the structure and reasoning found in authentic research workflows. Each task includes a structured hypothesis derived from the original study's conclusions, expressed in the affirmative to reflect the language of scientific reporting, and one or more pieces of supporting evidence grounded in empirical data tables. While these hypotheses mirror published claims, they remain testable…
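The task structure described in the abstract (a hypothesis, one or more pieces of supporting evidence tied to data tables, and associated analysis plans) could be represented along the following lines. This is a minimal sketch: the class names, field names, and the example values are hypothetical and do not reflect the benchmark's actual schema or contents.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Evidence:
    """A piece of supporting evidence (hypothetical fields)."""
    statement: str      # what the evidence says
    source_table: str   # name of the data table it is grounded in


@dataclass
class HypothesisTask:
    """One hypothesis-centric task (hypothetical layout)."""
    hypothesis: str                      # stated in the affirmative
    evidence: List[Evidence]             # one or more items
    analysis_plans: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # The abstract says each task carries at least one piece of evidence.
        if not self.evidence:
            raise ValueError("a task needs at least one piece of supporting evidence")


# Invented example values, for illustration only.
task = HypothesisTask(
    hypothesis="Mutation X is associated with shorter overall survival.",
    evidence=[Evidence("Survival differs between mutated and wild-type groups.",
                       "clinical_patient_table")],
    analysis_plans=["Compare survival curves between the two groups."],
)
```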
Taxonomy
Topics: Biomedical Text Mining and Ontologies
