HeurekaBench: A Benchmarking Framework for AI Co-scientist
Siba Smarak Panigrahi, Jovana Videnović, Maria Brbić

TL;DR
HeurekaBench is a new benchmarking framework that enables realistic, end-to-end evaluation of AI co-scientists in scientific research, specifically demonstrated in single-cell biology, by creating open-ended research questions grounded in real data and workflows.
Contribution
The paper introduces HeurekaBench, a semi-automated framework for creating scientific benchmarks with exploratory questions, and demonstrates its use in evaluating and improving AI scientific agents.
Findings
Adding a critic module improves response quality by up to 22%.
HeurekaBench enables rigorous, real-world evaluation of scientific AI agents.
The framework facilitates comparison of different agent design choices.
Abstract
LLM-based reasoning models have enabled the development of agentic systems that act as co-scientists, assisting in multi-step scientific analysis. However, evaluating these systems is challenging, as it requires realistic, end-to-end research scenarios that integrate data analysis, interpretation, and the generation of new insights from the experimental data. To address this limitation, we introduce HeurekaBench, a framework to create benchmarks with exploratory, open-ended research questions for experimental datasets. Each such question is grounded in a scientific study and its corresponding code repository, and is created using a semi-automated pipeline that leverages multiple LLMs to extract insights and generate candidate workflows, which are then verified against reported findings. We instantiate the framework in single-cell biology to obtain sc-HeurekaBench benchmark and use it to…
Peer Reviews
Decision: ICLR 2026 Poster
This paper proposes a novel framework for building benchmarks to evaluate LLM-based agents acting as AI co-scientists, effectively addressing the critical gap in evaluating open-ended, data-driven scientific discovery agents. Moreover, the paper is of relatively high quality, providing a comprehensive and clear exposition of its semi-automated pipeline and evaluation methodology, advancing beyond narrow task-solving, and demonstrating significant practical value of a critic module for the develop
The experiments in this paper are insufficient, as they lack comparative analysis with other benchmark construction methodologies[^1]. Without such comparisons, the work fails to fully demonstrate its superiority. Additionally, this study does not provide experimental analysis on the role of large language models as evaluators, such as comparing their performance with expert assessments to validate the reliability of this evaluation approach.
[^1]: Erpai Luo, Jinmeng Jia, Yifan Xiong, Xiangyu L
- The paper addresses an important gap in evaluating AI agents for scientific discovery by moving beyond simple factual recall or single-step computation tasks to assess genuine exploratory, multi-step data-driven reasoning capabilities.
- The semi-automated pipeline with human verification for insight validation is well-designed, ensuring that benchmark questions are grounded in reproducible scientific findings rather than relying solely on LLM generation capabilities.
- The experimental anal
- The benchmark's scope is limited to only 13 papers and 50 questions in single-cell biology, raising concerns about generalizability and whether this sample size is sufficient to robustly evaluate agent capabilities across the diversity of real scientific discovery scenarios.
- The evaluation relies heavily on GPT-4o as an LLM judge for OEQs, which introduces potential biases and may favor agents using similar models, yet the paper provides limited analysis of inter-rater reliability or valida
1. Questions are grounded in real papers, data, and code, and require multi-step analysis and evidence-based reasoning rather than pure retrieval or recall, matching the “co-scientist” setting.
2. The benchmark compares multiple single-cell agents under a common setup and systematically ablates planner/critic/retriever to quantify their impact and provide design takeaways.
1. The evaluation relies on LLM-as-judge with an atomic-facts rubric (GPT-4o), and the manuscript does not report human adjudication or agreement statistics, leaving uncertainty and potential bias.
2. MCQ choices are LLM-generated; the authors acknowledge some “incorrect” options can appear scientifically plausible, so they also report precision/recall, indicating limited answer uniqueness that may complicate assessment.
3. Methodological details about the agent/tooling side remain sparse: the pap
Taxonomy
Topics: Single-cell and spatial transcriptomics · Cell Image Analysis Techniques · Zebrafish Biomedical Research Applications
