DiscoveryBench: Towards Data-Driven Discovery with Large Language Models
Bodhisattwa Prasad Majumder, Harshit Surana, Dhruv Agarwal, Bhavana Dalvi Mishra, Abhijeetsingh Meena, Aryan Prakhar, Tirth Vora, Tushar Khot, Ashish Sabharwal, Peter Clark

TL;DR
DiscoveryBench is a comprehensive benchmark designed to evaluate large language models' capabilities in automating multi-step data-driven discovery across diverse real-world and synthetic tasks, highlighting current limitations.
Contribution
This work introduces DiscoveryBench, the first benchmark formalizing multi-step data-driven discovery tasks for LLM evaluation, with diverse real and synthetic datasets and a structured formalism for analysis.
Findings
Even the best-performing system scores only about 25% on the benchmark
The benchmark covers 264 real-world and 903 synthetic tasks
Structured evaluation reveals specific failure modes in LLM discovery capabilities
Abstract
Can the rapid advances in code generation, function calling, and data analysis using large language models (LLMs) help automate the search and verification of hypotheses purely from a set of provided datasets? To evaluate this question, we present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery. The benchmark is designed to systematically assess current model capabilities in discovery tasks and provide a useful resource for improving them. Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering, by manually deriving discovery workflows from published papers to approximate the real-world challenges faced by researchers, where each task is defined by a dataset, its metadata, and a discovery goal in natural language. We additionally provide 903 synthetic tasks to conduct…
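The task definition in the abstract (a dataset, its metadata, and a natural-language discovery goal) could be represented roughly as follows. This is a minimal illustrative sketch, not the benchmark's actual schema: the class name `DiscoveryTask`, its field names, the `prompt` helper, and the file and column names are all hypothetical. Only the goal text is taken from the example quoted in the reviews.

```python
from dataclasses import dataclass, field


@dataclass
class DiscoveryTask:
    """Hypothetical container for one task: dataset, metadata, and goal."""

    dataset_path: str  # assumed: location of the task's data file
    goal: str  # discovery goal in natural language
    metadata: dict = field(default_factory=dict)  # assumed: column name -> description

    def prompt(self) -> str:
        """Render the task as plain text that could be handed to an LLM agent."""
        columns = "\n".join(
            f"- {name}: {desc}" for name, desc in self.metadata.items()
        )
        return f"Dataset: {self.dataset_path}\nColumns:\n{columns}\nGoal: {self.goal}"


task = DiscoveryTask(
    dataset_path="example.csv",  # placeholder file name
    goal="How did urban land use affect the invasion of introduced plants in Catalonia?",
    metadata={"land_use": "placeholder column description"},  # placeholder metadata
)
print(task.prompt())
```

Keeping the goal as free text mirrors the abstract's description. How the benchmark itself stores datasets and metadata is not specified here.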
Peer Reviews
Decision·ICLR 2025 Poster
S1) The authors introduce a simple yet expressive notion of data-driven hypothesis. Discovering and validating such hypotheses from a dataset is a challenging problem for LLMs.
S2) The authors develop a comprehensive benchmark to test the capabilities of LLMs in discovering data-driven hypotheses.
S3) The authors use the benchmark to test some popular LLM-based reasoning frameworks, drawing useful conclusions about the capabilities of these systems in discovering data-driven hypotheses.
W1) As part of the benchmark, the authors developed some synthetic tests. These tests are supposed to capture synthetic task examples constructed from workflows in published works. However, the authors do not clearly explain in what sense these synthetic tests properly represent these workflows.
- The role of LLMs in the scientific method is still unknown (if one exists), and evaluating their capacity to accelerate the process of knowledge discovery is a topic of significant interest. This work represents an incremental advancement in this domain, providing a formal definition of data-driven hypotheses and expanding the search space for these hypotheses. The paper also proposes evaluating proposed hypotheses against a gold standard using semantic similarity measures. In my opinion, this
- The process of finding a data-driven hypothesis can be time/energy-consuming. This aspect is not discussed in the paper, and I understand that the page limit might not allow space for such discussions. Although this is not the main focus of the paper, it could be useful to explore whether there is a relationship between the size of the search space and the performance of the results.
- The proposed formalism for defining data-driven hypotheses, discovery goals, and task difficulty is somewhat
The authors attempt to formalize the problem of data-driven discovery, which, in my opinion, is a crucial step towards solving this task. In addition, a benchmark is introduced -- unfortunately, I did not have the chance to look at it in more detail, but the authors claim that they will publicly release it.
I am confused by the formulation of the problem of data-driven discovery in Section 3. In line 049, the authors give an example of a data-driven discovery task: “How did urban land use affect the invasion of introduced plants in Catalonia?” The answer to this task is the ways urban land use affected the invasion of introduced plants in Catalonia (if this is the case). However, in Section 3, line 152, you define a hypothesis as a tuple of three elements and in line 141, you say that a hypot
Taxonomy
Topics: Semantic Web and Ontologies · Natural Language Processing Techniques · Advanced Database Systems and Queries
Methods: Sparse Evolutionary Training
