DiscoveryBench: Towards Data-Driven Discovery with Large Language Models

Bodhisattwa Prasad Majumder; Harshit Surana; Dhruv Agarwal; Bhavana Dalvi Mishra; Abhijeetsingh Meena; Aryan Prakhar; Tirth Vora; Tushar Khot; Ashish Sabharwal; Peter Clark

arXiv:2407.01725·cs.CL·July 3, 2024·2 cites

Open Access · 1 Repo · 1 Dataset · 3 Reviews

TL;DR

DiscoveryBench is a comprehensive benchmark designed to evaluate large language models' capabilities in automating multi-step data-driven discovery across diverse real-world and synthetic tasks, highlighting current limitations.

Contribution

This work introduces DiscoveryBench, the first benchmark formalizing multi-step data-driven discovery tasks for LLM evaluation, with diverse real and synthetic datasets and a structured formalism for analysis.

Findings

01. Current LLMs score only around 25% on the benchmark

02. The benchmark covers 264 real-world and 903 synthetic tasks

03. Structured evaluation reveals specific failure modes in LLM discovery capabilities

Abstract

Can the rapid advances in code generation, function calling, and data analysis using large language models (LLMs) help automate the search and verification of hypotheses purely from a set of provided datasets? To evaluate this question, we present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery. The benchmark is designed to systematically assess current model capabilities in discovery tasks and provide a useful resource for improving them. Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering, by manually deriving discovery workflows from published papers to approximate the real-world challenges faced by researchers, where each task is defined by a dataset, its metadata, and a discovery goal in natural language. We additionally provide 903 synthetic tasks to conduct…
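The task definition in the abstract (a dataset, its metadata, and a natural-language discovery goal) can be pictured with a minimal sketch. This is not the benchmark's actual schema or loader: the class `DiscoveryTask`, its field names, the helper `build_prompt`, the file name, and the column descriptions are all hypothetical, chosen only for illustration. The goal string is the example discovery question quoted by one of the reviewers.

```python
from dataclasses import dataclass


@dataclass
class DiscoveryTask:
    """Hypothetical container for one task: dataset, metadata, and goal."""
    dataset_path: str          # assumed: path to a tabular data file
    metadata: dict[str, str]   # assumed: column name -> short description
    goal: str                  # discovery goal in natural language
    domain: str = "unknown"    # assumed: optional domain label


def build_prompt(task: DiscoveryTask) -> str:
    """Flatten a task into plain text that could be handed to an LLM agent."""
    column_lines = [f"- {name}: {desc}" for name, desc in task.metadata.items()]
    return "\n".join(
        [f"Dataset: {task.dataset_path}", "Columns:", *column_lines, f"Goal: {task.goal}"]
    )


# Illustrative instance; the file name and columns are made up.
example = DiscoveryTask(
    dataset_path="invasive_plants.csv",
    metadata={
        "urban_land_use": "share of urban land in the area",
        "introduced_species": "count of introduced plant species",
    },
    goal="How did urban land use affect the invasion of introduced plants in Catalonia?",
    domain="ecology",
)
prompt = build_prompt(example)
```

Keeping the goal as free text and the metadata as plain descriptions mirrors the abstract's framing, in which the system receives only the data and a question and must search for and verify a hypothesis itself.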

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

S1) The authors introduce a simple yet expressive notion of data-driven hypothesis. Discovering and validating such hypotheses from a dataset is a challenging problem for LLMs.
S2) The authors develop a comprehensive benchmark to test the capabilities of LLMs in discovering data-driven hypotheses.
S3) The authors use the benchmark to test some popular LLM-based reasoning frameworks, drawing useful conclusions about the capabilities of these systems in discovering data-driven hypotheses.

Weaknesses

W1) As part of the benchmark, the authors developed some synthetic tests. These tests are supposed to capture synthetic task examples constructed from workflows in published works. However, the authors do not clearly explain in what sense these synthetic tests properly represent these workflows.

Reviewer 02 · Rating 8 · Confidence 4

Strengths

- The role of LLMs in the scientific method is still unknown (if one exists), and evaluating their capacity to accelerate the process of knowledge discovery is a topic of significant interest. This work represents an incremental advancement in this domain, providing a formal definition of data-driven hypotheses and expanding the search space for these hypotheses. The paper also proposes evaluating proposed hypotheses against a gold standard using semantic similarity measures. In my opinion, this

Weaknesses

- The process of finding a data-driven hypothesis can be time/energy-consuming. This aspect is not discussed in the paper, and I understand that the page limit might not allow space for such discussions. Although this is not the main focus of the paper, it could be useful to explore whether there is a relationship between the size of the search space and the performance of the results.
- The proposed formalism for defining data-driven hypotheses, discovery goals, and task difficulty is somewhat

Reviewer 03 · Rating 5 · Confidence 4

Strengths

The authors attempt to formalize the problem of data-driven discovery, which, in my opinion, is a crucial step towards solving this task. In addition, a benchmark is introduced -- unfortunately, I did not have the chance to look at it in more detail, but the authors claim that they will publicly release it.

Weaknesses

I am confused by the formulation of the problem of data-driven discovery in Section 3. In line 049, the authors give an example of a data-driven discovery task: “How did urban land use affect the invasion of introduced plants in Catalonia?” The answer to this task is the ways urban land use affected the invasion of introduced plants in Catalonia (if this is the case). However, in Section 3, line 152, you define a hypothesis as a tuple of three elements and in line 141, you say that a hypot

Code & Models

Repositories

allenai/discoverybench
Official

Datasets

nhop/discoverybench
dataset · 513 dl

Taxonomy

Topics: Semantic Web and Ontologies · Natural Language Processing Techniques · Advanced Database Systems and Queries

Methods: Sparse Evolutionary Training