Auto-Bench: An Automated Benchmark for Scientific Discovery in LLMs

Tingting Chen; Srinivas Anumasa; Beibei Lin; Vedant Shah; Anirudh Goyal; Dianbo Liu

arXiv:2502.15224 · cs.LG · February 24, 2025

TL;DR

Auto-Bench is a new standardized benchmark designed to evaluate Large Language Models' ability to perform scientific discovery tasks involving causal reasoning, hypothesis generation, and decision-making in natural and social sciences.

Contribution

The paper introduces Auto-Bench, the first comprehensive benchmark for assessing LLMs' scientific discovery capabilities through causal graph discovery tasks.

Findings

01

The performance of state-of-the-art LLMs drops significantly as problem complexity increases.

02

A significant performance gap exists between current LLMs and human scientists.

03

Auto-Bench enables systematic evaluation of LLMs' scientific reasoning abilities.

Abstract

Given the remarkable performance of Large Language Models (LLMs), an important question arises: can LLMs conduct human-like scientific research, discover new knowledge, and act as AI scientists? Scientific discovery is an iterative process that demands efficient knowledge updating and encoding. It involves understanding the environment, identifying new hypotheses, and reasoning about actions; however, no standardized benchmark specifically designed for scientific discovery exists for LLM agents. To address this gap, we introduce a novel benchmark, Auto-Bench, that covers the aspects needed to evaluate LLMs for scientific discovery in both natural and social sciences. Our benchmark is based on the principles of causal graph discovery. It challenges models to uncover hidden structures and make optimal decisions, which includes generating valid…
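To make the idea of "uncovering hidden structures" concrete, here is a minimal toy sketch of causal graph discovery through interventions. It is not the paper's environment or evaluation protocol: the oracle, the function names (`intervene`, `discover_reachability`, `transitive_reduction`), and the example graph are all hypothetical, and the sketch assumes a noise-free setting in which intervening on a node changes exactly its descendants.

```python
def descendants(graph, node):
    """Return every node reachable from `node` in the hidden DAG."""
    seen, stack = set(), [node]
    while stack:
        current = stack.pop()
        for child in graph.get(current, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def intervene(graph, node):
    """Hypothetical oracle: toggling `node` changes exactly its descendants."""
    return descendants(graph, node)


def discover_reachability(nodes, oracle):
    """Intervene once on each node and record which other nodes respond."""
    return {node: oracle(node) for node in nodes}


def transitive_reduction(reach):
    """Keep edge u -> v only if no other responder w of u also reaches v."""
    edges = set()
    for u, targets in reach.items():
        for v in targets:
            if not any(v in reach[w] for w in targets if w != v):
                edges.add((u, v))
    return edges


# Hidden ground-truth graph (an assumption for illustration only).
hidden = {"A": ["B"], "B": ["C"], "C": [], "D": ["C"]}

# The "agent" only sees the oracle, never `hidden` directly.
reach = discover_reachability(hidden, lambda n: intervene(hidden, n))
recovered = transitive_reduction(reach)
true_edges = {(u, v) for u, vs in hidden.items() for v in vs}

print(sorted(recovered))
print(recovered == true_edges)
```

In this sketch the recovered edges match the hidden graph only because the graph contains no redundant shortcut edges (for example, a direct A -> C alongside A -> B -> C); with such shortcuts, descendant observations alone would identify reachability but not every direct edge.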

Taxonomy

Methods: Attention Is All You Need · Byte Pair Encoding · Dense Connections · Residual Connection · Absolute Position Encodings · Linear Layer · Layer Normalization · Label Smoothing · Multi-Head Attention · Position-Wise Feed-Forward Layer