BoxingGym: Benchmarking Progress in Automated Experimental Design and Model Discovery
Kanishk Gandhi, Michael Y. Li, Lyle Goodyear, Agam Bhatia, Louise Li, Aditi Bhaskar, Mohammed Zaman, Noah D. Goodman

TL;DR
BoxingGym is a comprehensive benchmark for evaluating AI agents' abilities to design experiments and discover scientific models across diverse domains, highlighting current limitations of LLMs like GPT-4.
Contribution
This paper introduces BoxingGym, a novel benchmark with 10 environments for systematic evaluation of experimental design and model discovery in AI agents.
Findings
Current LLMs struggle with experimental design and model discovery.
Augmenting LLMs with explicit statistical models does not reliably improve performance.
The benchmark enables quantitative assessment of scientific reasoning in AI.
Abstract
Understanding the world and explaining it with scientific theories is a central aspiration of artificial intelligence research. Proposing theories, designing experiments to test them, and then revising them based on data are fundamental to scientific discovery. Despite the significant promise of LLM-based scientific agents, no benchmarks systematically test LLMs' ability to propose scientific models, collect experimental data, and revise those models in light of new data. We introduce BoxingGym, a benchmark with 10 environments for systematically evaluating both experimental design (e.g. collecting data to test a scientific theory) and model discovery (e.g. proposing and revising scientific theories). To enable tractable and quantitative evaluation, we implement each environment as a generative probabilistic model with which a scientific agent can run interactive experiments. These…
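To make the idea of "an environment as a generative probabilistic model" concrete, here is a minimal sketch in Python. It is not BoxingGym's code or API: the class `ToyDoseResponseEnv`, its methods, the logistic form, the parameter ranges, and the `bisection_agent` strategy are all hypothetical, chosen only to illustrate how an agent might pick an input, observe a sampled outcome, and refine its estimate.

```python
import math
import random


class ToyDoseResponseEnv:
    """Hypothetical environment: a hidden logistic model the agent can query."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        # Hidden parameters drawn from an assumed prior; the agent never sees them.
        self._midpoint = self._rng.uniform(2.0, 8.0)
        self._slope = self._rng.uniform(0.5, 2.0)

    def response_probability(self, dose):
        """Probability of a positive outcome at the given dose."""
        return 1.0 / (1.0 + math.exp(-self._slope * (dose - self._midpoint)))

    def run_experiment(self, dose):
        """Sample one noisy binary outcome (0 or 1) for the chosen dose."""
        return int(self._rng.random() < self.response_probability(dose))


def bisection_agent(env, low=0.0, high=10.0, rounds=8, repeats=20):
    """Naive design strategy: repeatedly test the middle of the interval."""
    for _ in range(rounds):
        dose = (low + high) / 2.0
        rate = sum(env.run_experiment(dose) for _ in range(repeats)) / repeats
        # Shrink the interval toward the dose where outcomes flip.
        if rate < 0.5:
            low = dose
        else:
            high = dose
    return (low + high) / 2.0


env = ToyDoseResponseEnv(seed=1)
estimate = bisection_agent(env)
print(f"estimated midpoint: {estimate:.2f}")
```

Because the environment is an explicit generative model, the agent's estimate can be scored against the hidden parameters, which is the kind of quantitative evaluation the abstract describes.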
Taxonomy
Topics: Data Visualization and Analytics · Sports Analytics and Performance · Time Series Analysis and Forecasting
