AblationBench: Evaluating Automated Planning of Ablations in Empirical AI Research
Talor Abramovich, Gal Chechik

TL;DR
AblationBench is a new benchmark suite designed to evaluate language models' ability to plan ablation experiments in empirical AI research, highlighting current limitations and guiding future improvements.
Contribution
Introduces AblationBench, a benchmark suite with two tasks (AuthorAblation and ReviewerAblation) for assessing LM performance in ablation planning, and analyzes the challenges faced by current models.
Findings
The best-performing LM system identifies only 38% of the original ablations on average, below human-level performance.
Performance varies inversely between author and reviewer tasks.
Chain-of-thought prompting outperforms agent-based approaches.
Abstract
Language model agents are increasingly used to automate scientific research, yet evaluating their scientific contributions remains a challenge. A key mechanism to obtain such insights is through ablation experiments. To this end, we introduce AblationBench, a benchmark suite for evaluating agents on ablation planning tasks in empirical AI research. It includes two tasks: AuthorAblation, which helps authors propose ablation experiments based on a method section and contains 83 instances, and ReviewerAblation, which helps reviewers find missing ablations in a full paper and contains 350 instances. For both tasks, we develop LM-based judges that serve as an automatic evaluation framework. Our experiments with frontier LMs show that these tasks remain challenging, with the best-performing LM system identifying only 38% of the original ablations on average, below human-level performance. We…
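One way to read the "identifying only 38% of the original ablations on average" figure is as a per-paper recall averaged over instances. The sketch below illustrates that arithmetic only. It assumes a judge has already marked which reference ablations a generated plan covers; the function names, data layout, and example IDs are hypothetical and are not taken from the AblationBench code, whose actual metric may differ.

```python
from statistics import mean


def paper_recall(reference_ids, matched_ids):
    """Fraction of a paper's reference ablations that the judge marked as matched."""
    reference = set(reference_ids)
    if not reference:
        raise ValueError("a paper needs at least one reference ablation")
    # Ignore any matched IDs that are not in the reference set.
    return len(reference & set(matched_ids)) / len(reference)


def mean_recall(judgements):
    """Average per-paper recall over (reference_ids, matched_ids) pairs."""
    return mean(paper_recall(ref, matched) for ref, matched in judgements)


# Hypothetical judge output for three papers (made-up IDs, not benchmark data).
example = [
    (["a1", "a2", "a3", "a4"], ["a1", "a3"]),  # 2 of 4 matched
    (["b1", "b2"], []),                        # 0 of 2 matched
    (["c1", "c2", "c3", "c4"], ["c2"]),        # 1 of 4 matched
]
score = mean_recall(example)
```

Averaging per paper, rather than pooling all ablations, keeps papers with many ablations from dominating the score; whether the benchmark makes the same choice is an assumption here.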
