AblationBench: Evaluating Automated Planning of Ablations in Empirical AI Research

Talor Abramovich; Gal Chechik

arXiv:2507.08038 · cs.CL · February 3, 2026

TL;DR

AblationBench is a benchmark suite that evaluates how well language models plan ablation experiments in empirical AI research and shows where current models fall short.

Contribution

Introduces AblationBench, a benchmark with two tasks for assessing LM performance in ablation planning, and analyzes the challenges faced by current models.

Findings

01. The best-performing LM system identifies only 38% of the original ablations on average, below human-level performance.

02. Performance varies inversely between author and reviewer tasks.

03. Chain-of-thought prompting outperforms agent-based approaches.

Abstract

Language model agents are increasingly used to automate scientific research, yet evaluating their scientific contributions remains a challenge. A key mechanism to obtain such insights is through ablation experiments. To this end, we introduce AblationBench, a benchmark suite for evaluating agents on ablation planning tasks in empirical AI research. It includes two tasks: AuthorAblation, which helps authors propose ablation experiments based on a method section and contains 83 instances, and ReviewerAblation, which helps reviewers find missing ablations in a full paper and contains 350 instances. For both tasks, we develop LM-based judges that serve as an automatic evaluation framework. Our experiments with frontier LMs show that these tasks remain challenging, with the best-performing LM system identifying only 38% of the original ablations on average, below human-level performance. We…
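The headline figure in the abstract (the share of a paper's original ablations that a system identifies, averaged over instances) can be read as a recall-style score. The sketch below illustrates that reading only; it is not the benchmark's actual scoring code. The function names, the `judge` callback, and the keyword-overlap `toy_judge` are all hypothetical stand-ins, and the paper's LM-based judges are not reproduced here.

```python
from typing import Callable, Sequence

# A judge decides whether a proposed ablation matches a reference ablation.
# In the paper this role is played by an LM-based judge; here it is just a
# callback so the scoring logic can be shown in isolation.
Judge = Callable[[str, str], bool]


def identified_fraction(gold: Sequence[str], proposed: Sequence[str], judge: Judge) -> float:
    """Fraction of reference ablations matched by at least one proposal."""
    if not gold:
        raise ValueError("need at least one reference ablation")
    hits = sum(1 for g in gold if any(judge(g, p) for p in proposed))
    return hits / len(gold)


def mean_identified_fraction(
    instances: Sequence[tuple[Sequence[str], Sequence[str]]], judge: Judge
) -> float:
    """Average the per-instance fraction over (gold, proposed) pairs."""
    if not instances:
        raise ValueError("need at least one instance")
    scores = [identified_fraction(gold, proposed, judge) for gold, proposed in instances]
    return sum(scores) / len(scores)


def toy_judge(gold: str, proposed: str) -> bool:
    """Hypothetical stand-in judge: match when two or more words overlap."""
    shared = set(gold.lower().split()) & set(proposed.lower().split())
    return len(shared) >= 2


# Made-up example data, for illustration only.
gold_ablations = [
    "remove attention layer",
    "replace learned positional encoding",
    "drop data augmentation",
]
proposed_ablations = [
    "ablate the attention layer",
    "train without data augmentation",
]

score = identified_fraction(gold_ablations, proposed_ablations, toy_judge)
```

Keeping the judge as a plain callback separates the matching decision from the aggregation, so a stronger judge could be swapped in without touching the scoring functions.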
