AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research

Yilun Zhao; Weiyuan Chen; Zhijian Xu; Manasi Patwardhan; Yixin Liu; Chengye Wang; Lovekesh Vig; Arman Cohan

arXiv:2507.13300 · cs.CL · July 18, 2025

TL;DR

AbGen is a benchmark for evaluating large language models' ability to design ablation studies in scientific research, revealing current models' limitations and the unreliability of automated evaluation methods.

Contribution

This paper introduces AbGen, the first benchmark for assessing how well LLMs design ablation studies for scientific research, and develops AbGen-Eval, a meta-evaluation benchmark for assessing the reliability of automated evaluation methods on this task.

Findings

01. LLMs lag behind human experts in the importance, faithfulness, and soundness of their ablation study designs.

02. Current automated evaluation methods are unreliable for this task: their judgments diverge significantly from human assessment.

03. AbGen-Eval is developed to measure how reliably automated methods assess LLM performance.

Abstract

We introduce AbGen, the first benchmark designed to evaluate the capabilities of LLMs in designing ablation studies for scientific research. AbGen consists of 1,500 expert-annotated examples derived from 807 NLP papers. In this benchmark, LLMs are tasked with generating detailed ablation study designs for a specified module or process based on the given research context. Our evaluation of leading LLMs, such as DeepSeek-R1-0528 and o4-mini, highlights a significant performance gap between these models and human experts in terms of the importance, faithfulness, and soundness of the ablation study designs. Moreover, we demonstrate that current automated evaluation methods are not reliable for our task, as they show a significant discrepancy when compared to human assessment. To better investigate this, we develop AbGen-Eval, a meta-evaluation benchmark designed to assess the reliability of…
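The comparison between automated and human assessment described above can be illustrated with a small sketch. It assumes each ablation study design receives a numeric score per criterion (importance, faithfulness, soundness) from both a human expert and an automated evaluator, and it uses Pearson correlation as the agreement measure. The score format, the function names `pearson` and `agreement_by_criterion`, and the choice of Pearson correlation are all illustrative assumptions; the excerpt does not state which agreement metric the paper uses.

```python
from math import sqrt

# Criteria named in the abstract; scoring each one numerically is an assumption.
CRITERIA = ("importance", "faithfulness", "soundness")


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def agreement_by_criterion(human, automated):
    """Correlate human and automated scores separately for each criterion.

    `human` and `automated` are hypothetical inputs: parallel lists of dicts,
    one dict per generated ablation study design, keyed by criterion name.
    """
    return {
        c: pearson([h[c] for h in human], [a[c] for a in automated])
        for c in CRITERIA
    }
```

A low correlation on any criterion would indicate that the automated evaluator ranks designs differently from human experts, which is the kind of discrepancy the abstract reports.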

Taxonomy

Topics: Data Quality and Management