Automating Benchmark Design

Amanda Dsouza; Harit Vishwakarma; Zhengyang Qi; Justin Bauer; Derek Pham; Thomas Walshe; Armin Parchami; Frederic Sala; Paroma Varma

arXiv:2510.25039·cs.SE·October 30, 2025

3 Reviews

TL;DR

This paper introduces BeTaL, an LLM-in-the-loop framework that automates the design of dynamic benchmarks for evaluating LLMs, achieving more accurate difficulty levels than traditional methods.

Contribution

We develop a novel framework that automates dynamic benchmark creation using LLM reasoning, enabling cost-efficient and precise difficulty tuning.

Findings

01

BeTaL creates benchmarks with difficulty levels within 5.3% to 13.2% of targets.

02

BeTaL outperforms baselines by 2-4 times in benchmark difficulty accuracy.

03

Extended a popular agentic benchmark, demonstrating BeTaL's versatility.

Abstract

The rapid progress and widespread deployment of LLMs and LLM-powered agents have outpaced our ability to evaluate them. Hand-crafted, static benchmarks are the primary tool for assessing model capabilities, but these quickly become saturated. In contrast, dynamic benchmarks evolve alongside the models they evaluate, but are expensive to create and continuously update. To address these challenges, we develop BeTaL (Benchmark Tuning with an LLM-in-the-loop), a framework that leverages environment design principles to automate the process of dynamic benchmark design. BeTaL works by parameterizing key design choices in base benchmark templates and using LLMs to reason through the resulting parameter space to obtain target properties (such as difficulty and realism) in a cost-efficient manner. We validate this approach on its ability to create benchmarks with desired difficulty levels. Using…
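The tuning loop described in the abstract (parameterize a template, evaluate a target model, adjust the parameters toward a target difficulty) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the single `difficulty_knob` parameter, the `simulated_accuracy` function standing in for running a real target model on generated tasks, and the rule-based `propose` function standing in for the LLM designer. The simulation is deterministic so the example stays reproducible; a real evaluation would be noisy.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Params:
    # Hypothetical single design parameter of a benchmark template (1..20).
    difficulty_knob: int


def simulated_accuracy(params: Params) -> float:
    """Stand-in for evaluating the target model on tasks built from params.

    Assumption: accuracy falls linearly as the knob increases.
    """
    return (20 - params.difficulty_knob) / 20


def propose(params: Params, observed: float, target: float) -> Params:
    """Stand-in for the LLM designer: nudge the knob toward the target.

    A real designer would reason over many parameters and past feedback;
    this rule only moves one knob by one step.
    """
    knob = params.difficulty_knob
    if observed > target:      # benchmark too easy, make it harder
        knob += 1
    elif observed < target:    # benchmark too hard, make it easier
        knob -= 1
    return Params(min(20, max(1, knob)))


def tune(target: float, rounds: int = 15) -> Tuple[Params, float]:
    """Return the parameters with the smallest |observed - target| gap."""
    params = Params(difficulty_knob=1)
    best_params, best_gap = params, float("inf")
    for _ in range(rounds):
        observed = simulated_accuracy(params)
        gap = abs(observed - target)
        if gap < best_gap:
            best_params, best_gap = params, gap
        params = propose(params, observed, target)
    return best_params, best_gap


if __name__ == "__main__":
    best_params, best_gap = tune(target=0.5)
    print(best_params, best_gap)
```

The sketch keeps the best parameters seen across rounds rather than the last ones, since a feedback-driven designer can overshoot the target before settling.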

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

The authors define the "benchmark design process as an optimization problem" and demonstrate that their "empirical results show BeTaL consistently obtains benchmarks with any given target difficulty, achieving a performance gap of as low as 0.4% and up to 5% in several settings, a significant improvement over baselines." This is an important contribution, leading to environment setups that are well suited to a specific performance level and can be used to test a variety of different models. Suc

Weaknesses

While the paper provides a valuable contribution toward a benchmark that contains tasks with different complexity levels, I believe that, to strengthen the authors' claims about its universality and adaptability, the benchmarks should have been tested more on target models of different sizes. It is a bit unclear to me what the difference is between the 'target' and 'evaluating' models provided by the authors in Section 4.3, stating that 'We use o4-mini as the target model in all the settings. We finally evaluat

Reviewer 02 · Rating 4 · Confidence 3

Strengths

The paper tackles a clear and significant problem for the community: the saturation of static evaluation benchmarks and the high cost of manually updating dynamic ones. The goal of automating this process is well-motivated and valuable.

Weaknesses

1. The framework's primary weakness is its reliance on access to parameterized and verifiable simulators. This assumption is extremely strong and does not hold for many, if not most, complex and realistic evaluation domains. While feasible for the toy-like "Arithmetic Sequences" or "Spatial Reasoning" grid world, this is a critical bottleneck for applying BeTaL to open-ended domains like agentic web tasks, robotics, or complex code generation, where a verifiable simulator is often as hard to bui

Reviewer 03 · Rating 4 · Confidence 3

Strengths

The idea to extend UED to automate full benchmark design with LLMs is interesting, and an important problem.

Weaknesses

The environments being evolved are very toyish. The idea of UED, as I understand it, is to automatically create environments/tasks at the right level of difficulty for agents/policies to perform RL on. If the task is too easy or too difficult, the target agent/policy will not benefit from training in it. BeTaL attempts to solve the designer problem of creating tasks at the right difficulty; however, it then does not attempt to have weaker models/policies train on those tasks, starting from the t
