AutoBencher: Towards Declarative Benchmark Construction
Xiang Lisa Li, Farzaan Kaiyom, Evan Zheran Liu, Yifan Mai, Percy Liang, Tatsunori Hashimoto

TL;DR
AutoBencher is a declarative framework that automatically constructs benchmarks for language models, revealing new insights and vulnerabilities by optimizing dataset creation based on specified desiderata.
Contribution
It introduces a scalable, optimization-based approach for automatic benchmark construction using language models, enabling targeted evaluation of capabilities and safety.
Findings
Creates datasets that elicit 22% more model errors than existing benchmarks
Identifies knowledge gaps in models like Gemini-Pro and GPT-4o
Demonstrates effectiveness across math, multilingual, knowledge, and safety benchmarks
Abstract
We present AutoBencher, a declarative framework for automatic benchmark construction, and use it to scalably discover novel insights and vulnerabilities of existing language models. Concretely, given a few desiderata of benchmarks (e.g., question difficulty, topic salience), we operationalize each desideratum and cast benchmark creation as an optimization problem. Specifically, we experiment with two settings with different optimization objectives: (i) for capability evaluation, we declare the goal of finding a salient, difficult dataset that induces novel performance patterns; (ii) for safety evaluation, we declare the goal of finding a dataset of unsafe prompts that existing LMs fail to decline. To tackle this optimization problem, we use a language model to iteratively propose and refine dataset descriptions, which are then used to generate topic-specific questions and answers. These…
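The propose-and-refine loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `search_descriptions`, `toy_propose`, `toy_objective`, and the topic names and numbers are all hypothetical. In a real system the proposer would be a language model conditioned on earlier results, and the objective would come from running candidate LMs on generated questions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Candidate:
    description: str  # natural-language dataset description (e.g., a topic)
    score: float      # objective value; higher is better


def search_descriptions(
    propose: Callable[[List[Candidate]], List[str]],
    objective: Callable[[str], float],
    n_rounds: int = 3,
    keep: int = 2,
) -> List[Candidate]:
    """Iteratively propose descriptions, score them, and keep the best ones."""
    history: List[Candidate] = []
    for _ in range(n_rounds):
        seen = {c.description for c in history}
        # The proposer sees all previously scored candidates (the "refine" step).
        for desc in propose(history):
            if desc not in seen:
                history.append(Candidate(desc, objective(desc)))
                seen.add(desc)
    return sorted(history, key=lambda c: c.score, reverse=True)[:keep]


# --- Toy stand-ins (made-up values, for illustration only) ---
TOY_ACCURACY = {"topic-a": 0.9, "topic-b": 0.6, "topic-c": 0.4, "topic-d": 0.1}
TOY_SALIENCE = {"topic-a": 0.9, "topic-b": 0.7, "topic-c": 0.5, "topic-d": 0.1}


def toy_propose(history: List[Candidate]) -> List[str]:
    """Hypothetical proposer: returns up to two topics not yet tried."""
    tried = {c.description for c in history}
    return [t for t in TOY_ACCURACY if t not in tried][:2]


def toy_objective(desc: str) -> float:
    """Assumed capability objective: difficulty, subject to a salience floor."""
    if TOY_SALIENCE[desc] < 0.3:
        return float("-inf")  # reject topics that are not salient enough
    return 1.0 - TOY_ACCURACY[desc]  # harder topics score higher


best = search_descriptions(toy_propose, toy_objective, n_rounds=3, keep=2)
print([c.description for c in best])
```

For the safety setting, the same loop would apply with a different objective, for example the fraction of unsafe prompts in a candidate dataset that the evaluated LMs fail to decline; that substitution is likewise an assumption of this sketch.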
Peer Reviews
Decision: ICLR 2025 Poster
1. The problem of automatic benchmark generation in a guided manner is an important one. While LMs have been used as judges to automatically evaluate other LMs' answers, this work proposes using LMs to also generate questions.
2. The problem is formalized and packaged in an elegant and extensible way in the AutoBencher framework, and two important instances of the framework are studied. The two-step division (first generate topics and then generate datasets per topic) is especially novel and e
1. AutoBencher is currently stand-alone. The paper would be stronger if it integrated AutoBencher into existing popular evaluation frameworks like Stanford's HELM or HuggingFace's Open LLM. Adoption of AutoBencher by one of these frameworks would make a more convincing case for its usefulness and viability.
2. There seems to be a mismatch between the capabilities of the evaluating LM and the evaluated LMs: the former has access to tools whereas the latter does not. The paper does not make a con
- The paper conducts experiments across a wide range of knowledge domains.
- The presentation is generally clear, with well-structured sections and a logical flow.
- Equations and figures are used effectively to enhance clarity.
- The automatic generation of benchmarks is a topic of strong interest to the community.
- The proposed method demonstrates significant performance gains over human-generated benchmarks, particularly in terms of novelty, difficulty, and separability metrics as defined in
- It is not entirely clear to me what methodological novelties the paper introduces in dataset construction, aside from its use of retrieval during generation. Optimizing for various criteria objectives is not a particularly significant contribution.
- The introduction and abstract could benefit from revisions to be more specific and descriptive.
- The method and criteria rely on accuracy scores computed across both existing models and preexisting datasets, which can be computationally intensive
+ On novelty, the approach generates human interpretable topics where model ranks exhibit surprising results.
+ On safety, the qualitative examples align with the experience of this reviewer when trying to manually jail-break LLMs: pose the question as a hypothetical or philosophical debate; this vector being auto-discovered is encouraging.
+ Well-executed research methodology with manual validation of the discovered benchmarks
- The approach to the math and science categories suggests an open vocabulary problem that is not clearly tackled. For other categories, this is tackled via Wikipedia and a popularity metric.
- The translation induces an issue for low-resource languages: such languages are both less likely to be tackled by LLMs and by translation tools, creating a catch-22. (I feel it would be sufficient to acknowledge the issue as a limitation since tackling the issue requires significant manual effort, future w
Taxonomy
Topics: Natural Language Processing Techniques
