BenchBench: Benchmarking Automated Benchmark Generation

Yandan Zheng; Haoran Luo; Zhenghong Lin; Wenjin Liu; Luu Anh Tuan

arXiv:2603.20807 · cs.CL · March 24, 2026

TL;DR

BenchBench is a three-stage pipeline and dataset for automated benchmark generation and validation, aimed at scalable, diverse, and reliable evaluation of large language models across multiple domains and modalities.

Contribution

The paper presents a three-stage pipeline for automated benchmark creation (extraction, generation, and validation), extending the evaluation of large language models from how well they answer benchmarks to how well they design them.

Findings

01. Generated 16.7K items across nine domains.

02. Benchmark-design ability is only moderately correlated with answer-time strength.

03. Invalidity negatively impacts discrimination in benchmark items.

Abstract

Benchmarks are the de facto standard for tracking progress in large language models (LLMs), yet static test sets can rapidly saturate, become vulnerable to contamination, and are costly to refresh. Scalable evaluation of open-ended items often relies on LLM judges, introducing additional sources of bias and prompt sensitivity. We argue that evaluation must extend beyond how well models answer benchmarks to how well models design them. We introduce BenchBench, a three-stage pipeline and dataset for benchmarking automated benchmark generation: (i) extract structured domain cards from seed benchmarks, (ii) prompt multiple designer LLMs to generate quota-controlled suites, and (iii) validate items with a multi-model answerer panel using exact/numeric/symbolic verifiers when possible and rubric-guided judging otherwise, yielding designer-answerer matrices with item-level quality flags and…
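The validation stage described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`verify`, `designer_answerer_matrix`), the item fields (`id`, `designer`, `reference`, `kind`, `invalid`), and the toy data are all hypothetical, and only exact and numeric verifiers are shown (symbolic verification and rubric-guided judging are left out). The sketch assumes that items flagged as invalid are excluded before accuracy is aggregated per designer and answerer.

```python
from collections import defaultdict
from math import isclose


def verify(answer: str, reference: str, kind: str, rel_tol: float = 1e-6) -> bool:
    """Check an answer against a reference with a simple verifier (hypothetical helper)."""
    if kind == "exact":
        # Case-insensitive string match after trimming whitespace.
        return answer.strip().lower() == reference.strip().lower()
    if kind == "numeric":
        try:
            return isclose(float(answer), float(reference), rel_tol=rel_tol)
        except ValueError:
            # An answer that cannot be parsed as a number counts as wrong.
            return False
    raise ValueError(f"unsupported verifier kind: {kind}")


def designer_answerer_matrix(items, responses):
    """Aggregate accuracy per (designer, answerer) pair, skipping flagged items.

    items: list of dicts with keys id, designer, reference, kind, invalid.
    responses: dict mapping (item_id, answerer) to the answer string.
    """
    by_id = {item["id"]: item for item in items}
    correct = defaultdict(int)
    total = defaultdict(int)
    for (item_id, answerer), answer in responses.items():
        item = by_id[item_id]
        if item["invalid"]:
            continue  # assumption: invalid items do not contribute to the matrix
        key = (item["designer"], answerer)
        total[key] += 1
        correct[key] += verify(answer, item["reference"], item["kind"])
    return {key: correct[key] / total[key] for key in total}


# Toy data, invented purely for illustration.
items = [
    {"id": "q1", "designer": "designer_a", "reference": "4", "kind": "numeric", "invalid": False},
    {"id": "q2", "designer": "designer_a", "reference": "Paris", "kind": "exact", "invalid": False},
    {"id": "q3", "designer": "designer_b", "reference": "0.5", "kind": "numeric", "invalid": True},
]
responses = {
    ("q1", "answerer_x"): "4.0",
    ("q2", "answerer_x"): "paris ",
    ("q1", "answerer_y"): "5",
    ("q2", "answerer_y"): "Paris",
    ("q3", "answerer_x"): "0.5",
}
matrix = designer_answerer_matrix(items, responses)
```

In this toy run, `matrix` holds one accuracy value per (designer, answerer) pair, and `designer_b` does not appear because its only item is flagged invalid. A fuller version would route items without a checkable reference to a rubric-guided judge instead of raising an error.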

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification