BenchBench: Benchmarking Automated Benchmark Generation
Yandan Zheng, Haoran Luo, Zhenghong Lin, Wenjin Liu, Luu Anh Tuan

TL;DR
BenchBench introduces a comprehensive pipeline for automated benchmark generation and validation, enabling scalable, diverse, and reliable evaluation of large language models across multiple domains and modalities.
Contribution
BenchBench presents a three-stage pipeline for automated benchmark creation (extraction, generation, and validation), shifting evaluation from how well large language models answer benchmarks to how well they design them.
Findings
Generated 16.7K items across nine domains.
Benchmark-design ability is only moderately correlated with answer-time strength.
Item invalidity reduces discrimination: flawed items separate stronger from weaker answerers less well.
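The second finding is a statement about correlation between two per-model scores. As a minimal sketch of how such a comparison could be computed, the snippet below implements Spearman rank correlation in plain Python. The choice of Spearman is an assumption (this summary does not say which statistic the paper uses), and the score lists are invented purely for illustration; they are not results from the paper.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


# Hypothetical per-model scores, invented for illustration only.
design_score = [0.62, 0.71, 0.55, 0.80, 0.66]  # quality as a benchmark designer
answer_score = [0.70, 0.64, 0.58, 0.77, 0.81]  # accuracy as an answerer
rho = spearman(design_score, answer_score)
```

A rank-based statistic is used here because it depends only on how models are ordered on each axis, not on the scale of either score.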
Abstract
Benchmarks are the de facto standard for tracking progress in large language models (LLMs), yet static test sets can rapidly saturate, become vulnerable to contamination, and are costly to refresh. Scalable evaluation of open-ended items often relies on LLM judges, introducing additional sources of bias and prompt sensitivity. We argue that evaluation must extend beyond how well models answer benchmarks to how well models design them. We introduce BenchBench, a three-stage pipeline and dataset for benchmarking automated benchmark generation: (i) extract structured domain cards from seed benchmarks, (ii) prompt multiple designer LLMs to generate quota-controlled suites, and (iii) validate items with a multi-model answerer panel using exact/numeric/symbolic verifiers when possible and rubric-guided judging otherwise, yielding designer-answerer matrices with item-level quality flags and…
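Stage (iii) can be pictured as a dispatch step followed by an aggregation step. The sketch below is one possible reading, not the authors' implementation: the item fields (`id`, `designer`, `answer_type`, `gold`), the function names, and the `judge` callback are all hypothetical. Only exact and numeric verifiers are shown; a symbolic verifier is omitted because it would need a computer algebra library.

```python
from collections import defaultdict
from fractions import Fraction


def verify_exact(pred, gold):
    """Case- and whitespace-insensitive string match."""
    return pred.strip().lower() == gold.strip().lower()


def verify_numeric(pred, gold, tol=1e-6):
    """Compare answers as numbers; accepts forms such as '0.5' and '1/2'."""
    try:
        diff = float(Fraction(pred.strip())) - float(Fraction(gold.strip()))
    except (ValueError, ZeroDivisionError):
        return False
    return abs(diff) <= tol


# Hypothetical registry: answer types with a programmatic verifier.
VERIFIERS = {"exact": verify_exact, "numeric": verify_numeric}


def score(item, pred, judge=None):
    """Use a verifier when one exists; otherwise fall back to a judge callback."""
    verifier = VERIFIERS.get(item["answer_type"])
    if verifier is not None:
        return verifier(pred, item["gold"])
    if judge is None:
        raise ValueError("open-ended item requires a judge callback")
    return judge(item, pred)


def designer_answerer_matrix(items, responses, judge=None):
    """Mean correctness per (designer, answerer) pair.

    responses maps (item_id, answerer_name) to that answerer's prediction.
    """
    by_id = {item["id"]: item for item in items}
    totals = defaultdict(lambda: [0, 0])  # [num_correct, num_attempted]
    for (item_id, answerer), pred in responses.items():
        item = by_id[item_id]
        cell = totals[(item["designer"], answerer)]
        cell[0] += int(score(item, pred, judge))
        cell[1] += 1
    return {key: correct / n for key, (correct, n) in totals.items()}


# Toy data, invented for illustration only.
items = [
    {"id": "a", "designer": "D1", "answer_type": "numeric", "gold": "0.5"},
    {"id": "b", "designer": "D1", "answer_type": "exact", "gold": "Paris"},
]
responses = {
    ("a", "A1"): "1/2", ("b", "A1"): "paris ",
    ("a", "A2"): "0.4", ("b", "A2"): "Paris",
}
matrix = designer_answerer_matrix(items, responses)
```

Keeping the judge as an injected callback means the programmatic verifiers can be tested without any model calls, and the rubric-guided path is exercised only for items that lack a verifier.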
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
