From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline
Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Tianhao Wu, Banghua Zhu, Joseph E. Gonzalez, Ion Stoica

TL;DR
This paper presents BenchBuilder, an automated pipeline leveraging LLMs to curate high-quality, challenging benchmarks from crowd-sourced datasets, enabling scalable, continuous evaluation of large language models with high alignment to human preferences.
Contribution
The introduction of BenchBuilder, an automated, LLM-based pipeline for scalable benchmark curation and evaluation, reducing manual effort and costs.
Findings
BenchBuilder successfully curated Arena-Hard-Auto with 500 challenging prompts.
The benchmark achieves 3x higher separation of model performances compared to MT-Bench.
It achieves 98.6% correlation with human preference rankings at a cost of $20.
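The two quantities named in these findings, separation between models and correlation with human preference rankings, can be illustrated with a small sketch. This page does not give the paper's exact metric definitions, so the code below rests on assumptions: it uses Spearman rank correlation as a stand-in for "correlation with human preference rankings" and the fraction of model pairs with non-overlapping score intervals as a stand-in for "separation". The function names and all numbers are hypothetical and are not results from the paper.

```python
from itertools import combinations


def spearman(xs, ys):
    """Spearman rank correlation for two equal-length score lists without ties.

    Assumption: a rank correlation is one reasonable way to compare a
    benchmark's model ordering with a human-preference ordering.
    """
    def ranks(values):
        # Rank 1 goes to the smallest value; ties are not handled.
        order = sorted(range(len(values)), key=lambda i: values[i])
        result = [0] * len(values)
        for rank, idx in enumerate(order, start=1):
            result[idx] = rank
        return result

    n = len(xs)
    rx, ry = ranks(xs), ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


def separability(intervals):
    """Fraction of model pairs whose (low, high) score intervals do not overlap.

    Assumption: non-overlapping intervals are used here as a simple proxy
    for a benchmark being able to tell two models apart.
    """
    pairs = list(combinations(intervals, 2))
    separated = sum(
        1 for (lo1, hi1), (lo2, hi2) in pairs if hi1 < lo2 or hi2 < lo1
    )
    return separated / len(pairs)


# Hypothetical scores for four models (illustrative only).
benchmark_scores = [82.0, 61.5, 47.0, 23.1]
human_scores = [1250, 1180, 1190, 1050]
score_intervals = [(80, 84), (59, 64), (45, 49), (46, 50)]

rank_agreement = spearman(benchmark_scores, human_scores)
pair_separation = separability(score_intervals)
```

With these made-up inputs, the benchmark and human orderings disagree on one adjacent pair of models, and one of the six model pairs has overlapping intervals, so neither value reaches 1.0.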
Abstract
The rapid evolution of Large Language Models (LLMs) has outpaced the development of model evaluation, highlighting the need for continuous curation of new, challenging benchmarks. However, manual curation of high-quality, human-aligned benchmarks is expensive and time-consuming. To address this, we introduce BenchBuilder, an automated pipeline that leverages LLMs to curate high-quality, open-ended prompts from large, crowd-sourced datasets, enabling continuous benchmark updates without a human in the loop. We apply BenchBuilder to datasets such as Chatbot Arena and WildChat-1M, extracting challenging prompts and utilizing LLM-as-a-Judge for automatic model evaluation. To validate benchmark quality, we propose new metrics to measure a benchmark's alignment with human preferences and ability to separate models. We release Arena-Hard-Auto, a benchmark consisting of 500 challenging prompts…
Taxonomy
Topics: Anomaly Detection Techniques and Applications
Methods: ALIGN
