From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline

Tianle Li; Wei-Lin Chiang; Evan Frick; Lisa Dunlap; Tianhao Wu; Banghua Zhu; Joseph E. Gonzalez; Ion Stoica

arXiv:2406.11939·cs.LG·October 16, 2024·6 cites

TL;DR

This paper presents BenchBuilder, an automated pipeline leveraging LLMs to curate high-quality, challenging benchmarks from crowd-sourced datasets, enabling scalable, continuous evaluation of large language models with high alignment to human preferences.

Contribution

The paper introduces BenchBuilder, an automated, LLM-based pipeline for scalable benchmark curation and evaluation that reduces manual effort and cost.

Findings

01. BenchBuilder curated Arena-Hard-Auto, a benchmark of 500 challenging prompts.

02. Arena-Hard-Auto achieves 3x higher separation of model performance than MT-Bench.

03. Arena-Hard-Auto achieves 98.6% correlation with human preference rankings, at a cost of $20.

Abstract

The rapid evolution of Large Language Models (LLMs) has outpaced the development of model evaluation, highlighting the need for continuous curation of new, challenging benchmarks. However, manual curation of high-quality, human-aligned benchmarks is expensive and time-consuming. To address this, we introduce BenchBuilder, an automated pipeline that leverages LLMs to curate high-quality, open-ended prompts from large, crowd-sourced datasets, enabling continuous benchmark updates without a human in the loop. We apply BenchBuilder to datasets such as Chatbot Arena and WildChat-1M, extracting challenging prompts and utilizing LLM-as-a-Judge for automatic model evaluation. To validate benchmark quality, we propose new metrics to measure a benchmark's alignment with human preferences and its ability to separate models. We release Arena-Hard-Auto, a benchmark consisting of 500 challenging prompts…
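The abstract names two properties a benchmark should have: alignment with human preferences and the ability to separate models. The sketch below illustrates one plausible way to quantify each, and it is not the paper's own definition of its metrics. It assumes alignment is measured as Spearman rank correlation between benchmark scores and human preference ratings, and separability as the fraction of model pairs whose score confidence intervals do not overlap. The function names `ranks`, `spearman`, and `separability`, and all input values in the usage note, are hypothetical.

```python
from itertools import combinations


def ranks(values):
    """Return 1-based ranks, highest value ranked 1, ties given their average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation, computed as Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def separability(intervals):
    """Fraction of model pairs whose (low, high) score intervals do not overlap.

    `intervals` maps a model name to its (low, high) confidence interval.
    This definition is an assumption made for illustration.
    """
    pairs = list(combinations(intervals.values(), 2))
    separated = sum(
        1
        for (low_a, high_a), (low_b, high_b) in pairs
        if high_a < low_b or high_b < low_a
    )
    return separated / len(pairs)
```

As a usage example with made-up numbers, `spearman([80.0, 65.0, 50.0, 30.0], [1250, 1200, 1210, 1100])` compares four models' benchmark scores against their human preference ratings, and `separability({"a": (75, 85), "b": (60, 70), "c": (45, 62), "d": (25, 35)})` counts how many of the six model pairs are distinguishable given their intervals.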

Taxonomy

Topics: Anomaly Detection Techniques and Applications

Methods: ALIGN