MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures

Jinjie Ni; Fuzhao Xue; Xiang Yue; Yuntian Deng; Mahir Shah; Kabir Jain; Graham Neubig; Yang You

arXiv:2406.06565 · cs.CL · October 15, 2024 · 2 cites

TL;DR

MixEval introduces a novel benchmark mixing approach that combines real-world user queries with existing benchmarks, providing a reliable, efficient, and dynamic evaluation method for large language models that correlates well with user-based assessments.

Contribution

The paper presents MixEval, a new paradigm for LLM evaluation that matches web-mined user queries with similar queries from existing benchmarks, aiming for evaluation that is more reliable, cheaper to run, and easier to update.

Findings

01. Achieves a 0.96 correlation with Chatbot Arena model rankings.

02. Takes about 6% of the time and cost of running MMLU (roughly a 94% reduction).

03. Enables dynamic and reproducible evaluation updates.
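To make the first finding concrete, the sketch below computes a rank correlation between a benchmark's model scores and an arena-style rating. It assumes Spearman's rank correlation, which this page does not specify, and every score in it is made up for illustration; none of the numbers come from the paper.

```python
def ranks(values):
    """Return 1-based ranks, highest value first. Assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation for two equal-length lists without ties."""
    n = len(xs)
    rx, ry = ranks(xs), ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


# Hypothetical scores for five models (illustrative only, not from the paper).
benchmark_scores = [71.2, 65.0, 58.3, 49.9, 44.1]
arena_ratings = [1250, 1190, 1205, 1100, 1050]

rho = spearman(benchmark_scores, arena_ratings)
print(f"rank correlation: {rho:.2f}")
```

In this toy data only the second and third models swap places between the two rankings, so the correlation stays high; a benchmark that orders models the same way users do would approach 1.0.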

Abstract

Evaluating large language models (LLMs) is challenging. Traditional ground-truth-based benchmarks fail to capture the comprehensiveness and nuance of real-world queries, while LLM-as-judge benchmarks suffer from grading biases and limited query quantity. Both of them may also become contaminated over time. User-facing evaluation, such as Chatbot Arena, provides reliable signals but is costly and slow. In this work, we propose MixEval, a new paradigm for establishing efficient, gold-standard LLM evaluation by strategically mixing off-the-shelf benchmarks. It bridges (1) comprehensive and well-distributed real-world user queries and (2) efficient and fairly-graded ground-truth-based benchmarks, by matching queries mined from the web with similar queries from existing benchmarks. Based on MixEval, we further build MixEval-Hard, which offers more room for model improvement. Our benchmarks'…
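The matching step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the abstract does not say which similarity measure is used, so a bag-of-words cosine similarity stands in for it here, and the function names (`bow`, `cosine`, `match_queries`) and all example queries are hypothetical.

```python
import math
import re
from collections import Counter


def bow(text):
    """Lowercased bag-of-words vector (a stand-in for a real text embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def match_queries(web_queries, benchmark_pool):
    """Map each web query to its most similar ground-truth benchmark query."""
    pool_vecs = [bow(q) for q in benchmark_pool]
    matches = {}
    for wq in web_queries:
        wv = bow(wq)
        best = max(range(len(benchmark_pool)), key=lambda i: cosine(wv, pool_vecs[i]))
        matches[wq] = benchmark_pool[best]
    return matches


# Hypothetical queries (illustrative only).
web_queries = [
    "what is the capital of france",
    "how do i reverse a list in python",
]
benchmark_pool = [
    "Which city is the capital of France?",
    "Write a Python function that reverses a list.",
    "What is the boiling point of water in Celsius?",
]

matches = match_queries(web_queries, benchmark_pool)
for wq, bq in matches.items():
    print(f"{wq!r} -> {bq!r}")
```

The point of the design is that the selected benchmark items follow the distribution of real user queries while each item still carries a ground-truth answer, so grading needs no human or LLM judge.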

Videos

MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures · SlidesLive

Taxonomy

Topics: Statistical and Computational Modeling · Business Process Modeling and Analysis · Digital Rights Management and Security