Arena-Lite: Efficient and Reliable Large Language Model Evaluation via Tournament-Based Direct Comparisons

Seonil Son; Ju-Min Oh; Heegon Jin; Cheolhun Jang; Jeongbeom Jeong; Kuntae Kim

arXiv:2411.01281 · cs.CL · October 29, 2025

TL;DR

Arena-Lite introduces a tournament-based evaluation method for large language models. By pitting models directly against each other instead of comparing each one to a baseline, it improves ranking reliability and reduces the number of required comparisons.

Contribution

The paper proposes a tournament structure for LLM evaluation that is more reliable and more efficient than existing baseline-mediated benchmarks.

Findings

01 Higher reliability in system rankings with fewer comparisons.

02 Effective even with smaller datasets or weaker judges.

03 Demonstrated through controlled stochastic modeling and empirical validation with a real LLM judge.

Abstract

As Large Language Models (LLMs) expand across domains, LLM judges have become essential for system evaluation. Current benchmarks typically compare system outputs against baselines. This baseline-mediated approach, though convenient, yields lower reliability than direct comparison between systems. We propose Arena-Lite, which layers a tournament structure on top of head-to-head comparison. Combining a tournament structure with direct comparison eliminates the need for baseline outputs, reduces the number of required comparisons, and yields more reliable system rankings. We conducted two experiments: (1) controlled stochastic modeling and (2) empirical validation with a real LLM judge. Together, these experiments demonstrate that Arena-Lite consistently achieves higher reliability with fewer comparisons, even with smaller datasets or weaker judges. We release an…
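
The comparison saving described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes a single-elimination bracket per prompt, a hypothetical stochastic judge that prefers the stronger system with an Elo-style logistic probability, and ranking by raw win counts. The names `judge`, `run_tournament`, `rank_by_wins`, and the skill values are invented for the example.

```python
import random
from collections import Counter


def judge(a, b, skill, rng):
    """Hypothetical stochastic judge (an assumption, not the paper's judge).

    Returns the preferred system; the stronger one wins with an
    Elo-style logistic probability based on the skill gap.
    """
    p_a = 1.0 / (1.0 + 10 ** ((skill[b] - skill[a]) / 400.0))
    return a if rng.random() < p_a else b


def run_tournament(systems, skill, rng):
    """Run one single-elimination bracket for a single prompt.

    Returns the list of matches as (system_a, system_b, winner).
    """
    bracket = list(systems)
    rng.shuffle(bracket)  # random seeding of the bracket
    matches = []
    while len(bracket) > 1:
        next_round = []
        if len(bracket) % 2 == 1:
            # Odd number of entrants: the last one advances with a bye.
            next_round.append(bracket.pop())
        for i in range(0, len(bracket), 2):
            a, b = bracket[i], bracket[i + 1]
            winner = judge(a, b, skill, rng)
            matches.append((a, b, winner))
            next_round.append(winner)
        bracket = next_round
    return matches


def rank_by_wins(systems, skill, n_prompts, seed=0):
    """Aggregate win counts over n_prompts brackets (assumed aggregation).

    Returns (ranking, total_comparisons).
    """
    rng = random.Random(seed)
    wins = Counter()
    total = 0
    for _ in range(n_prompts):
        matches = run_tournament(systems, skill, rng)
        total += len(matches)
        for _, _, winner in matches:
            wins[winner] += 1
    ranking = sorted(systems, key=lambda s: wins[s], reverse=True)
    return ranking, total


if __name__ == "__main__":
    # Hypothetical systems and skill values, for illustration only.
    systems = ["A", "B", "C", "D", "E"]
    skill = {"A": 1300, "B": 1200, "C": 1100, "D": 1000, "E": 900}
    ranking, total = rank_by_wins(systems, skill, n_prompts=200)
    print("ranking:", ranking)
    print("direct comparisons used:", total)
    print("baseline-mediated would use:", len(systems) * 200)
```

Each match eliminates exactly one system, so a bracket over n systems uses n - 1 comparisons per prompt, whereas comparing every system against a baseline output uses n. No baseline outputs are needed at any point in the sketch.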

Code & Models

Datasets

fgenie777/Arena-Lite-Experiments-Result-Data
dataset · 4 downloads

Videos

Arena-lite: Efficient and Reliable Large Language Model Evaluation via Tournament-Based Direct Comparisons · Underline

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Sparse Evolutionary Training