Re-evaluating Automatic LLM System Ranking for Alignment with Human Preference

Mingqi Gao; Yixin Liu; Xinyu Hu; Xiaojun Wan; Jonathan Bragg; Arman Cohan

arXiv:2501.00560·cs.CL·February 12, 2025

TL;DR

This paper critically re-evaluates automatic LLM system ranking methods, offering recommendations for component selection, revealing limitations in current benchmarks, and emphasizing the need for system-level evaluation to better align with human preferences.

Contribution

It provides a systematic analysis of automatic LLM benchers, offering guidelines for component choices and highlighting their limitations in ranking similar-performance models.

Findings

01

Component choices significantly affect ranking accuracy.

02

Automatic benchers struggle with similar-performance LLMs.

03

Instance-level model performance does not always predict bencher effectiveness.
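A system-level comparison asks whether a bencher orders whole systems the way humans do, rather than whether it agrees with humans on individual outputs. The sketch below is one possible way to measure that, using Kendall's tau (the tau-a variant) between two sets of per-system scores. The choice of metric, the function name `kendall_tau`, and the example scores are illustrative assumptions, not taken from the paper.

```python
from itertools import combinations


def kendall_tau(human_scores, bencher_scores):
    """Kendall's tau-a between two {system_name: score} mappings.

    Only systems present in both mappings are compared. Tied pairs count
    as neither concordant nor discordant.
    """
    systems = sorted(set(human_scores) & set(bencher_scores))
    concordant = 0
    discordant = 0
    for s, t in combinations(systems, 2):
        human_diff = human_scores[s] - human_scores[t]
        bencher_diff = bencher_scores[s] - bencher_scores[t]
        product = human_diff * bencher_diff
        if product > 0:
            concordant += 1  # both sources order the pair the same way
        elif product < 0:
            discordant += 1  # the two sources disagree on the pair
    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs if n_pairs else 0.0


# Hypothetical scores for three systems (made up for illustration).
human = {"A": 3.0, "B": 2.0, "C": 1.0}
bencher = {"A": 0.9, "B": 0.4, "C": 0.5}
tau = kendall_tau(human, bencher)
```

Here the bencher swaps the two closely scored systems B and C, so two of the three pairs agree and one disagrees. This is the kind of near-tie case the second finding describes as difficult.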

Abstract

Evaluating and ranking the capabilities of different LLMs is crucial for understanding their performance and alignment with human preferences. Due to the high cost and time-consuming nature of human evaluations, an automatic LLM bencher (i.e., an automatic evaluation framework that aims to rank LLMs based on their alignment with human preferences) is indispensable. An automatic LLM bencher consists of four components: the input set (e.g., a user instruction), the evaluation model (e.g., an LLM), the evaluation type (e.g., pairwise comparison), and the aggregation method (e.g., the ELO rating system). However, previous work has not thoroughly explored how to select these components or how their different combinations influence the results. In this work, through controlled experiments, we provide a series of recommendations on how to choose each component to better automate the evaluation…
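To make the aggregation component concrete, here is a minimal sketch of how pairwise comparison outcomes could be turned into a system ranking with a standard Elo update. It is not the authors' implementation: the function names, the K-factor of 32, the starting rating of 1000, and the example comparisons are all assumptions chosen for illustration.

```python
from collections import defaultdict


def elo_ratings(comparisons, k=32.0, base=1000.0):
    """Aggregate pairwise outcomes into Elo ratings.

    comparisons: iterable of (system_a, system_b, outcome), where outcome is
    1.0 if system_a is preferred, 0.0 if system_b is preferred, 0.5 for a tie.
    k and base are assumed defaults, not values from the paper.
    """
    ratings = defaultdict(lambda: base)
    for a, b, outcome in comparisons:
        # Expected score of a against b under the logistic Elo model.
        expected_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400.0))
        delta = k * (outcome - expected_a)
        ratings[a] += delta
        ratings[b] -= delta  # zero-sum update keeps the total rating constant
    return dict(ratings)


def rank_systems(ratings):
    """Return system names ordered from highest to lowest rating."""
    return sorted(ratings, key=ratings.get, reverse=True)


# Hypothetical judgments from an evaluation model over three systems.
comparisons = [("A", "B", 1.0), ("A", "C", 1.0), ("B", "C", 1.0)]
ratings = elo_ratings(comparisons)
ranking = rank_systems(ratings)
```

Sequential Elo updates depend on the order in which comparisons are processed, which is one reason the choice of aggregation method can matter for the final ranking.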

Taxonomy

Topics: Digital Rights Management and Security

Methods: Sparse Evolutionary Training · ALIGN