Re-evaluating Automatic LLM System Ranking for Alignment with Human Preference
Mingqi Gao, Yixin Liu, Xinyu Hu, Xiaojun Wan, Jonathan Bragg, Arman Cohan

TL;DR
This paper critically re-evaluates automatic LLM system ranking methods, offering recommendations for component selection, revealing limitations in current benchmarks, and emphasizing the need for system-level evaluation to better align with human preferences.
Contribution
The paper provides a systematic analysis of automatic LLM benchers, offers guidelines for choosing their components, and highlights their limitations when ranking models of similar performance.
Findings
Component choices significantly affect ranking accuracy.
Automatic benchers struggle with similar-performance LLMs.
Instance-level model performance does not always predict bencher effectiveness.
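The findings above refer to ranking accuracy at the system level. This summary does not say which agreement measure the paper uses, so the sketch below assumes one common choice, Kendall's tau between human and automatic system scores. The function name, system names, and scores are all hypothetical and purely illustrative.

```python
from itertools import combinations

def kendall_tau(human_scores, auto_scores):
    """Kendall's tau-a over systems present in both dicts (no tie correction)."""
    systems = sorted(set(human_scores) & set(auto_scores))
    concordant = discordant = 0
    for x, y in combinations(systems, 2):
        h = human_scores[x] - human_scores[y]
        a = auto_scores[x] - auto_scores[y]
        if h * a > 0:
            concordant += 1   # both orderings agree on this pair
        elif h * a < 0:
            discordant += 1   # the bencher flips this pair
    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs if n_pairs else 0.0

# Hypothetical system-level scores (not taken from the paper).
human = {"sys_a": 0.71, "sys_b": 0.64, "sys_c": 0.40}
auto = {"sys_a": 0.68, "sys_b": 0.70, "sys_c": 0.35}
tau = kendall_tau(human, auto)
print(f"Kendall tau: {tau:.3f}")
```

In this toy input the only flipped pair is the two systems whose human scores are closest, which is the kind of case the second finding describes.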
Abstract
Evaluating and ranking the capabilities of different LLMs is crucial for understanding their performance and alignment with human preferences. Due to the high cost and time-consuming nature of human evaluations, an automatic LLM bencher (i.e., an automatic evaluation framework that aims to rank LLMs based on their alignment with human preferences) is indispensable. An automatic LLM bencher consists of four components: the input set (e.g., a user instruction), the evaluation model (e.g., an LLM), the evaluation type (e.g., pairwise comparison), and the aggregation method (e.g., the ELO rating system). However, previous work has not thoroughly explored how to select these components or how their different combinations influence the results. In this work, through controlled experiments, we provide a series of recommendations on how to choose each component to better automate the evaluation…
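As a rough illustration of the aggregation component named in the abstract, the sketch below turns pairwise judgments into a system ranking with standard Elo updates. It is a minimal sketch, not the paper's implementation: the function name, the K-factor of 32, the starting rating of 1000, and the example judgments are all assumptions.

```python
from collections import defaultdict

def elo_rank(comparisons, k=32.0, base_rating=1000.0):
    """Aggregate pairwise judgments into system ratings with Elo updates.

    comparisons: iterable of (system_a, system_b, score_a), where score_a is
    1.0 if A was preferred, 0.0 if B was preferred, and 0.5 for a tie.
    """
    ratings = defaultdict(lambda: base_rating)
    for a, b, score_a in comparisons:
        # Expected score of A under the Elo logistic model.
        expected_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400.0))
        delta = k * (score_a - expected_a)
        ratings[a] += delta
        ratings[b] -= delta  # zero-sum update keeps the total rating constant
    ranking = sorted(ratings, key=ratings.get, reverse=True)
    return ranking, dict(ratings)

# Hypothetical judgments, e.g. produced by an evaluation model on an input set.
judgments = [
    ("sys_a", "sys_b", 1.0),
    ("sys_a", "sys_c", 1.0),
    ("sys_b", "sys_c", 1.0),
]
ranking, ratings = elo_rank(judgments)
print(ranking)
```

Sequential Elo updates depend on the order in which comparisons are processed, so practical setups often shuffle the judgments and average over several passes.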
Taxonomy
Topics: Digital Rights Management and Security
Methods: Sparse Evolutionary Training · ALIGN
