Investigating Non-Transitivity in LLM-as-a-Judge
Yi Xu, Laura Ruis, Tim Rocktäschel, Robert Kirk

TL;DR
This paper shows that LLM judges often exhibit non-transitive preferences, which makes model rankings sensitive to the choice of baseline model, and proposes round-robin and Swiss-Wise Iterative Matchmaking (SWIM) tournaments to make rankings more consistent.
Contribution
It uncovers non-transitivity in LLM judges and introduces tournament-based methods to enhance the reliability of model rankings in LLM evaluation.
Findings
LLM judges show non-transitive preferences, so rankings change with the choice of baseline model.
Round-robin tournaments improve correlation with benchmark rankings.
Swiss-Wise Iterative Matchmaking (SWIM) tournaments offer a computationally efficient alternative to full round-robin tournaments.
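To make the first finding concrete, here is a minimal sketch of how a non-transitive preference can be detected: it searches a set of pairwise judge verdicts for three-model cycles (A beats B, B beats C, C beats A). The function name `find_preference_cycles`, the dictionary layout, and the verdicts are hypothetical illustrations, not the authors' code or data.

```python
from itertools import combinations


def find_preference_cycles(prefers):
    """Return every 3-model cycle (x beats y, y beats z, z beats x).

    `prefers[(x, y)]` is True when the judge prefers model x over model y.
    This layout is an assumption made for the sketch.
    """
    models = sorted({m for pair in prefers for m in pair})
    cycles = []
    for a, b, c in combinations(models, 3):
        # A triangle can be cyclic in two orientations; check both.
        for x, y, z in ((a, b, c), (a, c, b)):
            if prefers.get((x, y)) and prefers.get((y, z)) and prefers.get((z, x)):
                cycles.append((x, y, z))
    return cycles


# Hypothetical verdicts: A > B, B > C, but C > A (a non-transitive loop).
verdicts = {
    ("A", "B"): True, ("B", "A"): False,
    ("B", "C"): True, ("C", "B"): False,
    ("C", "A"): True, ("A", "C"): False,
}
cycles = find_preference_cycles(verdicts)

# A fully transitive set of verdicts for comparison: A > B > C.
transitive = {
    ("A", "B"): True, ("B", "A"): False,
    ("B", "C"): True, ("C", "B"): False,
    ("A", "C"): True, ("C", "A"): False,
}
no_cycles = find_preference_cycles(transitive)
```

With a cycle like this, comparing every model against a single baseline can produce a different ordering for each choice of baseline, which is the sensitivity the finding describes.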
Abstract
Automatic evaluation methods based on large language models (LLMs) are emerging as the standard tool for assessing the instruction-following abilities of LLM-based agents. The most common method in this paradigm, pairwise comparisons with a baseline model, critically depends on the assumption of transitive preferences. However, the validity of this assumption remains largely unexplored. In this study, we investigate the presence of non-transitivity within the AlpacaEval framework and analyze its effects on model rankings. We find that LLM judges exhibit non-transitive preferences, leading to rankings that are sensitive to the choice of the baseline model. To mitigate this issue, we show that round-robin tournaments combined with Bradley-Terry models of preference can produce more reliable rankings. Notably, our method increases both the Spearman correlation and the Kendall correlation…
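As an illustration of how round-robin results can be turned into a single ranking with a Bradley-Terry model, the sketch below fits model strengths with the standard minorization-maximization update. The win counts, model names, and the function `fit_bradley_terry` are hypothetical, and this is not necessarily the fitting procedure the authors use.

```python
def fit_bradley_terry(wins, iterations=200):
    """Estimate Bradley-Terry strengths from a round-robin win matrix.

    `wins[i][j]` is the number of times model i was preferred over model j.
    Uses the classic MM update; strengths are normalized to sum to 1.
    """
    n = len(wins)
    strengths = [1.0] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Each opponent contributes (games played) / (p_i + p_j).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strengths[i] + strengths[j])
                for j in range(n) if j != i
            )
            updated.append(total_wins / denom if denom > 0 else strengths[i])
        norm = sum(updated)
        strengths = [s / norm for s in updated]
    return strengths


# Hypothetical round-robin with 10 comparisons per pair. The aggregate
# preferences are cyclic: A beats B 7-3, B beats C 6-4, C beats A 6-4.
names = ["A", "B", "C"]
wins = [
    [0, 7, 4],
    [3, 0, 6],
    [6, 4, 0],
]
strengths = fit_bradley_terry(wins)
ranking = [name for _, name in sorted(zip(strengths, names), reverse=True)]
```

Because the fit uses every pairing rather than comparisons against one baseline, it still yields a single ordering even when the raw preferences contain a cycle.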
Taxonomy
Topics: Legal Education and Practice Innovations · Legal Systems and Judicial Processes · Law, Economics, and Judicial Systems
