Investigating Non-Transitivity in LLM-as-a-Judge

Yi Xu; Laura Ruis; Tim Rocktäschel; Robert Kirk

arXiv:2502.14074 · cs.AI · June 9, 2025

TL;DR

LLM-based judges often exhibit non-transitive preferences, which makes model rankings depend on the choice of baseline model. The paper proposes round-robin and Swiss-Wise tournaments to produce more consistent rankings.

Contribution

The paper documents non-transitivity in LLM judges and introduces tournament-based methods that make model rankings in LLM evaluation more reliable.

Findings

01

LLM judges show non-transitive preferences, which make rankings sensitive to the choice of baseline model.

02

Round-robin tournaments combined with Bradley-Terry models improve correlation with benchmark rankings.

03

Swiss-Wise tournaments offer a computationally efficient alternative.
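
Finding 01 can be illustrated with a small sketch. It is not the paper's code: the model names, the win rates, and the helper functions `find_cycles` and `rank_against_baseline` are all hypothetical, chosen only to show how a preference cycle makes a baseline-relative ranking disagree with head-to-head results.

```python
from itertools import permutations

# Hypothetical pairwise win rates: win_rate[(a, b)] is the fraction of
# prompts on which the judge prefers model a over model b.
observed = {("A", "B"): 0.7, ("B", "C"): 0.6, ("C", "A"): 0.6}
win_rate = dict(observed)
for (a, b), rate in observed.items():
    win_rate[(b, a)] = 1.0 - rate  # fill in the reverse direction


def find_cycles(models, win_rate):
    """Return each 3-model preference cycle a > b > c > a exactly once."""
    cycles = []
    for a, b, c in permutations(models, 3):
        if a != min(a, b, c):
            continue  # skip rotations of a cycle already considered
        if (win_rate[(a, b)] > 0.5 and win_rate[(b, c)] > 0.5
                and win_rate[(c, a)] > 0.5):
            cycles.append((a, b, c))
    return cycles


def rank_against_baseline(models, win_rate, baseline):
    """Rank the non-baseline models by their win rate against the baseline."""
    others = [m for m in models if m != baseline]
    return sorted(others, key=lambda m: win_rate[(m, baseline)], reverse=True)


models = ["A", "B", "C"]
cycles = find_cycles(models, win_rate)
ranking_vs_c = rank_against_baseline(models, win_rate, "C")
ranking_vs_a = rank_against_baseline(models, win_rate, "A")
```

With these made-up numbers, A beats B head to head, yet B ranks above A when C is the baseline, because B beats C while A loses to C. Changing the baseline to A ranks C above B even though B beats C directly.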

Abstract

Automatic evaluation methods based on large language models (LLMs) are emerging as the standard tool for assessing the instruction-following abilities of LLM-based agents. The most common method in this paradigm, pairwise comparisons with a baseline model, critically depends on the assumption of transitive preferences. However, the validity of this assumption remains largely unexplored. In this study, we investigate the presence of non-transitivity within the AlpacaEval framework and analyze its effects on model rankings. We find that LLM judges exhibit non-transitive preferences, leading to rankings that are sensitive to the choice of the baseline model. To mitigate this issue, we show that round-robin tournaments combined with Bradley-Terry models of preference can produce more reliable rankings. Notably, our method increases both the Spearman correlation and the Kendall correlation…
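
The combination of a round-robin tournament with a Bradley-Terry model mentioned in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the win counts are invented, `fit_bradley_terry` is a hypothetical helper, and the fitting procedure is the standard minorization-maximization update for Bradley-Terry strengths rather than whatever estimator the paper uses.

```python
def fit_bradley_terry(wins, iters=200):
    """Estimate Bradley-Terry strengths from a matrix of pairwise win counts.

    wins[i][j] is the number of comparisons in which model i was preferred
    over model j. Returns strengths normalized to sum to 1.
    """
    n = len(wins)
    strengths = [1.0] * n
    for _ in range(iters):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Each opponent contributes (games played) / (p_i + p_j).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strengths[i] + strengths[j])
                for j in range(n) if j != i
            )
            updated.append(total_wins / denom if denom > 0 else strengths[i])
        norm = sum(updated)
        strengths = [s / norm for s in updated]
    return strengths


# Hypothetical round-robin results over 10 prompts per pair (models 0, 1, 2).
# Majority preferences form a cycle: 0 beats 1, 1 beats 2, 2 beats 0.
wins = [
    [0, 7, 4],
    [3, 0, 6],
    [6, 4, 0],
]
strengths = fit_bradley_terry(wins)
ranking = sorted(range(len(wins)), key=lambda i: strengths[i], reverse=True)
```

Because every model plays every other model, the fitted strengths aggregate evidence from all pairs, so the resulting order does not hinge on a single baseline even when the majority preferences are cyclic as in this toy matrix.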

Taxonomy

Topics: Legal Education and Practice Innovations · Legal Systems and Judicial Processes · Law, Economics, and Judicial Systems