Estimating the Error of Large Language Models at Pairwise Text Comparison
Tianyi Li

TL;DR
This paper introduces a method to estimate large language models' error rates in pairwise text comparison without ground truth. It reveals positional bias and poor scalability in LLM-based comparison, and compares six models' performance.
Contribution
The paper proposes a ground-truth-independent approach to measuring LLMs' error rates and positional bias in pairwise text comparison, which makes it possible to compare models on error and robustness.
Findings
Claude showed the most desirable performance considering error and robustness.
The method outperforms biased Bradley-Terry and commutativity score models.
Estimated error rates were consistent across different LLMs and input types.
Abstract
We measure LLMs' output error at pairwise text comparison, noting the probability of error in their preferences. Our method does not rely on the ground truth and supports two scenarios: (i) uniform error rate regardless of the order of comparison, estimated with two comparisons for each text pair with either text placed first; (ii) binary positional bias assuming distinct error rates for the two orders of comparison, estimated with repeated comparisons between the texts. The Copeland counting constructs a ranking over the compared texts from pairwise preferences; the ranking reveals the poor scalability of LLM-based pairwise comparison and helps yield the estimates for LLMs' error rates. We apply the method to six LLMs (ChatGPT, Claude, DeepSeek, Gemini, Grok, Qwen) with five types of text input and obtain consistent estimates of LLMs' error. In general, the measured two positional bias…
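The two ingredients named in the abstract, Copeland counting and comparing each pair in both orders, can be illustrated with a minimal Python sketch. This is not the paper's estimator. The function names, the data layout (a dict keyed by text pairs), and the convention of scoring wins minus losses are assumptions made for illustration. The second function only reports how often the two orderings disagree. Under the simplifying assumption of independent errors with a common rate e, that disagreement would occur with probability 2e(1 - e), but the paper's actual estimation procedure is not reproduced here.

```python
from collections import defaultdict


def copeland_ranking(preferences):
    """Rank texts by Copeland score (pairwise wins minus pairwise losses).

    `preferences` maps an unordered pair (a, b) to the preferred text.
    This layout is a hypothetical choice for the sketch.
    """
    scores = defaultdict(int)
    for (a, b), winner in preferences.items():
        if winner not in (a, b):
            raise ValueError(f"winner {winner!r} is not in pair {(a, b)!r}")
        loser = b if winner == a else a
        scores[winner] += 1
        scores[loser] -= 1
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))


def order_disagreement_rate(forward, backward):
    """Fraction of pairs whose preferred text changes when the order is swapped.

    `forward[(a, b)]` is the winner with `a` placed first;
    `backward[(a, b)]` is the winner with `b` placed first.
    """
    pairs = list(forward)
    flips = sum(forward[p] != backward[p] for p in pairs)
    return flips / len(pairs)


# Toy data: three texts, each pair compared once in each order.
forward = {("A", "B"): "A", ("A", "C"): "A", ("B", "C"): "B"}
backward = {("A", "B"): "B", ("A", "C"): "A", ("B", "C"): "B"}

ranking = copeland_ranking(forward)
disagreement = order_disagreement_rate(forward, backward)
```

In the toy data, the forward comparisons give scores A = 2, B = 0, C = -2, and one of the three pairs flips when the order is swapped.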
