Am I More Pointwise or Pairwise? Revealing Position Bias in Rubric-Based LLM-as-a-Judge
Yuzheng Xu, Tosho Hirasawa, Tadashi Kozuno, Yoshitaka Ushiku

TL;DR
This paper uncovers position bias in rubric-based LLM evaluation, showing that score options' positions influence judgments, and proposes a permutation strategy to mitigate bias and improve reliability.
Contribution
The study reveals position bias in rubric-based LLM evaluation and introduces a permutation method to reduce bias and enhance correlation with human judgments.
Findings
Position bias is consistent across models and datasets.
Balanced permutation reduces position bias and improves evaluation accuracy.
Permutation-based calibration enhances alignment with human judgments.
Abstract
Large language models (LLMs) are now widely used to evaluate the quality of text, a field commonly referred to as LLM-as-a-judge. While prior works mainly focus on point-wise and pair-wise evaluation paradigms. Rubric-based evaluation, where LLMs select a score from multiple rubrics, has received less analysis. In this work, we show that rubric-based evaluation implicitly resembles a multi-choice setting and therefore has position bias: LLMs prefer score options appearing at specific positions in the rubric list. Through controlled experiments across multiple models and datasets, we demonstrate consistent position bias. To mitigate this bias, we propose a balanced permutation strategy that evenly distributes each score option across positions. We show that aggregating scores across balanced permutations not only reveals latent position bias, but also improves correlation between the…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Computational and Text Analysis Methods · Authorship Attribution and Profiling
