Preference Leakage: A Contamination Problem in LLM-as-a-judge
Dawei Li, Renliang Sun, Yue Huang, Ming Zhong, Bohan Jiang, Jiawei Han, Xiangliang Zhang, Wei Wang, Huan Liu

TL;DR
This paper identifies preference leakage, a contamination problem in LLM-based evaluation in which a judge LLM favors models trained on data from a related generator, and demonstrates the resulting bias empirically.
Contribution
It defines the preference leakage problem in LLM-as-a-judge, categorizes relatedness types, and empirically confirms bias across multiple benchmarks and models.
Findings
Preference leakage causes judges to favor related models.
Bias persists across different LLM baselines and benchmarks.
Preference leakage is more difficult to detect than previous biases.
Abstract
Large Language Models (LLMs) as judges and LLM-based data synthesis have emerged as two fundamental LLM-driven data annotation methods in model development. While their combination significantly enhances the efficiency of model training and evaluation, little attention has been given to the potential contamination brought by this new model development paradigm. In this work, we expose preference leakage, a contamination problem in LLM-as-a-judge caused by the relatedness between the synthetic data generators and LLM-based evaluators. To study this issue, we first define three common relatednesses between the data generator LLM and the judge LLM: being the same model, having an inheritance relationship, and belonging to the same model family. Through extensive experiments, we empirically confirm the bias of judges towards their related student models caused by preference leakage across…
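The three relatedness types and the judge bias described in the abstract can be illustrated with a short sketch. Everything here is an assumption made for illustration: the `Relatedness` enum, the `win_rate` and `win_rate_gap` helpers, and the verdict lists are hypothetical, and the gap is a plain difference in win rates, not the paper's $PLS$ metric.

```python
from enum import Enum


class Relatedness(Enum):
    """The three generator-judge relationships named in the abstract."""
    SAME_MODEL = "same model"
    INHERITANCE = "inheritance"
    SAME_FAMILY = "same model family"


def win_rate(verdicts):
    """Fraction of pairwise comparisons the student won (True = win)."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    return sum(verdicts) / len(verdicts)


def win_rate_gap(related_judge_verdicts, unrelated_judge_verdicts):
    """Win rate under a related judge minus win rate under an unrelated judge.

    A positive gap is consistent with the related judge favoring the student.
    This is a simple illustrative measure, not the paper's PLS metric.
    """
    return win_rate(related_judge_verdicts) - win_rate(unrelated_judge_verdicts)


# Hypothetical verdicts for one student model on the same four prompts.
related = [True, True, True, False]     # judge is the model that generated the training data
unrelated = [True, False, True, False]  # judge has no relation to the data generator
gap = win_rate_gap(related, unrelated)
print(Relatedness.SAME_MODEL.value, gap)
```

In practice the two verdict lists would come from running the same student outputs past a related and an unrelated judge; a gap near zero would suggest no measurable favoritism under this simplified measure.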
Peer Reviews
Decision·ICLR 2026 Poster
The primary strength is the paper's originality. The formalization of "preference leakage" as a distinct contamination vector, separate from both traditional data leakage and simple egocentric bias, is a novel and important conceptual contribution. Also, the empirical work is high-quality. The authors conduct a thorough and "full-factorial" investigation across multiple generators, students, and benchmarks. The further analyses (Section 5) are comprehensive and anticipate many of the reader's q
1. The paper's main weakness is the preliminary and somewhat disconnected nature of the mitigation analysis (Section 5.7). While the paper excels at diagnosing the problem, the treatment section feels like an add-on. The mitigation experiments use a different setup (new datasets like PPE/MTBench, new "Error Bias" metric) than the main experiments (Arena-Hard/AlpacaEval 2.0, $PLS$ metric). This makes it hard to connect the findings. For example, how does the best mitigation (Contextual Calibration
* The paper fills a gap in prior research on LLM-as-a-Judge biases (e.g., egocentric bias) by focusing on the subtle, synthetic data-mediated bias between related models, rather than simpler self-favoritism.
* The paper maintains good clarity in structure and expression. It logically organizes content from problem definition to experimental design, results, and mechanism analysis.
* Mechanistic Analysis Insufficiency: While the paper attributes leakage to "spurious features (style/format)" inherited by student models, it does not empirically validate these features. It could use feature attribution methods (e.g., SHAP, LIME) to identify which specific stylistic/formatting elements (e.g., sentence structure, terminology) drive M_J’s bias, or conduct ablation studies (e.g., paraphrasing synthetic data to remove style) to test if leakage diminishes.
* Real-World Impact Evi
1. Novel and important concept: The notion of preference leakage extends beyond existing bias categories and raises an underexplored but crucial reliability issue in LLM evaluation.
2. Systematic empirical validation: Comprehensive experiments using diverse generator-judge relations, datasets (Arena-Hard, AlpacaEval 2.0), and mitigation trials.
3. Strong methodological framework: Clear definitions of “relatedness” and formalization of leakage conditions make this paper theoretically grounded.
4
1. The discussion of features embedded in student models (Sec 5.5) is promising but under-analyzed; deeper probing or visualization (e.g., stylistic feature attribution) would enrich understanding.
2. Some family coverage is narrow; adding DeepSeek and Grok series could strengthen generality and industry relevance.
3. If “preference” is positioned as a form of affective bias, connections to affective-analysis or sentiment-evaluation benchmarks would contextualize it better.
4. While mitigation met
Taxonomy
Topics: Law, Economics, and Judicial Systems · Dispute Resolution and Class Actions · Artificial Intelligence in Law
Methods: Softmax · Attention Is All You Need
