Mitigating Self-Preference by Authorship Obfuscation

Taslim Mahbub; Shi Feng

arXiv:2512.05379·cs.CL·December 8, 2025

Open Access

TL;DR

This paper explores methods to reduce self-preference bias in language model judges by obfuscating authorship through simple perturbations, revealing challenges in fully eliminating the bias.

Contribution

It introduces black-box perturbation techniques to mitigate self-preference in LM evaluations and analyzes their effectiveness and limitations.

Findings

01 Synonym replacement reduces self-preference

02 Complete stylistic neutralization is challenging

03 Self-recognition occurs on multiple semantic levels

Abstract

Language model (LM) judges are widely used to evaluate the quality of LM outputs. Despite their many advantages, LM judges display concerning biases that can impair the integrity of their evaluations. One such bias is self-preference: LM judges prefer their own answers over those produced by other LMs or humans. The bias is hard to eliminate because frontier LM judges can distinguish their own outputs from those of others, even when the evaluation candidates are not labeled with their sources. In this paper, we investigate strategies to mitigate self-preference by reducing LM judges' ability to recognize their own outputs. We apply black-box perturbations to evaluation candidates in pairwise comparison to obfuscate authorship and reduce self-recognition. We find that perturbations as simple as synonym replacement for a few words predictably reduce self-preference. However, we also…
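
To make the idea in the abstract concrete, here is a minimal sketch of synonym-replacement perturbation followed by a pairwise self-preference measurement. It is an illustration under stated assumptions, not the authors' implementation: the synonym table, the names `perturb`, `self_preference_rate`, and `toy_judge`, the choice to perturb both candidates, and the alternation of candidate order are all hypothetical, and the toy judge merely stands in for a real LM judge.

```python
import random
import re
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical toy synonym table; a real setup would need a proper lexical resource.
SYNONYMS: Dict[str, List[str]] = {
    "quick": ["fast"],
    "shows": ["demonstrates"],
    "big": ["large"],
}

def perturb(text: str, max_swaps: int = 2, seed: int = 0) -> str:
    """Replace up to max_swaps words that have an entry in SYNONYMS."""
    rng = random.Random(seed)
    # Alternating runs of word and non-word characters, so joining restores the text.
    tokens = re.findall(r"\w+|\W+", text)
    candidates = [i for i, t in enumerate(tokens) if t.lower() in SYNONYMS]
    for i in rng.sample(candidates, min(max_swaps, len(candidates))):
        tokens[i] = rng.choice(SYNONYMS[tokens[i].lower()])
    return "".join(tokens)

# A judge takes two candidates and returns 0 if it prefers the first, 1 otherwise.
Judge = Callable[[str, str], int]

def self_preference_rate(
    pairs: List[Tuple[str, str]],
    judge: Judge,
    perturb_fn: Optional[Callable[[str], str]] = None,
) -> float:
    """Fraction of (own_answer, other_answer) pairs where the judge picks its own answer."""
    wins = 0
    for k, (own, other) in enumerate(pairs):
        if perturb_fn is not None:
            # Assumption: both candidates are perturbed before judging.
            own, other = perturb_fn(own), perturb_fn(other)
        # Alternate candidate order so a fixed position preference does not count as a win.
        if k % 2 == 0:
            wins += judge(own, other) == 0
        else:
            wins += judge(other, own) == 1
    return wins / len(pairs)

def toy_judge(a: str, b: str) -> int:
    """Stand-in judge: 'recognizes' its own style by the marker word 'quick'."""
    if "quick" in b and "quick" not in a:
        return 1
    return 0  # prefers a when it has the marker, and defaults to the first candidate

PAIRS = [("A quick test shows it", "A fast test proves it")] * 2

baseline = self_preference_rate(PAIRS, toy_judge)
obfuscated = self_preference_rate(PAIRS, toy_judge, perturb)
print(baseline, obfuscated)
```

In this toy setup the judge picks its own answer in every pair until the marker word is replaced, after which its choices follow candidate position only. A real judge would be queried through its API, and, as the abstract notes, simple perturbations may not remove every cue the judge relies on.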

Taxonomy

Topics: Authorship Attribution and Profiling · Topic Modeling · Artificial Intelligence in Healthcare and Education