Mitigating Self-Preference by Authorship Obfuscation
Taslim Mahbub, Shi Feng

TL;DR
This paper explores methods to reduce self-preference bias in language model judges by obfuscating authorship through simple perturbations, revealing challenges in fully eliminating the bias.
Contribution
It introduces black-box perturbation techniques to mitigate self-preference in LM evaluations and analyzes their effectiveness and limitations.
Findings
Synonym replacement reduces self-preference
Complete stylistic neutralization is challenging
Self-recognition occurs on multiple semantic levels
Abstract
Language model (LM) judges are widely used to evaluate the quality of LM outputs. Despite many advantages, LM judges display concerning biases that can impair their integrity in evaluations. One such bias is self-preference: LM judges preferring their own answers over those produced by other LMs or humans. The bias is hard to eliminate because frontier LM judges can distinguish their own outputs from those of others, even when the evaluation candidates are not labeled with their sources. In this paper, we investigate strategies to mitigate self-preference by reducing the LM judges' ability to recognize their own outputs. We apply black-box perturbations to evaluation candidates in pairwise comparison to obfuscate the authorship and reduce self-recognition. We find that perturbations as simple as synonym replacement for a few words predictably reduce self-preference. However, we also…
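As a rough illustration of the kind of perturbation the abstract mentions, the sketch below swaps a few words for synonyms before a candidate would be shown to a judge. It is not the authors' implementation: the `SYNONYMS` table, the `perturb` function, and the `max_swaps` parameter are all hypothetical names chosen for this example, and a real setup would presumably draw on a thesaurus or a paraphrasing model rather than a hand-written dictionary.

```python
import random
import re

# Hypothetical toy synonym table (an assumption for illustration only).
SYNONYMS = {
    "big": ["large"],
    "quick": ["fast"],
    "shows": ["demonstrates"],
    "use": ["employ"],
}


def perturb(text, max_swaps=2, seed=0):
    """Replace up to `max_swaps` words in `text` with a synonym.

    The text is split into alternating word and non-word chunks so that
    spacing and punctuation are preserved when the chunks are rejoined.
    """
    rng = random.Random(seed)
    tokens = re.findall(r"\w+|\W+", text)

    # Positions of tokens that have an entry in the synonym table.
    candidates = [i for i, tok in enumerate(tokens) if tok.lower() in SYNONYMS]
    rng.shuffle(candidates)

    for i in candidates[:max_swaps]:
        replacement = rng.choice(SYNONYMS[tokens[i].lower()])
        # Keep the capitalization of the word being replaced.
        if tokens[i][0].isupper():
            replacement = replacement.capitalize()
        tokens[i] = replacement

    return "".join(tokens)


if __name__ == "__main__":
    answer = "The quick test shows a big gain."
    print(perturb(answer, max_swaps=2))
```

In a pairwise comparison, a function like this would be applied to the candidate texts before they are passed to the judge; which candidates are perturbed, and how many words are swapped, are choices this sketch leaves open.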
Taxonomy
Topics: Authorship Attribution and Profiling · Topic Modeling · Artificial Intelligence in Healthcare and Education
