Detecting Prefix Bias in LLM-based Reward Models

Ashwin Kumar; Yuzi He; Aram H. Markosyan; Bobbie Chern; Imanol Arrieta-Ibarra

arXiv:2505.13487 · cs.CL · November 3, 2025

TL;DR

This paper identifies and evaluates prefix bias in language model reward systems, revealing racial and gender biases across datasets and architectures, and proposes a mitigation strategy to improve fairness.

Contribution

Introduces novel methods to detect and evaluate prefix bias in LLM reward models, and proposes a data augmentation approach to mitigate such biases.

Findings

1. Significant racial and gender biases detected in reward models
2. Bias persists across different model architectures and datasets
3. Data augmentation reduces the impact of prefix bias

Abstract

Reinforcement Learning with Human Feedback (RLHF) has emerged as a key paradigm for task-specific fine-tuning of language models using human preference data. While numerous publicly available preference datasets provide pairwise comparisons of responses, the potential for biases in the resulting reward models remains underexplored. In this work, we introduce novel methods to detect and evaluate prefix bias -- a systematic shift in model preferences triggered by minor variations in query prefixes -- in LLM-based reward models trained on such datasets. We leverage these metrics to reveal significant biases in preference models across racial and gender dimensions. Our comprehensive evaluation spans diverse open-source preference datasets and reward model architectures, demonstrating susceptibility to this kind of bias regardless of the underlying model architecture. Furthermore, we propose…
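
To make the definition of prefix bias concrete, the sketch below measures how often a reward model's pairwise preference flips when a short prefix is prepended to the query. It is only an illustration, not the paper's metric: the `flip_rate` function, the `toy_reward` stand-in model, the example prefixes, and the sample pairs are all hypothetical, and the toy model's bias is built in deliberately so the effect is visible.

```python
from typing import Callable, Sequence, Tuple

# A reward model is assumed to map (query, response) to a scalar score.
RewardFn = Callable[[str, str], float]
Pair = Tuple[str, str, str]  # (query, response_a, response_b)


def prefers_a(reward: RewardFn, query: str, resp_a: str, resp_b: str) -> bool:
    """Return True if the reward model scores response A above response B."""
    return reward(query, resp_a) > reward(query, resp_b)


def flip_rate(reward: RewardFn, pairs: Sequence[Pair], prefix: str) -> float:
    """Fraction of pairs whose preferred response changes once `prefix` is prepended."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    flips = 0
    for query, resp_a, resp_b in pairs:
        base = prefers_a(reward, query, resp_a, resp_b)
        shifted = prefers_a(reward, f"{prefix} {query}", resp_a, resp_b)
        flips += base != shifted
    return flips / len(pairs)


def toy_reward(query: str, response: str) -> float:
    """Hypothetical stand-in for a trained reward model.

    It favours longer responses, and has a contrived quirk: one specific
    prefix makes it strongly favour responses containing "sorry".
    """
    score = float(len(response))
    if query.startswith("I am a woman.") and "sorry" in response.lower():
        score += 100.0
    return score


pairs = [
    ("How do I fix a flat tire?",
     "Use a patch kit and pump it back up.",
     "Sorry, ask a mechanic."),
    ("What is 2 + 2?", "The answer is 4.", "Sorry, no."),
]

# Compare flip rates across prefixes that differ only in one attribute.
rates = {p: flip_rate(toy_reward, pairs, p) for p in ("I am a woman.", "I am a man.")}
print(rates)
```

With a real reward model, `toy_reward` would be replaced by a call that scores a query and response pair. A large gap in flip rate between prefixes that differ only in a demographic attribute would be one possible sign of the kind of bias the abstract describes.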

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Ethics and Social Impacts of AI