Debiasing Reward Models by Representation Learning with Guarantees

Ignavier Ng; Patrick Blöbaum; Siddharth Bhandari; Kun Zhang; Shiva Kasiviswanathan

arXiv:2510.23751 · cs.LG · October 29, 2025

TL;DR

This paper introduces a theoretically grounded framework for training reward models that reduces reliance on spurious correlations, improving robustness in aligning language models with human preferences.

Contribution

The paper formulates a data-generating process with spurious and non-spurious latent variables, shows that the non-spurious variables can be theoretically identified from data, and gives a practical variational inference method for debiasing reward models.

Findings

01 Effectively mitigates spurious correlations in reward models

02 Improves robustness of language model alignment

03 Works on both synthetic and real-world datasets

Abstract

Recent alignment techniques, such as reinforcement learning from human feedback, have been widely adopted to align large language models with human preferences by learning and leveraging reward models. In practice, these models often exploit spurious correlations, involving, e.g., response length, discrimination, sycophancy, and conceptual bias, which is a problem that has received increasing attention. In this work, we propose a principled framework that mitigates these biases in reward models while preserving the underlying factors that reflect intended preferences. We first provide a formulation of the data-generating process, assuming that the observed data (e.g., text) is generated from both spurious and non-spurious latent variables. We show that, interestingly, these non-spurious latent variables can be theoretically identified from data, regardless of whether a surrogate for the…
