Debiasing Reward Models by Representation Learning with Guarantees
Ignavier Ng, Patrick Blöbaum, Siddharth Bhandari, Kun Zhang, Shiva Kasiviswanathan

TL;DR
This paper introduces a theoretically grounded framework for training reward models that reduces reliance on spurious correlations, improving robustness in aligning language models with human preferences.
Contribution
It provides a novel formulation and theoretical identification of non-spurious factors, along with a practical variational inference method for debiasing reward models.
Findings
Effectively mitigates spurious correlations in reward models
Improves robustness of language model alignment
Works on both synthetic and real-world datasets
Abstract
Recent alignment techniques, such as reinforcement learning from human feedback, have been widely adopted to align large language models with human preferences by learning and leveraging reward models. In practice, these models often exploit spurious correlations, involving, e.g., response length, discrimination, sycophancy, and conceptual bias, which is a problem that has received increasing attention. In this work, we propose a principled framework that mitigates these biases in reward models while preserving the underlying factors that reflect intended preferences. We first provide a formulation of the data-generating process, assuming that the observed data (e.g., text) is generated from both spurious and non-spurious latent variables. We show that, interestingly, these non-spurious latent variables can be theoretically identified from data, regardless of whether a surrogate for the…
