Debiasing Reward Models via Causally Motivated Inference-Time Intervention
Kazutoshi Shinoda, Kosuke Nishida, Kyosuke Nishida

TL;DR
This paper introduces a causally motivated intervention method to mitigate multiple biases in reward models at inference time, improving alignment without performance trade-offs.
Contribution
It proposes a neuron-level intervention technique that suppresses bias-related signals, addressing multiple biases simultaneously in reward models.
Findings
Reduces sensitivity to spurious features across diverse bias types.
Enables small RMs to improve LLM alignment, matching larger models' performance.
Bias signals are mainly encoded in neurons in early layers.
Abstract
Reward models (RMs) play a central role in aligning large language models (LLMs) with human preferences. However, RMs are often sensitive to spurious features such as response length. Existing inference-time approaches for mitigating these biases typically focus exclusively on response length, resulting in performance trade-offs. In this paper, we propose causally motivated intervention for mitigating multiple types of biases in RMs at inference time. Our method first identifies neurons whose activations are strongly correlated with predefined bias attributes, and applies neuron-level intervention that suppresses these signals. We evaluate our method on RM benchmarks and observe reductions in sensitivity to spurious features across diverse bias types, without inducing performance trade-offs. Moreover, when used for preference annotation, small RMs (2B and 7B) with our method, which…
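The two steps named in the abstract (identify neurons correlated with a bias attribute, then suppress them) can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption for illustration, not the authors' implementation: the function names `find_bias_neurons` and `suppress`, the use of absolute Pearson correlation as the selection score, the choice of `top_k`, and mean-replacement as the suppression rule. The activations are synthetic, with one neuron deliberately tied to response length.

```python
import numpy as np


def find_bias_neurons(activations, bias_attribute, top_k):
    """Rank neurons by |Pearson correlation| with a bias attribute.

    activations: (n_samples, n_neurons) array of hidden activations.
    bias_attribute: (n_samples,) array, e.g. response length.
    Returns (indices of the top_k neurons, per-neuron correlations).
    """
    acts = activations - activations.mean(axis=0)
    attr = bias_attribute - bias_attribute.mean()
    denom = np.linalg.norm(acts, axis=0) * np.linalg.norm(attr)
    safe = np.where(denom == 0, 1.0, denom)  # constant neurons get correlation 0
    corr = (acts.T @ attr) / safe
    order = np.argsort(-np.abs(corr))
    return order[:top_k], corr


def suppress(activations, neuron_idx, reference):
    """Assumed suppression rule: overwrite the selected neurons with a fixed
    reference value (here a per-neuron mean) so they carry no per-input signal."""
    out = activations.copy()
    out[:, neuron_idx] = reference[neuron_idx]
    return out


# Synthetic data (hypothetical): 200 responses, 8 neurons, neuron 3 tracks length.
rng = np.random.default_rng(0)
length = rng.integers(10, 500, size=200).astype(float)
acts = rng.normal(size=(200, 8))
z = (length - length.mean()) / length.std()
acts[:, 3] = 0.9 * z + 0.1 * rng.normal(size=200)

idx, corr = find_bias_neurons(acts, length, top_k=1)
cleaned = suppress(acts, idx, acts.mean(axis=0))
_, corr_after = find_bias_neurons(cleaned, length, top_k=1)
print("selected neuron:", idx, "corr before:", corr[idx], "after:", corr_after[idx])
```

In a real reward model the activations would be read from, and written back into, a chosen layer during the forward pass (for example through a framework hook); that part is omitted because the summary above does not specify it.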
