Factored Causal Representation Learning for Robust Reward Modeling in RLHF
Yupei Yang, Lin Yang, Wanxi Deng, Lin Qu, Fan Feng, Biwei Huang, Shikui Tu, Lei Xu

TL;DR
This paper introduces a causal factorization approach to improve reward models in RLHF, reducing reward hacking by isolating causal features and suppressing non-causal biases.
Contribution
It proposes a novel factored representation learning framework that enhances reward model robustness by disentangling causal and non-causal factors from contextual embeddings.
Findings
Learns more robust reward models on mathematical and dialogue tasks.
Improves downstream RLHF performance over state-of-the-art baselines.
Mitigates reward hacking behaviors related to response length and sycophantic bias.
Abstract
A reliable reward model is essential for aligning large language models with human preferences through reinforcement learning from human feedback. However, standard reward models are susceptible to spurious features that are not causally related to human labels. This can lead to reward hacking, where high predicted reward does not translate into better behavior. In this work, we address this problem from a causal perspective by proposing a factored representation learning framework that decomposes the model's contextual embedding into (1) causal factors that are sufficient for reward prediction and (2) non-causal factors that capture reward-irrelevant attributes such as length or sycophantic bias. The reward head is then constrained to depend only on the causal component. In addition, we introduce an adversarial head trained to predict reward from the non-causal factors, while applying…
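The factorization described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several parts are assumptions made only for illustration: the embedding is split by index into a causal and a non-causal slice (the paper learns this decomposition; the hypothetical `split_embedding` helper merely stands in for it), both heads are plain linear maps, and the preference loss is the standard Bradley-Terry pairwise loss, which the text above does not name. The adversarial training objective is truncated in the abstract, so the sketch only defines the adversarial head and does not train it.

```python
import math
import random


def split_embedding(h, causal_dim):
    """Hypothetical stand-in for the learned factorization:
    the first `causal_dim` entries are treated as causal factors,
    the rest as non-causal factors (e.g. length, sycophancy)."""
    return h[:causal_dim], h[causal_dim:]


def linear(weights, x):
    """Dot product of a weight vector with a feature vector."""
    return sum(w * v for w, v in zip(weights, x))


def reward(h, w_causal, causal_dim):
    """Reward head: reads only the causal component."""
    z_causal, _ = split_embedding(h, causal_dim)
    return linear(w_causal, z_causal)


def adversarial_reward(h, w_adv, causal_dim):
    """Adversarial head: tries to predict reward from the non-causal component."""
    _, z_noncausal = split_embedding(h, causal_dim)
    return linear(w_adv, z_noncausal)


def bradley_terry_loss(r_chosen, r_rejected):
    """Pairwise preference loss -log(sigmoid(r_chosen - r_rejected)),
    written in a numerically stable form (assumed loss, not from the source)."""
    d = r_chosen - r_rejected
    if d >= 0:
        return math.log1p(math.exp(-d))
    return -d + math.log1p(math.exp(d))


# Toy usage: perturbing only the non-causal slice leaves the reward unchanged.
rng = random.Random(0)
causal_dim = 3
h = [rng.gauss(0.0, 1.0) for _ in range(5)]
w_causal = [0.5, -1.0, 2.0]
w_adv = [1.0, 1.0]

h_perturbed = h[:causal_dim] + [v + 10.0 for v in h[causal_dim:]]
reward_is_invariant = (
    reward(h, w_causal, causal_dim) == reward(h_perturbed, w_causal, causal_dim)
)
adv_shift = (
    adversarial_reward(h_perturbed, w_adv, causal_dim)
    - adversarial_reward(h, w_adv, causal_dim)
)
```

Because the reward head never reads the non-causal slice, changes to attributes encoded there cannot move the predicted reward in this toy setup, while the adversarial head's output does shift. In the framework described above, the real work is in learning a decomposition for which this separation holds, which the index split here simply assumes.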
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Machine Learning in Healthcare
