Rectifying Shortcut Behaviors in Preference-based Reward Learning
Wenqian Ye, Guangtao Zheng, Aidong Zhang

TL;DR
This paper introduces PRISM, a method to reduce shortcut behaviors in preference-based reward learning, improving model robustness and generalization in aligning language models with human preferences.
Contribution
The paper proposes a novel invariant kernel approach, PRISM, to mitigate shortcut exploitation in reward models, enhancing out-of-distribution performance and alignment robustness.
Findings
Improves reward model accuracy on out-of-distribution tasks.
Reduces dependency on shortcuts in downstream policies.
Establishes a robust framework for preference-based alignment.
Abstract
In reinforcement learning from human feedback, preference-based reward models play a central role in aligning large language models with human preferences. However, recent studies show that these models are prone to reward hacking and often fail to generalize well due to over-optimization. They achieve high reward scores by exploiting shortcuts, that is, spurious features (e.g., response verbosity, agreeable tone, or sycophancy) that correlate with human preference labels in the training data but do not genuinely reflect the intended objectives. In this paper, instead of probing these issues one at a time, we take a broader view that treats reward hacking as shortcut behavior, and we introduce a principled yet flexible approach to mitigate such behavior in preference-based reward learning. Inspired by the invariant theory in the kernel perspective, we propose…
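To make the shortcut problem concrete, the following is a minimal sketch and not the paper's method (the abstract excerpt is truncated before PRISM is described). It assumes the common Bradley-Terry pairwise loss for preference-based reward models, which the excerpt does not name, and uses a hypothetical `length_only_reward` that scores responses by word count alone. The toy preference pairs are invented for illustration.

```python
import math


def pairwise_preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood: -log(sigmoid(r_chosen - r_rejected)).

    Assumption: this is the standard pairwise loss used for many reward
    models; the source excerpt does not state which loss the paper uses.
    """
    margin = r_chosen - r_rejected
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def length_only_reward(response: str) -> float:
    """Hypothetical shortcut reward: depends only on verbosity (word count)."""
    return 0.1 * len(response.split())


def accuracy(reward_fn, pairs) -> float:
    """Fraction of (chosen, rejected) pairs where the chosen response scores higher."""
    return sum(reward_fn(c) > reward_fn(r) for c, r in pairs) / len(pairs)


# Invented training pairs in which the preferred answer happens to be longer.
train_pairs = [
    ("Paris is the capital of France, located on the Seine.", "Lyon."),
    ("Water boils at 100 degrees Celsius at sea level.", "At 50."),
]

# Invented out-of-distribution pair in which the preferred answer is shorter.
ood_pairs = [
    ("Paris.", "I believe it might possibly be Lyon, a lovely city."),
]

if __name__ == "__main__":
    print("train accuracy:", accuracy(length_only_reward, train_pairs))
    print("OOD accuracy:  ", accuracy(length_only_reward, ood_pairs))
    for chosen, rejected in train_pairs + ood_pairs:
        loss = pairwise_preference_loss(
            length_only_reward(chosen), length_only_reward(rejected)
        )
        print(f"loss = {loss:.3f}")
```

In this toy setup the verbosity shortcut ranks every training pair correctly yet misranks the pair where the correlation between length and preference is reversed, which is the kind of failure the abstract attributes to spurious features.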
Taxonomy
Topics: Reinforcement Learning in Robotics · Explainable Artificial Intelligence (XAI) · Topic Modeling
