Rectifying Shortcut Behaviors in Preference-based Reward Learning

Wenqian Ye; Guangtao Zheng; Aidong Zhang

arXiv:2510.19050 · cs.AI · October 23, 2025

TL;DR

This paper introduces PRISM, a method to reduce shortcut behaviors in preference-based reward learning, improving model robustness and generalization in aligning language models with human preferences.

Contribution

The paper proposes a novel invariant kernel approach, PRISM, to mitigate shortcut exploitation in reward models, enhancing out-of-distribution performance and alignment robustness.

Findings

01. Improves reward model accuracy on out-of-distribution tasks.

02. Reduces dependency on shortcuts in downstream policies.

03. Establishes a robust framework for preference-based alignment.

Abstract

In reinforcement learning from human feedback, preference-based reward models play a central role in steering large language models toward human-aligned behavior. However, recent studies show that these reward models are prone to reward hacking and often generalize poorly due to over-optimization. They achieve high reward scores by exploiting shortcuts, that is, spurious features (e.g., response verbosity, agreeable tone, or sycophancy) that correlate with human preference labels in the training data rather than genuinely reflecting the intended objectives. In this paper, instead of probing these issues one at a time, we take a broader view of the reward hacking problem as shortcut behaviors and introduce a principled yet flexible approach to mitigate shortcut behaviors in preference-based reward learning. Inspired by the invariant theory in the kernel perspective, we propose…
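The shortcut problem described in the abstract can be illustrated with a toy sketch. This is not PRISM and not the authors' code. It assumes a linear reward model trained with the standard Bradley-Terry pairwise loss, which the excerpt above does not specify, and two made-up features per response, "quality" and "length". All names and numbers (`TRAIN_DIFFS`, `OOD_DIFFS`, `train`, `accuracy`) are hypothetical. In the training pairs the preferred response is both better and longer, so length is a spurious correlate of the label; in the shifted pairs the preferred response is better but shorter.

```python
import math

# Each pair is stored as the feature difference (chosen - rejected),
# with hypothetical features (quality, length).
# Training pairs: the preferred response is better AND longer.
TRAIN_DIFFS = [(1.0, 2.0), (0.5, 3.0), (0.2, 2.5), (0.8, 1.5)]
# Shifted pairs: the preferred response is better but SHORTER.
OOD_DIFFS = [(1.0, -2.0), (0.5, -3.0), (0.8, -1.5)]


def dot(w, d):
    return sum(wk * dk for wk, dk in zip(w, d))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bt_loss(w, diffs):
    """Mean Bradley-Terry loss: -log sigmoid(r(chosen) - r(rejected)).

    With a linear reward r(x) = w . x, the reward margin is w . (x_c - x_r).
    """
    return -sum(math.log(sigmoid(dot(w, d))) for d in diffs) / len(diffs)


def train(diffs, lr=0.1, steps=200):
    """Plain gradient descent on the Bradley-Terry loss, starting from w = 0."""
    w = [0.0, 0.0]
    for _ in range(steps):
        grad = [0.0, 0.0]
        for d in diffs:
            # Derivative of -log sigmoid(m) with respect to the margin m.
            coeff = sigmoid(dot(w, d)) - 1.0
            for k in range(2):
                grad[k] += coeff * d[k] / len(diffs)
        w = [w[k] - lr * grad[k] for k in range(2)]
    return w


def accuracy(w, diffs):
    """Fraction of pairs where the chosen response gets the higher reward."""
    return sum(1 for d in diffs if dot(w, d) > 0) / len(diffs)


w = train(TRAIN_DIFFS)
print("weights (quality, length):", w)
print("train accuracy:", accuracy(w, TRAIN_DIFFS))
print("shifted accuracy:", accuracy(w, OOD_DIFFS))
```

On this toy data the learned reward ranks every training pair correctly and every shifted pair incorrectly, because the length differences dominate the learned margin. That is the kind of shortcut reliance the abstract describes; how PRISM counteracts it is not covered by the truncated text above.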

Taxonomy

Topics: Reinforcement Learning in Robotics · Explainable Artificial Intelligence (XAI) · Topic Modeling