Reward Models Identify Consistency, Not Causality
Yuhui Xu, Hanze Dong, Lei Wang, Caiming Xiong, Junnan Li

TL;DR
This paper reveals that reward models for language models focus more on structural consistency and reasoning patterns than on actual causal correctness, highlighting a key limitation in current alignment techniques.
Contribution
The study demonstrates that reward models prioritize consistency over causality, challenging the common assumption that they rank outputs by correctness and suggesting the need for causality-aware reward modeling approaches.
Findings
Reward models rely heavily on structural consistency.
Removing problem statements minimally affects reward scores.
Disrupting reasoning flow significantly impacts reward outputs.
Abstract
Reward models (RMs) play a crucial role in aligning large language models (LLMs) with human preferences and enhancing reasoning quality. Traditionally, RMs are trained to rank candidate outputs based on their correctness and coherence. However, in this work, we present several surprising findings that challenge common assumptions about RM behavior. Our analysis reveals that state-of-the-art reward models prioritize structural consistency over causal correctness. Specifically, removing the problem statement has minimal impact on reward scores, whereas altering numerical values or disrupting the reasoning flow significantly affects RM outputs. Furthermore, RMs exhibit a strong dependence on complete reasoning trajectories: truncated or incomplete steps lead to significant variations in reward assignments, indicating that RMs primarily rely on learned reasoning patterns rather than explicit…
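The perturbations named in the abstract (dropping the problem statement, altering numerical values, disrupting the step order, truncating the trajectory) can be sketched as a small probing harness. This is only an illustration under stated assumptions, not the authors' protocol: the `Scorer` interface and all function names below are hypothetical, and any real reward model would need to be wrapped to match them.

```python
import random
import re
from typing import Callable, Dict, List

# Hypothetical interface: (question, reasoning steps) -> scalar reward.
Scorer = Callable[[str, List[str]], float]


def shuffle_steps(steps: List[str], seed: int = 0) -> List[str]:
    """Disrupt the reasoning flow by permuting the step order."""
    rng = random.Random(seed)
    shuffled = steps[:]  # copy so the caller's list is not mutated
    rng.shuffle(shuffled)
    return shuffled


def perturb_numbers(steps: List[str], seed: int = 0) -> List[str]:
    """Alter every integer in every step by adding a random offset in 1..9."""
    rng = random.Random(seed)

    def bump(match: "re.Match[str]") -> str:
        return str(int(match.group()) + rng.randint(1, 9))

    return [re.sub(r"\d+", bump, step) for step in steps]


def truncate_steps(steps: List[str], keep: int) -> List[str]:
    """Keep only the first `keep` steps of the trajectory."""
    return steps[:keep]


def reward_shifts(score: Scorer, question: str, steps: List[str],
                  seed: int = 0) -> Dict[str, float]:
    """Return (perturbed reward - original reward) for each perturbation."""
    base = score(question, steps)
    variants = {
        "no_question": ("", steps),
        "shuffled_steps": (question, shuffle_steps(steps, seed)),
        "perturbed_numbers": (question, perturb_numbers(steps, seed)),
        "truncated": (question, truncate_steps(steps, max(1, len(steps) // 2))),
    }
    return {name: score(q, s) - base for name, (q, s) in variants.items()}
```

Under this sketch, the pattern the abstract reports would show up as a shift near zero for `no_question` and larger shifts for the other three variants.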
Taxonomy
Topics: Economic Policies and Impacts
