Why is Your Language Model a Poor Implicit Reward Model?
Noam Razin, Yong Lin, Jiarui Yao, Sanjeev Arora

TL;DR
This paper investigates why implicit reward models (IM-RMs) defined by language models generalize worse than explicit reward models (EX-RMs). It finds that IM-RMs rely more heavily on superficial token-level cues, which hurts their out-of-distribution performance.
Contribution
The study provides a theoretical and experimental analysis showing that IM-RMs depend more on superficial cues, explaining their poorer generalization compared to EX-RMs.
Findings
IM-RMs rely more on superficial token-level cues.
IM-RMs generalize worse under token-level distribution shifts.
Alternative hypotheses for the generalization gap are challenged.
Abstract
Reward models are key to language model post-training and inference pipelines. Conveniently, recent work showed that every language model defines an implicit reward model (IM-RM), without requiring any architectural changes. However, such IM-RMs tend to generalize worse, especially out-of-distribution, compared to explicit reward models (EX-RMs) that apply a dedicated linear head over the hidden representations of a language model. The existence of a generalization gap is puzzling, as EX-RMs and IM-RMs are nearly identical. They can be trained using the same data, loss function, and language model, and differ only in how the reward is computed. Toward a fundamental understanding of the implicit biases underlying different reward model types, we investigate the root cause of this gap. Our main finding, backed by theory and experiments, is that IM-RMs rely more heavily on superficial…
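To make the contrast in the abstract concrete, here is a minimal NumPy sketch of how the two reward types could be computed from the same hidden representations. It is an illustration under stated assumptions, not the authors' implementation: the implicit reward uses the DPO-style form β(log π(y|x) − log π_ref(y|x)), the explicit head reads the final token's hidden state, and all names (`explicit_reward`, `implicit_reward`, `beta`, the toy shapes and random weights) are hypothetical.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the last (vocabulary) axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def explicit_reward(last_hidden, head):
    # EX-RM: a dedicated linear head applied to a hidden representation.
    # Assumption: the head reads the final token's hidden state.
    return float(last_hidden @ head)

def sequence_log_prob(logits, response_ids):
    # Sum of log-probabilities assigned to the response tokens.
    # logits[t] are the next-token logits at the position predicting response_ids[t].
    log_probs = log_softmax(logits)
    return float(log_probs[np.arange(len(response_ids)), response_ids].sum())

def implicit_reward(policy_logits, ref_logits, response_ids, beta=0.1):
    # IM-RM (assumed DPO-style form): beta * (log pi(y|x) - log pi_ref(y|x)).
    return beta * (sequence_log_prob(policy_logits, response_ids)
                   - sequence_log_prob(ref_logits, response_ids))

# Toy setup with random weights (hypothetical sizes, no real model involved).
rng = np.random.default_rng(0)
T, V, D = 4, 10, 8                       # response length, vocab size, hidden dim
response_ids = np.array([3, 1, 7, 2])    # token ids of the response y
hidden = rng.normal(size=(T, D))         # hidden states shared by both reward types
unembed = rng.normal(size=(D, V))        # trained model's unembedding matrix
ref_unembed = rng.normal(size=(D, V))    # reference model's unembedding matrix
head = rng.normal(size=D)                # EX-RM linear head

r_ex = explicit_reward(hidden[-1], head)
r_im = implicit_reward(hidden @ unembed, hidden @ ref_unembed, response_ids)
print(f"EX-RM reward: {r_ex:.3f}, IM-RM reward: {r_im:.3f}")
```

In this sketch the explicit reward depends on the response only through a hidden state, whereas the implicit reward is indexed directly by the response's token ids. That difference in how the reward is computed is the one the abstract singles out.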
Peer Reviews
Decision: ICLR 2026 Poster
Sharp negative result against a popular hypothesis: verification ≠ generation for IM-RMs; the Hamiltonian task illustrates this cleanly. Mechanistic story that matches data. The gradient-level analysis predicts exactly where IM-RMs fail (paraphrases/translations), and the controlled Persona setup nails the failure mode. Breadth and consistency. Multiple model families (1B–8B) and both general-chat and math settings; token-shift brittleness shows up systematically. Practical takeaway. If your
Paraphrase pipeline dependence. Robustness claims rest heavily on how paraphrases/translations were produced. I’d like BLEU/chrF ranges, style diversity checks, and a sanity control where paraphrases are fed back through the teacher to confirm semantic equivalence. Limited exploration of mitigations. The paper diagnoses token-level sensitivity but doesn’t dig into cheap fixes (representation freezing, unembedding regularization, token-dropout on reward paths, contrastive paraphrase augmentation
- The paper addresses a well-defined and important problem: understanding the performance discrepancy between EX-RMs and IM-RMs, which are structurally very similar yet exhibit different generalization behaviors.
- The paper effectively challenges the "generation-verification gap" hypothesis. It proves theoretically that verification with an IM-RM does not require generation and demonstrates this empirically on a synthetic Hamiltonian cycle task.
- The paper’s claims are substantiated by a compr
- The primary theoretical analysis in Section 4 and Appendix B relies on simplifying assumptions that are violated in the main experiments, potentially limiting the direct applicability of the theory.
1) Assumption 1 posits that hidden representations are fixed during training. However, the empirical results are generated by training all reward model parameters. The paper notes that its conclusions still hold empirically, but does not fully bridge the gap to explain why the dynamics under the s
The strengths of the work are as follows.
- First, the problem is relevant: training strong reward models is important for LLM training.
- Second, there is a strong theoretical component, showing that there is a gap between EX-RMs and IM-RMs.
- Third, there is a strong empirical section that identifies the root cause.
The weaknesses of the work are as follows. First, there is no downstream performance analysis, and reward model performance does not always correlate with downstream performance. The work could benefit from a stronger empirical analysis that includes downstream model performance, although this is computationally challenging.
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
