Reward Model Interpretability via Optimal and Pessimal Tokens
Brian Christian, Hannah Rose Kirk, Jessica A.F. Thompson, Christopher Summerfield, Tsvetomira Dumbalska

TL;DR
This paper analyzes reward models for language models by scoring every possible single-token response to value-laden prompts, revealing heterogeneity between models, asymmetries between high- and low-scoring tokens, and sensitivity to prompt framing.
Contribution
It introduces a comprehensive method to interpret reward models through exhaustive token analysis, uncovering biases and heterogeneity in their scoring behaviors.
Findings
Reward models show significant heterogeneity even when trained on similar objectives.
Models exhibit systematic asymmetries in how they encode high- vs low-scoring tokens.
Reward models are sensitive to prompt framing, reflecting human biases.
Abstract
Reward modeling has emerged as a crucial component in aligning large language models with human values. Significant attention has focused on using reward models as a means for fine-tuning generative models. However, the reward models themselves -- which directly encode human value judgments by turning prompt-response pairs into scalar rewards -- remain relatively understudied. We present a novel approach to reward model interpretability through exhaustive analysis of their responses across their entire vocabulary space. By examining how different reward models score every possible single-token response to value-laden prompts, we uncover several striking findings: (i) substantial heterogeneity between models trained on similar objectives, (ii) systematic asymmetries in how models encode high- vs low-scoring tokens, (iii) significant sensitivity to prompt framing that mirrors human…
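The exhaustive-scoring idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_reward`, `TOY_VOCAB`, `PROMPT`, and `rank_tokens` are hypothetical names, and the hand-written lexicon stands in for a real reward model, which would return a scalar for each prompt-response pair. The sketch scores every token in a vocabulary as a one-token response, then reads off the highest-scoring (optimal) and lowest-scoring (pessimal) tokens.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a reward model: maps (prompt, response) to a scalar.
# A real reward model would be a trained network; this lexicon is an assumption
# made purely so the example runs without external dependencies.
def toy_reward(prompt: str, response: str) -> float:
    lexicon = {
        "joy": 2.0, "love": 1.5, "fine": 0.2,
        "meh": -0.3, "pain": -1.5, "cruelty": -2.5,
    }
    return lexicon.get(response, 0.0)  # unknown tokens get a neutral score


def rank_tokens(
    prompt: str,
    vocab: List[str],
    reward_fn: Callable[[str, str], float],
    k: int = 3,
) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Score every token in `vocab` as a single-token response to `prompt`.

    Returns (optimal, pessimal): the k highest-scoring tokens, best first,
    and the k lowest-scoring tokens, worst first.
    """
    scored = [(tok, reward_fn(prompt, tok)) for tok in vocab]
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    optimal = ranked[:k]
    pessimal = ranked[-k:][::-1]
    return optimal, pessimal


# Hypothetical vocabulary and value-laden prompt, for illustration only.
TOY_VOCAB = ["joy", "love", "fine", "meh", "pain", "cruelty", "table"]
PROMPT = "In one word, what is the best thing in the world?"

best, worst = rank_tokens(PROMPT, TOY_VOCAB, toy_reward, k=2)
print("optimal: ", best)
print("pessimal:", worst)
```

With a real reward model, `vocab` would be the tokenizer's full vocabulary and `reward_fn` would wrap a forward pass; running the same ranking under differently framed prompts, or across several models, is one way the comparisons listed in the findings could be set up.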
Taxonomy
Methods: Softmax · Attention Is All You Need
