Reward Models Inherit Value Biases from Pretraining

Brian Christian; Jessica A. F. Thompson; Elle Michelle Yang; Vincent Adam; Hannah Rose Kirk; Christopher Summerfield; Tsvetomira Dumbalska

arXiv:2601.20838·cs.LG·March 3, 2026


Open Access · 3 Reviews

TL;DR

Reward models inherit and reflect the value biases of the pretrained language models they are initialized from. This affects how well they align with human values and makes the choice of base model an important consideration in safety efforts.

Contribution

The paper demonstrates empirically that reward models inherit value biases from their base pretrained models: Llama-based RMs show a persistent preference for agency, and Gemma-based RMs a persistent preference for communion.

Findings

01. RMs show significant value biases based on their base models.

02. Logit differences can be used as implicit reward scores reflecting these biases.

03. Biases are durable across different training conditions.
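
As a rough illustration of the second finding, the sketch below scores each token by the difference in log-probability that two models assign to it. It is a minimal sketch under stated assumptions, not the authors' procedure: the function names (`log_softmax`, `implicit_reward`), the two toy "models", and all logit values are hypothetical, and which pair of models should be compared is not specified in this summary.

```python
import math


def log_softmax(logits):
    """Convert raw logits (dict: token -> float) into log-probabilities."""
    m = max(logits.values())  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits.values()))
    return {tok: v - log_z for tok, v in logits.items()}


def implicit_reward(logits_a, logits_b):
    """Per-token score: log p_a(token) - log p_b(token).

    Hypothetical helper; a positive score means model A favors the
    token more strongly than model B does.
    """
    lp_a = log_softmax(logits_a)
    lp_b = log_softmax(logits_b)
    return {tok: lp_a[tok] - lp_b[tok] for tok in lp_a if tok in lp_b}


# Toy, made-up logits over a three-token vocabulary (not real model outputs).
model_a = {"achieve": 2.0, "care": 0.5, "table": 0.0}
model_b = {"achieve": 0.5, "care": 2.0, "table": 0.0}

scores = implicit_reward(model_a, model_b)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

In this toy setup the two models share the same normalizer, so the score reduces to the raw logit difference; with real models the normalizers differ, which is why the sketch works in log-probability space.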

Abstract

Reward models (RMs) are central to aligning large language models (LLMs) with human values but have received less attention than pretrained and post-trained LLMs themselves. Because RMs are initialized from LLMs, they inherit representations that shape their behavior, but the nature and extent of this influence remain understudied. In a comprehensive study of 10 leading open-weight RMs using validated psycholinguistic corpora, we show that RMs exhibit significant differences along multiple dimensions of human value as a function of their base model. Using the "Big Two" psychological axes, we show a robust preference of Llama RMs for "agency" and a corresponding robust preference of Gemma RMs for "communion." This phenomenon holds even when the preference data and finetuning process are identical, and we trace it back to the logits of the respective instruction-tuned and pretrained…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

- The authors took an interdisciplinary approach to the question; we should indeed borrow existing domain knowledge from disciplines that study human values.
- The statistical tests are careful, e.g., in their FDR control.
- The framing of model difference as reward difference is an interesting viewpoint.

Weaknesses

Exhaustive token search might itself have limitations; I am not sure how the prompting scheme influences this process.

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- This paper offers some fundamental insights for the research community on choosing base LLMs for RM training, which is underexplored. A focus shift from pure performance metrics to more fine-grained details on value biases is much needed these days.
- The investigations done in the paper make sense and are quite novel, providing solid evidence of the inheritance traces of value biases. Experiments also cover diverse aspects.
- Clarity is excellent - clear motivation, adequate and in-depth disc

Weaknesses

- The RMs used in Sections 3 and 4 are quite small (2B and 3B), somewhat limiting the significance of results and the validity of relevant claims.
- Sections 3 and 4 focus on a binary value distinction between "Agency" and "Communion". This seems a bit arbitrary. It is also obvious that different types of LLMs (Llama vs. Gemma) would have systematic differences. I would assume that if I randomly choose two common value aspects to repeat the same investigations, I would observe different preferen

Reviewer 03 · Rating 2 · Confidence 3

Strengths

**S1.** The paper clearly traces moral-value biases (agency vs. communion) from the outputs of trained RMs back to the log-probabilities of the base pre-trained models, which provides a clear takeaway that the choice of base model for RM training is also a critical decision that will have downstream value implications.

**S2.** The paper evaluates multiple open-weight RMs based on psycholinguistic validation through controlled ablations on data and base model selections.

Weaknesses

**W1.** The central claim that reward models (RMs) inherit biases from their base pretrained LLMs already feels intuitive and largely expected, given prior research demonstrating bias propagation across fine-tuning and alignment stages [1, 2]. Therefore, this makes the contribution of the paper primarily observational since it does not provide mechanistic interpretability, analysis of latent representations, or deeper causal insight into why such biases emerge. Furthermore, it doesn't offer any


Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Mental Health via Writing