Cognitive models can reveal interpretable value trade-offs in language models

Sonia K. Murthy; Rosie Zhao; Jennifer Hu; Sham Kakade; Markus Wulfmeier; Peng Qian; Tomer Ullman

arXiv:2506.20666·cs.CL·March 3, 2026

Open Access·3 Reviews

TL;DR

This paper demonstrates that cognitive models can be used to interpret and evaluate value trade-offs in language models, revealing how these trade-offs shift with prompting, reasoning budget, and post-training, which can inform better control of model behavior.

Contribution

It introduces a novel application of cognitive models to systematically analyze value trade-offs in language models across various settings.

Findings

01. Behavioral profiles shift predictably with goal prioritization

02. Small reasoning budgets amplify behavioral shifts

03. Post-training dynamics reveal early value shifts and influence of pretraining data

Abstract

Value trade-offs are an integral part of human decision-making and language use; however, current tools for interpreting such dynamic and multi-faceted notions of values in language models are limited. In cognitive science, so-called "cognitive models" provide formal accounts of such trade-offs in humans by modeling the weighting of a speaker's competing utility functions in choosing an action or utterance. Here, we show that a leading cognitive model of polite speech can be used to systematically evaluate alignment-relevant trade-offs in language models via two encompassing settings: degrees of reasoning "effort" and system prompt manipulations in closed-source frontier models, and RL post-training dynamics of open-source models. Our results show that LLMs' behavioral profiles under the cognitive model a) shift predictably when they are prompted to prioritize certain goals, b) are…
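To give some intuition for what "weighting a speaker's competing utility functions" can look like, here is a minimal sketch of a softmax speaker that trades off two utilities. It is not the paper's model or fitting procedure: the utterances, the utility numbers, the weight names (`informational`, `social`), and the rationality parameter `alpha` are all illustrative assumptions.

```python
import math

# Toy scenario: a friend's cake was actually bad.
# Utility components are made-up numbers (assumptions, not values from the paper).
UTILITIES = {
    "it was terrible": {"informational": 1.0, "social": -1.0},
    "it was okay": {"informational": 0.0, "social": 0.0},
    "it was amazing": {"informational": -1.0, "social": 1.0},
}


def speaker_probs(utilities, weights, alpha=3.0):
    """Return a softmax distribution over utterances.

    Each utterance is scored by a weighted sum of its utility components,
    scaled by a hypothetical rationality parameter `alpha`.
    """
    scores = {
        utt: alpha * sum(weights[name] * parts[name] for name in weights)
        for utt, parts in utilities.items()
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {utt: math.exp(score - top) for utt, score in scores.items()}
    total = sum(exps.values())
    return {utt: value / total for utt, value in exps.items()}


# Two hypothetical speakers that differ only in how they weight the utilities.
honest = speaker_probs(UTILITIES, {"informational": 0.9, "social": 0.1})
kind = speaker_probs(UTILITIES, {"informational": 0.1, "social": 0.9})

print(max(honest, key=honest.get))
print(max(kind, key=kind.get))
```

In this toy setup, shifting weight from the informational term to the social term moves probability mass from the blunt utterance toward the flattering one. Fitting such weights to observed choices is the kind of inference that yields an interpretable behavioral profile.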

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 8·Confidence 3

Strengths

- The authors propose creating a behavioral "signature" for sycophancy (hypothesized as high presentational utility but low informational and social utility). They find that when models are prompted with a purely "social" goal, their inferred parameters converge to this "sycophantic" signature. This has some practical use, given increasing worry around sycophancy.
- The experiments are quite thorough, using both open- and closed-source models, running a lot of ablations, and putting detailed res

Weaknesses

- Perhaps it's because I lack a proper cogsci background, but the specific framework from Yoon et al. that the authors borrowed wasn't fully clear to me. The overall approach made sense, however.
- The paper's conclusions about general LLM "value trade-offs" are based entirely on polite speech (i.e., judging a friend's cake or poem). The authors concede that these cognitive models "are often bespoke to the target domain" and "do not easily generalize to the open-ended nature of natura

Reviewer 02·Rating 8·Confidence 4

Strengths

- Very well articulated and formulated discussion of weaknesses (section 6).
- Important early step in LLM behavior analysis influenced by pre-existing cognitive models of behavior.

Weaknesses

- The experiment design as described arguably only probes the LLM's model of how it would expect others to behave in this scenario (i.e., this is a Theory of Mind task). It is unclear whether this directly predicts the model's overt behavior. Given appropriate analysis, this might be addressed by comparisons between LLM-as-judge, LLM-as-agent, and LLM-as-assistant perspectives, but analysis along this dimension seems to be absent.
- Section 5.2 provides p-values but does not specify the test being use

Reviewer 03·Rating 8·Confidence 3

Strengths

Although the idea of using cognitive models to study the behavior of LLMs is not new, the setting the authors have chosen to study and the model of choice serve as a great example of how one can use cognitive modeling to draw insights about LLMs. I find both the setting and model to be ecological for the study of value trade-offs in LLMs. The findings are intuitive and corroborate many existing findings. I also liked the fact that they considered both open and closed models, and used them a

Weaknesses

The RSA model explanation is currently a bit too dense, especially for an ML audience. It would be extremely helpful if the authors could dig into the model a bit more, with examples that give readers more intuition and an interpretation of what different values of the fitted parameters mean. This is especially important because the rest of the paper builds on the assumption that readers understand the model and its parameters well. The authors could have also motivated the datasets used for fine-

Taxonomy

Topics: Natural Language Processing Techniques · Semantic Web and Ontologies

Methods: Balanced Selection