Cognitive models can reveal interpretable value trade-offs in language models
Sonia K. Murthy, Rosie Zhao, Jennifer Hu, Sham Kakade, Markus Wulfmeier, Peng Qian, Tomer Ullman

TL;DR
This paper demonstrates that cognitive models can be used to interpret and evaluate value trade-offs in language models, revealing how they shift with prompts, training, and other factors, aiding better control of model behavior.
Contribution
It introduces a novel application of cognitive models to systematically analyze value trade-offs in language models across various settings.
Findings
Behavioral profiles shift predictably with goal prioritization
Small reasoning budgets amplify behavioral shifts
Post-training dynamics reveal early value shifts and influence of pretraining data
Abstract
Value trade-offs are an integral part of human decision-making and language use; however, current tools for interpreting such dynamic and multi-faceted notions of values in language models are limited. In cognitive science, so-called "cognitive models" provide formal accounts of such trade-offs in humans by modeling the weighting of a speaker's competing utility functions in choosing an action or utterance. Here, we show that a leading cognitive model of polite speech can be used to systematically evaluate alignment-relevant trade-offs in language models via two encompassing settings: degrees of reasoning "effort" and system prompt manipulations in closed-source frontier models, and RL post-training dynamics of open-source models. Our results show that LLMs' behavioral profiles under the cognitive model a) shift predictably when they are prompted to prioritize certain goals, b) are…
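To give some intuition for what "weighting a speaker's competing utility functions" can mean, here is a minimal sketch of a softmax speaker that trades off two utilities. It is not the paper's model or code: the utterances, the utility numbers, the mixing weight `phi`, and the rationality parameter `alpha` are all illustrative assumptions, and only two of the utilities mentioned on this page (informational and social) are included.

```python
import math

# Hypothetical utilities for three things a speaker might say about a
# mediocre cake. These numbers are illustrative assumptions only.
UTTERANCES = {
    "it was terrible": {"informational": 0.6, "social": 0.0},
    "it was okay": {"informational": 1.0, "social": 0.5},
    "it was amazing": {"informational": 0.1, "social": 1.0},
}


def speaker_distribution(phi, alpha=5.0):
    """Return a softmax distribution over utterances.

    phi   -- assumed weight on informational utility (1 - phi goes to social)
    alpha -- assumed rationality (inverse temperature) parameter
    """
    scores = {
        utt: alpha * (phi * u["informational"] + (1.0 - phi) * u["social"])
        for utt, u in UTTERANCES.items()
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {utt: math.exp(s - top) for utt, s in scores.items()}
    total = sum(exps.values())
    return {utt: e / total for utt, e in exps.items()}


def most_likely(phi):
    """Utterance with the highest probability under the given weight."""
    dist = speaker_distribution(phi)
    return max(dist, key=dist.get)


if __name__ == "__main__":
    for phi in (0.0, 0.5, 1.0):
        print(phi, most_likely(phi))
```

In this toy setup, a speaker that weights only informational utility prefers the accurate utterance, while one that weights only social utility prefers the flattering one. Fitting a weight like `phi` to observed choices is the kind of inference that lets a trade-off be read off behavior.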
Peer Reviews
Decision: ICLR 2026 Poster
- The authors propose creating a behavioral "signature" for sycophancy (hypothesized as high presentational utility but low informational and social utility). They find that when models are prompted with a purely "social" goal, their inferred parameters converge to this "sycophantic" signature. This has some practical use, given increasing worry around sycophancy.
- The experiments are quite thorough, using both open- and closed-source models, running a lot of ablations, and putting detailed res
- Perhaps it's because I lack a proper cogsci background, but the specific framework used by Yoon et al. that was borrowed by the authors wasn't fully clear to me. The overall approach made sense, however.
- The paper's conclusions about general LLM "value trade-offs" are based entirely on polite speech (i.e., judging a friend's cake or poem). The authors concede that these cognitive models "are often bespoke to the target domain" and "do not easily generalize to the open-ended nature of natura
- Very well articulated and formulated discussion of weaknesses (section 6).
- Important early step in LLM behavior analysis influenced by pre-existing cognitive models of behavior.
- The experiment design as described arguably only probes the LLM's model of how it would expect others to behave in this scenario (i.e., this is a Theory of Mind task). It is unclear whether this directly predicts the model's overt behavior. Given appropriate analysis, this might be addressed by comparisons between LLM-as-judge, LLM-as-agent, and LLM-as-assistant perspectives, but analysis along this dimension seems to be absent.
- Section 5.2 provides p-values but does not specify the test being use
Although the idea of using cognitive models to study the behavior of LLMs is not new, the setting the authors have chosen to study and the model of choice serve as a great example of how one can use cognitive modeling to draw insights about LLMs. I find both the setting and model to be ecological for the study of value trade-offs in LLMs. The findings are intuitive and corroborate many existing findings. I also liked the fact that they considered both open and closed models, and used them a
The RSA model explanation is currently a bit too dense, especially for an ML audience. It would be extremely helpful if the authors could dig into the model a bit more, with examples to give readers more intuition and an interpretation of what different values of the fitted parameters mean. This is especially important as the rest of the paper builds on the assumption that readers understand the model and its parameters well. The authors could have also motivated the datasets used for fine-
Taxonomy
Topics: Natural Language Processing Techniques · Semantic Web and Ontologies
Methods: Balanced Selection
