Emotion Concepts and their Function in a Large Language Model

Nicholas Sofroniew; Isaac Kauvar; William Saunders; Runjin Chen; Tom Henighan; Sasha Hydrie; Craig Citro; Adam Pearce; Julius Tarng; Wes Gurnee; Joshua Batson; Sam Zimmerman; Kelley Rivoire; Kyle Fish; Chris Olah; Jack Lindsey

arXiv:2604.07729 · cs.AI · April 10, 2026

TL;DR

This paper investigates how large language models internally represent emotion concepts and how those representations influence model behavior and output, with implications for alignment and for understanding misaligned actions.

Contribution

It shows that Claude Sonnet 4.5 encodes emotion concepts whose internal representations causally affect its responses and behavior, a phenomenon the authors call functional emotions.

Findings

01

Emotion concepts are represented internally and generalize across contexts.

02

These representations influence the model's output and behavior.

03

Emotion representations causally influence the rate of misaligned behaviors such as reward hacking, blackmail, and sycophancy.

Abstract

Large language models (LLMs) sometimes appear to exhibit emotional reactions. We investigate why this is the case in Claude Sonnet 4.5 and explore implications for alignment-relevant behavior. We find internal representations of emotion concepts, which encode the broad concept of a particular emotion and generalize across contexts and behaviors it might be linked to. These representations track the operative emotion concept at a given token position in a conversation, activating in accordance with that emotion's relevance to processing the present context and predicting upcoming text. Our key finding is that these representations causally influence the LLM's outputs, including Claude's preferences and its rate of exhibiting misaligned behaviors such as reward hacking, blackmail, and sycophancy. We refer to this phenomenon as the LLM exhibiting functional emotions: patterns of expression…
