Linear Representations of Sentiment in Large Language Models
Curt Tigges, Oskar John Hollinsworth, Atticus Geiger, Neel Nanda

TL;DR
This paper shows that sentiment in large language models is represented linearly: a single direction in activation space mostly captures it, and that direction is causally relevant. The mechanisms that use it involve a small subset of attention heads and neurons, and the authors also report a phenomenon they term the summarization motif.
Contribution
It reveals the linear structure of sentiment representation in LLMs, identifies the causal role of a specific direction, and introduces the summarization motif phenomenon.
Findings
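A single direction in activation space mostly captures sentiment across a range of tasks, with positive at one extreme and negative at the other.
Causal interventions show that this direction is causally relevant in both toy tasks and real-world datasets such as Stanford Sentiment Treebank.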
Ablation of the sentiment direction significantly reduces classification accuracy.
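As a rough illustration of what ablating a direction can mean, the sketch below removes the component of each activation vector that lies along a given direction. It is a minimal NumPy example on random data, not the paper's code: the function name `ablate_direction`, the array shapes, and the choice to zero out the projection (rather than, say, replacing it with a mean value) are all assumptions made for illustration.

```python
import numpy as np

def ablate_direction(activations, direction):
    """Remove the component of each activation vector along `direction`.

    activations: array of shape (batch, d_model)
    direction:   array of shape (d_model,), need not be normalised
    """
    unit = direction / np.linalg.norm(direction)
    # Scalar projection of every activation onto the unit direction.
    coeffs = activations @ unit
    # Subtract the projected component, leaving the orthogonal remainder.
    return activations - np.outer(coeffs, unit)

# Hypothetical stand-ins for a sentiment direction and model activations.
rng = np.random.default_rng(0)
direction = rng.normal(size=16)
acts = rng.normal(size=(8, 16))

ablated = ablate_direction(acts, direction)
```

After this operation the activations carry no information along the chosen direction, so any classifier that reads sentiment only from that direction would lose its signal, while everything orthogonal to it is left unchanged.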
Abstract
Sentiment is a pervasive feature in natural language text, yet it is an open question how sentiment is represented within Large Language Models (LLMs). In this study, we reveal that across a range of models, sentiment is represented linearly: a single direction in activation space mostly captures the feature across a range of tasks with one extreme for positive and the other for negative. Through causal interventions, we isolate this direction and show it is causally relevant in both toy tasks and real world datasets such as Stanford Sentiment Treebank. Through this case study we model a thorough investigation of what a single direction means on a broad data distribution. We further uncover the mechanisms that involve this direction, highlighting the roles of a small subset of attention heads and neurons. Finally, we discover a phenomenon which we term the summarization motif:…
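One simple way to picture a linear sentiment direction is the difference between the mean activation on positive inputs and the mean activation on negative inputs. The sketch below uses that difference-of-means approach on synthetic data; it is an assumption chosen for illustration and is not claimed to be the method the authors use. The helper names, the dimensions, and the planted "true" direction are all hypothetical.

```python
import numpy as np

def mean_difference_direction(pos_acts, neg_acts):
    """Unit vector pointing from the negative-class mean to the positive-class mean."""
    diff = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def sentiment_score(acts, direction):
    """Scalar projection of each activation onto the direction."""
    return acts @ direction

# Synthetic activations: Gaussian noise plus a planted shift along axis 0.
rng = np.random.default_rng(0)
true_dir = np.zeros(32)
true_dir[0] = 1.0
pos = rng.normal(size=(200, 32)) + 3.0 * true_dir   # hypothetical "positive" inputs
neg = rng.normal(size=(200, 32)) - 3.0 * true_dir   # hypothetical "negative" inputs

direction = mean_difference_direction(pos, neg)
pos_scores = sentiment_score(pos, direction)
neg_scores = sentiment_score(neg, direction)
```

With a direction in hand, projecting any activation onto it gives a single number whose sign and magnitude indicate where the input falls between the negative and positive extremes.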
Taxonomy
Topics: Topic Modeling · Sentiment Analysis and Opinion Mining · Natural Language Processing Techniques
