Linear Representations of Sentiment in Large Language Models

Curt Tigges; Oskar John Hollinsworth; Atticus Geiger; Neel Nanda

arXiv:2310.15154 · cs.LG · October 24, 2023 · 5 cites


TL;DR

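This paper shows that large language models represent sentiment linearly: a single direction in activation space, with positive at one extreme and negative at the other, captures most of the feature. The direction is causally relevant, the mechanisms that use it involve a small subset of attention heads and neurons, and the authors describe a phenomenon they term the summarization motif.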

Contribution

The paper reveals the linear structure of sentiment representation in LLMs, establishes the causal role of a single sentiment direction through interventions, and introduces the summarization motif.

Findings

01

Sentiment is linearly represented by a single direction in activation space.

02

Causal interventions confirm the importance of this sentiment direction.

03

Ablation of the sentiment direction significantly reduces classification accuracy.
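The third finding can be illustrated with a toy sketch. The example below is not the authors' code and does not use model activations: it plants a hypothetical `sentiment_dir` in synthetic Gaussian "activations", fits a simple class-mean classifier, and compares held-out accuracy before and after projecting that direction out. Projecting to zero is an assumption made here for simplicity; the summary above does not say which ablation variant the paper uses, and all names and parameters are illustrative.

```python
import numpy as np


def ablate_direction(acts, direction):
    """Remove the component of each activation vector along `direction`."""
    d = direction / np.linalg.norm(direction)
    return acts - np.outer(acts @ d, d)


def fit_mean_classifier(acts, labels):
    """Linear classifier whose weight vector is the difference of class means."""
    mu_pos = acts[labels == 1].mean(axis=0)
    mu_neg = acts[labels == 0].mean(axis=0)
    w = mu_pos - mu_neg
    b = -0.5 * (mu_pos + mu_neg) @ w  # decision boundary at the midpoint
    return w, b


def accuracy(acts, labels, w, b):
    preds = (acts @ w + b > 0).astype(int)
    return float((preds == labels).mean())


def make_data(rng, n, direction, strength=2.0):
    """Synthetic stand-in for activations: unit Gaussian noise plus a
    label-dependent shift of +/- `strength` along `direction` (assumption)."""
    labels = rng.integers(0, 2, size=n)
    signs = 2 * labels - 1
    noise = rng.normal(size=(n, direction.size))
    return noise + strength * np.outer(signs, direction), labels


rng = np.random.default_rng(0)
dim = 16
sentiment_dir = rng.normal(size=dim)  # hypothetical planted direction
sentiment_dir /= np.linalg.norm(sentiment_dir)

train_acts, train_labels = make_data(rng, 500, sentiment_dir)
test_acts, test_labels = make_data(rng, 500, sentiment_dir)

w, b = fit_mean_classifier(train_acts, train_labels)
acc_clean = accuracy(test_acts, test_labels, w, b)
acc_ablated = accuracy(ablate_direction(test_acts, sentiment_dir), test_labels, w, b)

print(f"clean accuracy:   {acc_clean:.3f}")
print(f"ablated accuracy: {acc_ablated:.3f}")
```

In this toy setting the label information lives only along the planted direction, so removing it should push accuracy toward chance. Real model activations are messier, which is why the size of the drop on real datasets is an empirical question.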

Abstract

Sentiment is a pervasive feature in natural language text, yet it is an open question how sentiment is represented within Large Language Models (LLMs). In this study, we reveal that across a range of models, sentiment is represented linearly: a single direction in activation space mostly captures the feature across a range of tasks with one extreme for positive and the other for negative. Through causal interventions, we isolate this direction and show it is causally relevant in both toy tasks and real world datasets such as Stanford Sentiment Treebank. Through this case study we model a thorough investigation of what a single direction means on a broad data distribution. We further uncover the mechanisms that involve this direction, highlighting the roles of a small subset of attention heads and neurons. Finally, we discover a phenomenon which we term the summarization motif:…
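To make "a single direction in activation space ... with one extreme for positive and the other for negative" concrete, here is a minimal sketch of one simple way such a direction could be estimated: the difference between the mean activation of positive and negative examples. This is an illustrative choice rather than a description of the paper's procedure, and the data is synthetic, with a hypothetical `planted_dir` standing in for whatever direction a real model might use.

```python
import numpy as np


def mean_difference_direction(acts, labels):
    """Unit vector pointing from the negative class mean to the positive one."""
    diff = acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)
    return diff / np.linalg.norm(diff)


def project(acts, direction):
    """Scalar coordinate of each mean-centered activation along `direction`."""
    return (acts - acts.mean(axis=0)) @ direction


rng = np.random.default_rng(1)
n, dim = 400, 16

# Hypothetical ground-truth direction used only to generate toy data.
planted_dir = rng.normal(size=dim)
planted_dir /= np.linalg.norm(planted_dir)

labels = rng.integers(0, 2, size=n)  # 1 = positive, 0 = negative
signs = 2 * labels - 1
acts = rng.normal(size=(n, dim)) + 2.0 * np.outer(signs, planted_dir)

est_dir = mean_difference_direction(acts, labels)
cosine = float(est_dir @ planted_dir)  # both vectors are unit length

scores = project(acts, est_dir)
pos_mean = float(scores[labels == 1].mean())
neg_mean = float(scores[labels == 0].mean())

print(f"cosine with planted direction: {cosine:.3f}")
print(f"mean projection, positive: {pos_mean:.2f}, negative: {neg_mean:.2f}")
```

Projecting onto the estimated direction reduces each activation to one number, so positive and negative examples fall on opposite sides of zero when the representation really is linear.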


Code & Models

Repositories

curt-tigges/eliciting-latent-sentiment
jax · Official


Taxonomy

Topics: Topic Modeling · Sentiment Analysis and Opinion Mining · Natural Language Processing Techniques