Negative Before Positive: Asymmetric Valence Processing in Large Language Models
Sohan Venkatesh

TL;DR
This paper investigates how large language models process emotional valence, revealing that negative and positive emotions are encoded at different network depths and can be manipulated through targeted steering.
Contribution
It demonstrates that emotional valence in LLMs is localized, causal, and steerable, providing a concrete target for interpretability and oversight.
Findings
Negative valence localizes to early layers
Positive valence peaks at mid-to-late layers
Steering with the good-news direction at the identified layers shifts neutral prompts toward positive valence
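The layer-wise findings above rest on activation patching. As an illustration only, here is a minimal sketch of the general technique on a hypothetical three-layer toy model. It is not the paper's setup: the layers, the inputs, and the `recovery` metric are all assumptions made for this example. The idea is to cache activations from a "clean" run, overwrite one coordinate at one layer during a "corrupted" run, and measure how much of the clean output is restored.

```python
# Toy activation patching sketch. Everything here (layers, inputs, metric)
# is a hypothetical stand-in, not the model or data used in the paper.

def run(layers, x, patch=None):
    """Run the toy model; optionally overwrite one coordinate after one layer.

    patch is an optional tuple (layer_index, coordinate, value).
    Returns (output_metric, list_of_per_layer_activations).
    """
    h = list(x)
    cache = []
    for idx, layer in enumerate(layers):
        h = list(layer(h))
        if patch is not None and patch[0] == idx:
            h[patch[1]] = patch[2]  # splice in the clean activation
        cache.append(list(h))
    return h[0], cache  # the metric is the first coordinate of the final state


# Hypothetical layers acting on a 2-dimensional hidden state.
layers = [
    lambda h: [h[0] + h[1], h[1]],
    lambda h: [h[0], 0.5 * h[1]],
    lambda h: [2.0 * h[0], h[1]],
]

clean_input = [1.0, 1.0]     # stands in for one prompt
corrupt_input = [1.0, -1.0]  # stands in for its valence-flipped counterpart

clean_out, clean_cache = run(layers, clean_input)
corrupt_out, _ = run(layers, corrupt_input)

# Patch each (layer, coordinate) from the clean run into the corrupted run
# and record the fraction of the clean-vs-corrupted gap that is recovered.
recovery = {}
for layer_idx in range(len(layers)):
    for coord in range(2):
        value = clean_cache[layer_idx][coord]
        patched_out, _ = run(layers, corrupt_input, patch=(layer_idx, coord, value))
        recovery[(layer_idx, coord)] = (patched_out - corrupt_out) / (clean_out - corrupt_out)
```

In this toy, entries of `recovery` near 1 mark sites whose activation is sufficient to restore the clean output. Localization claims of the kind listed above are typically read off such a table, computed per layer of a real model.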
Abstract
Mechanistic interpretability has revealed how concepts are encoded in large language models (LLMs), but emotional content remains poorly understood at the mechanistic level. We study whether LLMs process emotional valence through dedicated internal structure or through surface token matching. Using activation patching and steering on open-source LLMs, we find that negative and positive valence are processed at different network depths. Negative outcomes localize to early layers while positive outcomes peak at mid-to-late layers. Holding topic fixed while flipping valence produces sign-opposite responses, ruling out topic detection. Steering with the good-news direction at the identified layers shifts neutral prompts toward positive valence, showing these layers encode valence as a manipulable direction. Emotional valence in LLMs is localized, causal and steerable, making it a concrete…
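To make the steering idea concrete, the following is a minimal sketch under stated assumptions. It builds a direction as the normalized difference of mean activations between two synthetic groups and adds a scaled copy to a neutral hidden state. The difference-of-means construction, the synthetic data, and the names `steering_direction`, `steer`, and `alpha` are assumptions for illustration; the abstract does not say how the good-news direction is actually computed.

```python
import random

# Sketch of activation steering with a difference-of-means direction.
# The data is synthetic; no real model activations are involved.

def mean_vector(rows):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def steering_direction(pos_acts, neg_acts):
    """Unit vector pointing from the negative-group mean to the positive-group mean."""
    mu_pos = mean_vector(pos_acts)
    mu_neg = mean_vector(neg_acts)
    diff = [p - q for p, q in zip(mu_pos, mu_neg)]
    norm = sum(d * d for d in diff) ** 0.5
    return [d / norm for d in diff]

def steer(hidden, direction, alpha):
    """Add alpha times the direction to a hidden state (assumed steering rule)."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

def project(hidden, direction):
    """Scalar projection of a hidden state onto the direction."""
    return sum(h * d for h, d in zip(hidden, direction))

rng = random.Random(0)
dim = 8

# Hypothetical activations: "good news" is shifted +1 along axis 0,
# "bad news" is shifted -1 along axis 0, plus small Gaussian noise.
good = [[rng.gauss(0, 0.1) + (1.0 if i == 0 else 0.0) for i in range(dim)]
        for _ in range(50)]
bad = [[rng.gauss(0, 0.1) - (1.0 if i == 0 else 0.0) for i in range(dim)]
       for _ in range(50)]

direction = steering_direction(good, bad)
neutral = [0.0] * dim
steered = steer(neutral, direction, alpha=4.0)
```

Because `direction` has unit length and `neutral` is the zero vector, `project(steered, direction)` equals `alpha` up to floating-point error. In a real model the same addition would be applied to the residual stream at a chosen layer, and the effect would be judged from the generated text rather than from a projection.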
