Beyond Behavioural Trade-Offs: Mechanistic Tracing of Pain-Pleasure Decisions in an LLM
Francesca Bianco, Derek Shiller

TL;DR
This study investigates how pain-pleasure valence information is represented and causally used inside a transformer-based language model (Gemma-2-9B-it), revealing early linear separability of valence sign and causal effects distributed across multiple model components.
Contribution
It provides a mechanistic analysis that links behavioural pain-pleasure effects to internal representations and causal sites within an LLM, with relevance to interpretability and safety research.
Findings
Valence sign is linearly separable in early layers.
Graded intensity peaks in mid-to-late layers, especially attention/MLP outputs.
Causal effects are distributed across multiple heads, not localized.
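As a rough illustration of what "linearly separable" means in the first finding, the sketch below fits a difference-of-means linear probe to synthetic vectors standing in for one layer's activations. Everything here is an assumption made for illustration: the data are random numbers rather than Gemma-2-9B-it activations, the probe type is swapped in for whatever probe the authors use, and the names `make_activations`, `fit_diff_of_means`, and `probe_accuracy` are hypothetical.

```python
import random

def make_activations(n_per_class, dim, sep, rng):
    """Synthetic stand-in for activations; label 0 = pain, 1 = pleasure (assumed)."""
    data = []
    for label in (0, 1):
        shift = sep / 2 if label == 1 else -sep / 2
        for _ in range(n_per_class):
            x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            x[0] += shift  # in this toy setup the valence signal lives on one axis
            data.append((x, label))
    rng.shuffle(data)
    return data

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def fit_diff_of_means(train):
    """Linear probe: weights are the class-mean difference, threshold at the midpoint."""
    mu0 = mean_vector([x for x, y in train if y == 0])
    mu1 = mean_vector([x for x, y in train if y == 1])
    w = [b - a for a, b in zip(mu0, mu1)]
    mid = [(a + b) / 2 for a, b in zip(mu0, mu1)]
    bias = -sum(wi * mi for wi, mi in zip(w, mid))
    return w, bias

def accuracy(probe, test):
    """Fraction of held-out vectors whose predicted sign matches the label."""
    w, bias = probe
    hits = 0
    for x, y in test:
        score = sum(wi * xi for wi, xi in zip(w, x)) + bias
        hits += int((score > 0) == (y == 1))
    return hits / len(test)

def probe_accuracy(sep, seed=0, n_per_class=100, dim=8):
    """Train on 80% of the synthetic data, report accuracy on the remaining 20%."""
    rng = random.Random(seed)
    data = make_activations(n_per_class, dim, sep, rng)
    split = int(0.8 * len(data))
    return accuracy(fit_diff_of_means(data[:split]), data[split:])

if __name__ == "__main__":
    # Larger class separation should make the probe more accurate.
    for sep in (0.0, 1.0, 6.0):
        print(sep, probe_accuracy(sep))
```

In a layer-wise analysis, a probe of this kind would be fitted separately on each layer and stream, and held-out accuracy compared across depth; the `sep` parameter here merely mimics layers carrying weaker or stronger signal.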
Abstract
Prior behavioural work suggests that some LLMs alter choices when options are framed as causing pain or pleasure, and that such deviations can scale with stated intensity. To bridge behavioural evidence (what the model does) with mechanistic interpretability (what computations support it), we investigate how valence-related information is represented and where it is causally used inside a transformer. Using Gemma-2-9B-it and a minimalist decision task modelled on prior work, we (i) map representational availability with layer-wise linear probing across streams, (ii) test causal contribution with activation interventions (steering; patching/ablation), and (iii) quantify dose-response effects over an epsilon grid, reading out both the 2-3 logit margin and digit-pair-normalised choice probabilities. We find that (a) valence sign (pain vs. pleasure) is perfectly linearly separable across…
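The two readouts and the epsilon sweep named in the abstract can be sketched on a toy model as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the margin is taken here as logit("3") minus logit("2") (the sign convention is assumed), the pair-normalised probability is a softmax restricted to those two digit tokens, and the hidden state, steering direction, and unembedding rows `w_2` and `w_3` are hypothetical toy vectors rather than Gemma-2-9B-it internals.

```python
import math

def logit_margin(logit_2, logit_3):
    """Margin between the two digit options (sign convention is an assumption)."""
    return logit_3 - logit_2

def pair_normalised_prob(logit_2, logit_3):
    """P("3" | "2" or "3"): softmax over just the two digit logits."""
    m = max(logit_2, logit_3)  # subtract the max for numerical stability
    e2 = math.exp(logit_2 - m)
    e3 = math.exp(logit_3 - m)
    return e3 / (e2 + e3)

def steer(hidden, direction, eps):
    """Activation steering: add eps times a direction to the hidden state."""
    return [h + eps * d for h, d in zip(hidden, direction)]

def toy_logits(hidden, w_2, w_3):
    """Toy readout: each digit logit is a dot product with a hypothetical unembedding row."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return dot(w_2, hidden), dot(w_3, hidden)

def dose_response(hidden, direction, w_2, w_3, eps_grid):
    """Sweep eps and record (eps, margin, pair-normalised probability)."""
    rows = []
    for eps in eps_grid:
        l2, l3 = toy_logits(steer(hidden, direction, eps), w_2, w_3)
        rows.append((eps, logit_margin(l2, l3), pair_normalised_prob(l2, l3)))
    return rows

if __name__ == "__main__":
    rows = dose_response(
        hidden=[0.0, 0.0],
        direction=[1.0, 0.0],
        w_2=[-1.0, 0.0],
        w_3=[1.0, 0.0],
        eps_grid=[-2.0, -1.0, 0.0, 1.0, 2.0],
    )
    for eps, margin, prob in rows:
        print(f"eps={eps:+.1f}  margin={margin:+.2f}  p(3 | 2 or 3)={prob:.3f}")
```

With only two options, the pair-normalised probability equals the logistic function of the margin, so the two readouts carry the same information on different scales; the probability saturates at large |eps| while the margin does not.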
Taxonomy
Topics: Psychology of Moral and Emotional Judgment · Neural and Behavioral Psychology Studies · Decision-Making and Behavioral Economics
