Mechanistic Decoding of Cognitive Constructs in Large Language Models
Yitong Shou, Manhao Guan

TL;DR
This paper introduces an interpretability framework that decodes how large language models represent complex emotions such as jealousy, exposing the psychological structure of those representations and enabling targeted interventions.
Contribution
The paper develops a Cognitive Reverse-Engineering framework that combines appraisal theory with causal methods to analyze and manipulate emotional representations in LLMs.
Findings
Models encode jealousy as a linear combination of psychological factors.
Internal representations align with human psychological constructs.
Toxic emotional states can be detected and suppressed through the framework.
Abstract
While Large Language Models (LLMs) demonstrate increasingly sophisticated affective capabilities, the internal mechanisms by which they process complex emotions remain unclear. Existing interpretability approaches often treat models as black boxes or focus on coarse-grained basic emotions, leaving the cognitive structure of more complex affective states underexplored. To bridge this gap, we propose a Cognitive Reverse-Engineering framework based on Representation Engineering (RepE) to analyze social-comparison jealousy. By combining appraisal theory with subspace orthogonalization, regression-based weighting, and bidirectional causal steering, we isolate and quantify two psychological antecedents of jealousy, Superiority of Comparison Person and Domain Self-Definitional Relevance, and examine their causal effects on model judgments. Experiments on eight LLMs from the Llama, Qwen, and…
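The three operations the abstract names (subspace orthogonalization, regression-based weighting, and bidirectional steering) can be illustrated on synthetic vectors. The following is a minimal sketch under stated assumptions, not the authors' implementation: the hidden size, the concept directions `v_sup` and `v_rel`, the mixing weights, and the helper `steer` are all hypothetical, and random vectors stand in for real model activations.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # hypothetical hidden size; real models are much larger


def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)


# Hypothetical concept directions standing in for the two antecedents:
# Superiority of Comparison Person and Domain Self-Definitional Relevance.
v_sup = unit(rng.normal(size=d))
v_rel_raw = unit(rng.normal(size=d) + 0.5 * v_sup)  # deliberately correlated

# Subspace orthogonalization (one Gram-Schmidt step): remove the part of
# the second direction that overlaps with the first.
v_rel = unit(v_rel_raw - (v_rel_raw @ v_sup) * v_sup)

# Synthetic "jealousy" direction built as a noisy linear mix of the two
# factors. The weights are invented for illustration only.
true_w = np.array([0.7, 0.3])
v_jealousy = true_w[0] * v_sup + true_w[1] * v_rel + 0.01 * rng.normal(size=d)

# Regression-based weighting: least-squares fit of the jealousy direction
# onto the two orthogonal factor directions.
basis = np.stack([v_sup, v_rel], axis=1)  # shape (d, 2)
weights, *_ = np.linalg.lstsq(basis, v_jealousy, rcond=None)


def steer(hidden, direction, alpha):
    """Shift a hidden state along a unit direction by strength alpha."""
    return hidden + alpha * direction


# Bidirectional steering: push the same hidden state up and down along
# one factor and compare the projections.
h = rng.normal(size=d)
h_up = steer(h, v_sup, 4.0)
h_down = steer(h, v_sup, -4.0)

print("recovered weights:", np.round(weights, 3))
print("projection shift up/down:", (h_up - h) @ v_sup, (h_down - h) @ v_sup)
```

In a real setting the directions would be estimated from model activations on contrastive prompts and the steered state would be fed back through the remaining layers; here the sketch only checks that orthogonal factors make the regression weights directly readable and that steering moves the projection by the chosen amount in each direction.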
