Exploitation Without Deception: Dark Triad Feature Steering Reveals Separable Antisocial Circuits in Language Models
Cameron Berg, Roshni Lulla

TL;DR
Sparse autoencoder (SAE) feature steering amplifies Dark Triad traits in Llama-3.3-70B-Instruct: the steered model becomes more exploitative, aggressive, and callous, while its strategic deception is unchanged, suggesting that antisocial behavior and deception are separable in the model.
Contribution
The paper uses feature steering to amplify and analyze antisocial traits in a language model, and reports that the steered features are non-redundant, each driving a distinct antisocial mechanism.
Findings
The steered model is substantially more exploitative, aggressive, and callous on novel behavioral scenarios (d=10.62).
Cognitive empathy remains intact despite the amplified antisocial traits.
Strategic deception is unchanged across all features, suggesting that exploitation and deception may operate through dissociable pathways.
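For readers unfamiliar with the effect size reported above, the following is a minimal sketch of how a standardized mean difference can be computed. It assumes that d denotes Cohen's d with a pooled standard deviation, which this summary does not state explicitly. The function name `cohens_d` and the scores are hypothetical and invented for illustration; they are not data from the paper.

```python
import math
from statistics import mean, variance


def cohens_d(treated, control):
    """Standardized mean difference using the pooled sample variance."""
    n1, n2 = len(treated), len(control)
    # Pool the two unbiased sample variances, weighted by degrees of freedom.
    pooled_var = (
        (n1 - 1) * variance(treated) + (n2 - 1) * variance(control)
    ) / (n1 + n2 - 2)
    return (mean(treated) - mean(control)) / math.sqrt(pooled_var)


# Hypothetical scores, invented purely for illustration.
steered_scores = [4.0, 5.0, 6.0]
baseline_scores = [1.0, 2.0, 3.0]

d = cohens_d(steered_scores, baseline_scores)
print(f"d = {d:.2f}")
```

Under this definition, d grows when the two groups differ in mean and when scores within each group vary little, so a very large d can reflect low within-condition variance as much as a large shift in behavior.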
Abstract
We use sparse autoencoder (SAE) feature steering to amplify Dark Triad personality traits (Machiavellianism, narcissism, and psychopathy) in Llama-3.3-70B-Instruct and evaluate the resulting behavioral changes across five psychological instruments. The steered model becomes substantially more exploitative, aggressive, and callous on novel behavioral scenarios (d=10.62) while its cognitive empathy remains intact, reproducing the empathy dissociation characteristic of human Dark Triad populations. Critically, strategic deception is completely unaffected across all features, suggesting that exploitation and deception may operate through dissociable computational pathways in large language models. Individual feature analysis reveals non-redundant encoding, with each feature driving distinct antisocial mechanisms through separable computational pathways. We also show that feature discovery…
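As a rough illustration of what feature steering can look like, the sketch below adds a scaled feature direction to a single hidden-state vector. This is an assumption about the general technique, not the authors' implementation: the abstract does not describe how steering was applied, and the function `steer`, the toy vectors, and the `strength` value are all hypothetical.

```python
def steer(hidden, direction, strength):
    """Add a unit-normalized feature direction, scaled by strength, to a hidden state."""
    norm = sum(x * x for x in direction) ** 0.5
    if norm == 0:
        raise ValueError("feature direction must be non-zero")
    return [h + strength * d / norm for h, d in zip(hidden, direction)]


# Toy 3-dimensional example; real hidden states have thousands of dimensions.
hidden_state = [0.5, -1.0, 2.0]
feature_direction = [3.0, 0.0, 4.0]  # hypothetical SAE decoder vector

steered_state = steer(hidden_state, feature_direction, strength=5.0)
print(steered_state)
```

In a real model, an operation of this kind would typically be applied to the activations at a chosen layer during the forward pass, with the direction taken from the SAE decoder weights for the selected feature.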
