"Dark Triad" Model Organisms of Misalignment: Narrow Fine-Tuning Mirrors Human Antisocial Behavior
Roshni Lulla, Fiona Collins, Sanaya Parekh, Thilo Hagendorff, Jonas Kaplan

TL;DR
This paper uses the Dark Triad personality traits as a framework to characterize antisocial behavior in humans and to induce analogous misaligned behavior in language models through minimal fine-tuning, revealing shared behavioral structures.
Contribution
It introduces an approach that links human personality traits to artificial misalignment and demonstrates that narrow fine-tuning can reliably induce antisocial profiles in language models.
Findings
Dark Triad traits correlate with specific behavioral patterns in humans.
Minimal fine-tuning induces Dark Triad-like behaviors in LLMs.
Models generalize beyond training data, showing out-of-context reasoning.
Abstract
The alignment problem refers to the challenge of ensuring that powerful intelligences remain compatible with human preferences and values as their capabilities increase. Current large language models (LLMs) show misaligned behaviors, such as strategic deception, manipulation, and reward-seeking, that can arise despite safety training. Gaining a mechanistic understanding of these failures requires empirical approaches that can isolate behavioral patterns in controlled settings. We propose that biological misalignment precedes artificial misalignment, and leverage the Dark Triad of personality (narcissism, psychopathy, and Machiavellianism) as a psychologically grounded framework for constructing model organisms of misalignment. In Study 1, we establish comprehensive behavioral profiles of Dark Triad traits in a human population (N = 318), identifying affective dissonance as a central empathic…
Taxonomy
Topics: Personality Traits and Psychology · Evolutionary Psychology and Human Behavior · Mental Health via Writing
