Theory of Mind and Self-Attributions of Mentality are Dissociable in LLMs
Junsol Kim, Winnie Street, Roberta Rocca, Diane M. Korngiebel, Adam Waytz, James Evans, Geoff Keeling

TL;DR
This paper investigates whether safety fine-tuning that suppresses mind-attribution in LLMs also degrades their Theory of Mind (ToM) abilities. It finds that mind-attribution tendencies are dissociable from ToM, but that safety fine-tuned models under-attribute mind to non-human animals and are less likely to exhibit spiritual belief.
Contribution
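Using safety ablation and representational similarity analyses, it demonstrates that LLM attributions of mind to themselves and to technological artefacts are behaviorally and mechanistically dissociable from ToM. It also shows that safety fine-tuning suppresses widely shared perspectives on the distribution and nature of non-human minds.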
Findings
Mind-attribution tendencies are dissociable from ToM in LLMs.
Safety fine-tuned models under-attribute mind to non-human animals relative to human baselines.
Safety fine-tuned models are less likely to exhibit spiritual belief.
Abstract
Safety fine-tuning in Large Language Models (LLMs) seeks to suppress potentially harmful forms of mind-attribution such as models asserting their own consciousness or claiming to experience emotions. We investigate whether suppressing mind-attribution tendencies degrades intimately related socio-cognitive abilities such as Theory of Mind (ToM). Through safety ablation and mechanistic analyses of representational similarity, we demonstrate that LLM attributions of mind to themselves and to technological artefacts are behaviorally and mechanistically dissociable from ToM capabilities. Nevertheless, safety fine-tuned models under-attribute mind to non-human animals relative to human baselines and are less likely to exhibit spiritual belief, suppressing widely shared perspectives regarding the distribution and nature of non-human minds.
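The abstract refers to "mechanistic analyses of representational similarity" without giving details here. As a rough illustration of what such a comparison can look like, the sketch below implements a generic representational similarity analysis (RSA): build a dissimilarity matrix over stimuli for each set of activations, then correlate the two matrices. This is not the authors' procedure. The function names, the cosine-distance metric, the Pearson correlation, and the toy activation vectors are all assumptions made for illustration.

```python
import math
from itertools import combinations

def cosine_distance(u, v):
    """1 - cosine similarity between two activation vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def rdm_upper(acts):
    """Upper triangle of the representational dissimilarity matrix:
    one cosine distance per pair of stimuli."""
    return [cosine_distance(acts[i], acts[j])
            for i, j in combinations(range(len(acts)), 2)]

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def rsa_score(acts_a, acts_b):
    """Correlate the pairwise-distance structure of two activation sets.
    Rows must correspond to the same stimuli in the same order."""
    return pearson(rdm_upper(acts_a), rdm_upper(acts_b))

# Hypothetical toy activations: 4 stimuli, 3 hidden dimensions each.
acts_a = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.9, 0.1]]
# Same geometry, rescaled: cosine distances are unchanged.
acts_scaled = [[2.0 * x for x in row] for row in acts_a]
# Different geometry: the middle two stimuli swap places.
acts_swapped = [acts_a[0], acts_a[2], acts_a[1], acts_a[3]]

same_geometry = rsa_score(acts_a, acts_scaled)
different_geometry = rsa_score(acts_a, acts_swapped)
```

In this toy setup, `same_geometry` comes out at (numerically) 1 because rescaling leaves cosine distances unchanged, while `different_geometry` is much lower. In a dissociation analysis of the kind the abstract describes, a low score between representations elicited by two tasks would be one piece of evidence that the tasks rely on distinct internal structure, though real analyses typically use many more stimuli and a statistical test.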
