CoSToM: Causal-oriented Steering for Intrinsic Theory-of-Mind Alignment in Large Language Models
Mengfan Li, Xuanhua Shi, Yang Deng

TL;DR
CoSToM is a framework that improves large language models' social reasoning by mapping internal ToM features and actively steering their activation for better alignment with human-like cognition.
Contribution
The paper introduces CoSToM, a novel causal-oriented steering method that enhances LLMs' intrinsic ToM capabilities through internal feature mapping and targeted activation control.
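As a rough illustration of what "targeted activation control" can look like, the sketch below adds a scaled direction vector to the hidden state at one chosen layer of a toy model. Everything here is an assumption made for illustration: the toy affine layers, the names `make_layer`, `forward`, and `mean_difference`, and the difference-of-means recipe for building a steering direction (a common technique in the steering literature, not confirmed as the paper's method).

```python
from typing import Callable, List, Optional

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_layer(scale: float, shift: float) -> Layer:
    """Toy stand-in for a model block: an elementwise affine map (hypothetical)."""
    return lambda h: [scale * x + shift for x in h]


def forward(
    layers: List[Layer],
    x: Vector,
    steer_layer: Optional[int] = None,
    direction: Optional[Vector] = None,
    alpha: float = 0.0,
) -> Vector:
    """Run the toy model, optionally adding alpha * direction after one layer."""
    h = list(x)
    for i, layer in enumerate(layers):
        h = layer(h)
        if i == steer_layer and direction is not None:
            # The steering intervention: shift the hidden state along `direction`.
            h = [v + alpha * d for v, d in zip(h, direction)]
    return h


def mean_difference(pos: List[Vector], neg: List[Vector]) -> Vector:
    """Assumed recipe: direction = mean(pos activations) - mean(neg activations)."""
    dim = len(pos[0])
    mean_pos = [sum(v[k] for v in pos) / len(pos) for k in range(dim)]
    mean_neg = [sum(v[k] for v in neg) / len(neg) for k in range(dim)]
    return [p - n for p, n in zip(mean_pos, mean_neg)]


if __name__ == "__main__":
    layers = [make_layer(1.0, 0.0), make_layer(2.0, 0.0)]
    print(forward(layers, [1.0, 1.0]))
    print(forward(layers, [1.0, 1.0], steer_layer=0, direction=[1.0, 0.0], alpha=0.5))
```

In a real LLM the same idea would typically be applied to a layer's residual-stream activations through a forward hook; which layers to steer and how large `alpha` should be are choices the summary above does not specify.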
Findings
CoSToM significantly improves social reasoning in LLMs.
Internal ToM features are mapped via causal tracing.
Targeted activation steering enhances downstream dialogue quality.
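The causal-tracing finding can be pictured with a minimal activation-patching sketch: run a clean input and a corrupted input, then copy one clean hidden unit into the corrupted run and measure how much of the clean output is recovered. This is a generic toy version under stated assumptions, not the paper's procedure; the two-layer model and the names `run` and `trace` are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

Vector = List[float]
Layer = Callable[[Vector], Vector]
Patch = Tuple[int, int, float]  # (layer index, unit index, value to write)


def run(layers: List[Layer], x: Vector, patch: Optional[Patch] = None):
    """Run the toy model; optionally overwrite one hidden unit after one layer."""
    h = list(x)
    states = []
    for i, layer in enumerate(layers):
        h = list(layer(h))
        if patch is not None and patch[0] == i:
            h[patch[1]] = patch[2]
        states.append(list(h))
    return h, states


def trace(
    layers: List[Layer],
    clean: Vector,
    corrupt: Vector,
    score: Callable[[Vector], float],
) -> Dict[Tuple[int, int], float]:
    """Fraction of the clean-vs-corrupt score gap recovered by each single patch."""
    clean_out, clean_states = run(layers, clean)
    corrupt_out, _ = run(layers, corrupt)
    base, top = score(corrupt_out), score(clean_out)
    if top == base:
        raise ValueError("clean and corrupted runs score the same; nothing to trace")
    effects = {}
    for i, state in enumerate(clean_states):
        for k, value in enumerate(state):
            out, _ = run(layers, corrupt, patch=(i, k, value))
            effects[(i, k)] = (score(out) - base) / (top - base)
    return effects


if __name__ == "__main__":
    # Hypothetical two-layer model: unit 0 carries the signal that differs.
    toy = [lambda h: [2 * h[0], h[1]], lambda h: [h[0] + h[1], 0.0]]
    print(trace(toy, clean=[1.0, 1.0], corrupt=[0.0, 1.0], score=lambda out: out[0]))
```

Units whose restoration recovers most of the gap are the ones that causally carry the information that differs between the two inputs; in this toy case that is unit 0 at both layers.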
Abstract
Theory of Mind (ToM), the ability to attribute mental states to others, is a hallmark of social intelligence. While large language models (LLMs) demonstrate promising performance on standard ToM benchmarks, we observe that they often fail to generalize to complex task-specific scenarios, relying heavily on prompt scaffolding to mimic reasoning. The critical misalignment between the internal knowledge and external behavior raises a fundamental question: Do LLMs truly possess intrinsic cognition, and can they externalize this internal knowledge into stable, high-quality behaviors? To answer this, we introduce CoSToM (Causal-oriented Steering for ToM alignment), a framework that transitions from mechanistic interpretation to active intervention. First, we employ causal tracing to map the internal distribution of ToM features, empirically uncovering the internal layers' characteristics in…
