Small LLMs Do Not Learn a Generalizable Theory of Mind via Reinforcement Learning
Sneheel Sarangi, Hanan Salam

TL;DR
This study examines whether small language models can develop a generalizable Theory of Mind through reinforcement learning. It finds that they tend to overfit to their training data and fail to transfer what they learn to new, unseen tasks.
Contribution
It provides a systematic evaluation showing small LLMs struggle to acquire a true, generalizable Theory of Mind via RL, highlighting limitations of current training methods.
Findings
Small LLMs improve on in-distribution ToM tasks
Models overfit to the training datasets and fail to generalize to unseen ToM tasks
Prolonged RL leads to overfitting and performance degradation on out-of-distribution data
Abstract
Recent advancements in large language models (LLMs) have demonstrated emergent capabilities in complex reasoning, largely spurred by rule-based Reinforcement Learning (RL) techniques applied during post-training. This has raised the question of whether similar methods can instill more nuanced, human-like social intelligence, such as a Theory of Mind (ToM), in LLMs. This paper investigates whether small-scale LLMs can acquire a robust and generalizable ToM capability through RL with verifiable rewards (RLVR). We conduct a systematic evaluation by training models on various combinations of prominent ToM datasets (HiToM, ExploreToM, FANToM) and testing for generalization on held-out datasets (e.g., OpenToM). Our findings indicate that small LLMs struggle to develop a generic ToM capability. While performance on in-distribution tasks improves, this capability fails to transfer to unseen…
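The evaluation design described in the abstract (train on combinations of ToM datasets, test on a held-out one, score with a verifiable reward) can be illustrated with a minimal sketch. Everything beyond the dataset names is an assumption made for illustration: the helper names `training_combinations` and `verifiable_reward` are hypothetical, the exact-match rule is a common choice for rule-based rewards rather than the paper's confirmed reward, and the paper may not have trained on every subset.

```python
from itertools import combinations

# Dataset names come from the abstract; treating OpenToM as the only
# held-out set is an assumption (the abstract lists it as an example).
TRAIN_POOL = ["HiToM", "ExploreToM", "FANToM"]
HELD_OUT = ["OpenToM"]


def training_combinations(pool):
    """Return every non-empty subset of the pool, smallest subsets first."""
    return [
        list(subset)
        for size in range(1, len(pool) + 1)
        for subset in combinations(pool, size)
    ]


def verifiable_reward(predicted, gold):
    """Hypothetical rule-based reward: 1.0 on a case-insensitive,
    whitespace-trimmed exact match with the gold answer, else 0.0."""
    return 1.0 if predicted.strip().lower() == gold.strip().lower() else 0.0


# One train/test split per training combination, always tested out of distribution.
splits = [
    {"train": combo, "test": HELD_OUT}
    for combo in training_combinations(TRAIN_POOL)
]

for split in splits:
    print(split["train"], "->", split["test"])
```

Comparing accuracy on the training datasets with accuracy on the held-out dataset for each split is what would expose the gap between in-distribution gains and out-of-distribution transfer reported in the findings.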
