TL;DR
DialToM is a benchmark to evaluate Large Language Models' ability to predict dialogue trajectories based on mental states, revealing strengths in mental state identification but weaknesses in forecasting social outcomes.
Contribution
Introduces DialToM, a novel benchmark for assessing both mental state prediction and trajectory forecasting in dialogue, highlighting reasoning gaps in current LLMs.
Findings
LLMs excel at identifying mental states but struggle with trajectory forecasting.
Most LLMs, except Gemini 3 Pro, fail to leverage mental states for social predictions.
Human and LLM-generated inferences show only weak semantic similarity.
Abstract
Large Language Models (LLMs) have been shown to possess Theory of Mind (ToM) abilities. However, it remains unclear whether this stems from robust reasoning or spurious correlations. We introduce DialToM, a human-verified benchmark built from natural human dialogue using a multiple-choice framework. We evaluate not only mental state prediction (Literal ToM) but also the functional utility of these states (Functional ToM) through Prospective Diagnostic Forecasting -- probing whether models can identify state-consistent dialogue trajectories solely from mental-state profiles. Our results reveal a significant reasoning asymmetry: while LLMs excel at identifying mental states, most (except for Gemini 3 Pro) fail to leverage this understanding to forecast social trajectories. Additionally, we find only weak semantic similarities between human and LLM-generated inferences. To facilitate…
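The reasoning asymmetry described in the abstract can be pictured as a gap between two multiple-choice accuracies. The sketch below is a minimal illustration, not the benchmark's actual scoring code: the `Item` fields, the task labels `"literal"` and `"functional"`, and the toy items are all hypothetical, and the numbers it produces are not results from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Item:
    """One multiple-choice question (hypothetical schema for this sketch)."""
    task: str        # "literal" (mental state) or "functional" (trajectory); labels assumed
    options: list    # candidate answers shown to the model
    gold: int        # index of the correct option
    predicted: int   # index the model chose


def accuracy_by_task(items):
    """Return the fraction of correctly answered items for each task label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        total[item.task] += 1
        correct[item.task] += int(item.predicted == item.gold)
    return {task: correct[task] / total[task] for task in total}


# Toy data invented for illustration only; not drawn from DialToM.
toy_items = [
    Item("literal", ["A", "B", "C", "D"], gold=0, predicted=0),
    Item("literal", ["A", "B", "C", "D"], gold=2, predicted=2),
    Item("functional", ["A", "B", "C", "D"], gold=1, predicted=3),
    Item("functional", ["A", "B", "C", "D"], gold=1, predicted=1),
]

scores = accuracy_by_task(toy_items)
# A positive gap means the model is better at naming mental states
# than at using them to pick the consistent dialogue trajectory.
gap = scores["literal"] - scores["functional"]
print(scores, gap)
```

Reporting the two accuracies separately, rather than pooling them, is what makes such an asymmetry visible: a single aggregate score would hide a model that identifies mental states well but cannot use them for forecasting.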
