CoMMET: To What Extent Can LLMs Perform Theory of Mind Tasks?
Ruirui Chen, Weifeng Jiang, Chengwei Qin, Cheston Tan

TL;DR
This paper introduces CoMMET, a novel multimodal benchmark dataset designed to evaluate Large Language Models' Theory of Mind abilities across diverse mental states and multi-turn conversations, revealing current strengths and limitations.
Contribution
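The paper presents CoMMET, which the authors describe as, to the best of their knowledge, the first multimodal benchmark to evaluate ToM in LLMs in a multi-turn conversational setting. It expands the evaluation scope beyond belief tasks and provides insights into models' social reasoning capabilities.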
Findings
LLMs show varied performance across mental states.
Multi-turn evaluation reveals limitations in current models.
CoMMET enables comprehensive assessment of social cognition in LLMs.
Abstract
Theory of Mind (ToM), the ability to reason about the mental states of oneself and others, is a cornerstone of human social intelligence. As Large Language Models (LLMs) become ubiquitous in real-world applications, validating their capacity for this level of social reasoning is essential for effective and natural interactions. However, existing benchmarks for assessing ToM in LLMs are limited; most rely solely on text inputs and focus narrowly on belief-related tasks. In this paper, we propose a new multimodal benchmark dataset, CoMMET, a Comprehensive Mental states and Moral Evaluation Task inspired by the Theory of Mind Booklet Task. CoMMET expands the scope of evaluation by covering a broader range of mental states and introducing multi-turn testing. To the best of our knowledge, this is the first multimodal dataset to evaluate ToM in a multi-turn conversational setting. Through a…
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Sentiment Analysis and Opinion Mining
