Are Vision Language Models Cross-Cultural Theory of Mind Reasoners?
Zabir Al Nazi, GM Shahariar, Md. Abrar Hossain, Wei Peng

TL;DR
This paper introduces CulturalToM-VQA, a diverse benchmark for evaluating Vision-Language Models' ability to perform cross-cultural Theory of Mind reasoning, revealing significant performance gaps and biases in current models.
Contribution
The paper presents a new culturally diverse ToM benchmark and evaluates 10 VLMs, highlighting limitations in false belief reasoning, regional biases, and social desirability effects.
Findings
Frontier models achieve over 93% accuracy on ToM tasks.
Models struggle with false belief reasoning, where accuracy ranges from 19% to 83%.
Accuracy varies widely by region, with gaps of 20-30% across models.
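As a rough illustration of how figures like these could be computed, the sketch below groups per-question results by a label and reports accuracy per group. It is a minimal sketch, not the authors' evaluation code. It assumes a "regional gap" means the difference between the best and worst per-region accuracy, which this summary does not confirm. The record fields (`region`, `correct`), the helper names, and the toy data are all hypothetical.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return {group: accuracy} for records that carry a boolean 'correct' field."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        hits[group] += int(record["correct"])
    return {group: hits[group] / totals[group] for group in totals}


def max_gap(accuracies):
    """Assumed definition of a gap: best group accuracy minus worst group accuracy."""
    return max(accuracies.values()) - min(accuracies.values())


# Hypothetical toy results: region "A" gets 3 of 4 right, region "B" gets 2 of 4.
records = (
    [{"region": "A", "correct": c} for c in (True, True, True, False)]
    + [{"region": "B", "correct": c} for c in (True, True, False, False)]
)

regional_accuracy = accuracy_by_group(records, "region")
regional_gap = max_gap(regional_accuracy)
print(regional_accuracy, regional_gap)
```

The same grouping helper would work for a per-task breakdown (for example, isolating false belief questions) by passing a different key, provided each record carries that field.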
Abstract
Theory of Mind (ToM) - the ability to attribute beliefs and intents to others - is fundamental for social intelligence, yet Vision-Language Model (VLM) evaluations remain largely Western-centric. In this work, we introduce CulturalToM-VQA, a benchmark of 5,095 visually situated ToM probes across diverse cultural contexts, rituals, and social norms. Constructed through a frontier proprietary MLLM, human-verified pipeline, the dataset spans a taxonomy of six ToM tasks and four complexity levels. We benchmark 10 VLMs (2023-2025) and observe a significant performance leap: while earlier models struggle, frontier models achieve high accuracy (>93%). However, significant limitations persist: models struggle with false belief reasoning (19-83% accuracy) and show high regional variance (20-30% gaps). Crucially, we find that SOTA models exhibit social desirability bias - systematically favoring…
Taxonomy
Topics: Multimodal Machine Learning Applications · Language, Metaphor, and Cognition · Social Robot Interaction and HRI
