Are Vision Language Models Cross-Cultural Theory of Mind Reasoners?

Zabir Al Nazi; GM Shahariar; Md. Abrar Hossain; Wei Peng

arXiv:2512.17394 · cs.CL · January 8, 2026

Open Access

TL;DR

This paper introduces CulturalToM-VQA, a diverse benchmark for evaluating Vision-Language Models' ability to perform cross-cultural Theory of Mind reasoning, revealing significant performance gaps and biases in current models.

Contribution

The paper presents a new culturally diverse ToM benchmark and evaluates 10 VLMs, highlighting limitations in false belief reasoning, regional biases, and social desirability effects.

Findings

01. Frontier models achieve over 93% accuracy on the benchmark's ToM tasks.

02. Models struggle with false belief reasoning, where accuracy ranges from 19% to 83%.

03. Models show high regional variance, with accuracy gaps of 20-30% across regions.
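The per-task accuracy and regional gap figures above can be illustrated with a minimal sketch. It assumes each evaluated probe is stored as a record with hypothetical `region`, `task`, and `correct` fields, and it takes the "gap" to be the difference between the best and worst per-region accuracy. The paper's exact data schema and gap definition are not given here, so the field names, the toy records, and the `emotion` task label are illustrative assumptions only.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return {group: accuracy} where groups are the values of `key`.

    Each record is assumed (hypothetically) to be a dict with a boolean
    'correct' field and a grouping field such as 'region' or 'task'.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        hits[group] += int(record["correct"])
    return {group: hits[group] / totals[group] for group in totals}


def max_gap(accuracies):
    """Difference between the highest and lowest group accuracy."""
    values = list(accuracies.values())
    return max(values) - min(values)


# Toy records, invented for illustration; not data from the paper.
records = [
    {"region": "A", "task": "false_belief", "correct": True},
    {"region": "A", "task": "false_belief", "correct": False},
    {"region": "A", "task": "emotion", "correct": True},
    {"region": "A", "task": "emotion", "correct": True},
    {"region": "B", "task": "false_belief", "correct": False},
    {"region": "B", "task": "false_belief", "correct": False},
    {"region": "B", "task": "emotion", "correct": True},
    {"region": "B", "task": "emotion", "correct": True},
]

by_region = accuracy_by_group(records, "region")
by_task = accuracy_by_group(records, "task")
regional_gap = max_gap(by_region)

print(by_region)
print(by_task)
print(regional_gap)
```

Grouping by an arbitrary key keeps the same helper usable for both the per-task breakdown (for example, isolating false belief probes) and the per-region breakdown.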

Abstract

Theory of Mind (ToM) - the ability to attribute beliefs and intents to others - is fundamental for social intelligence, yet Vision-Language Model (VLM) evaluations remain largely Western-centric. In this work, we introduce CulturalToM-VQA, a benchmark of 5,095 visually situated ToM probes across diverse cultural contexts, rituals, and social norms. Constructed through a human-verified pipeline built on a frontier proprietary MLLM, the dataset spans a taxonomy of six ToM tasks and four complexity levels. We benchmark 10 VLMs (2023-2025) and observe a significant performance leap: while earlier models struggle, frontier models achieve high accuracy (>93%). However, significant limitations persist: models struggle with false belief reasoning (19-83% accuracy) and show high regional variance (20-30% gaps). Crucially, we find that SOTA models exhibit social desirability bias - systematically favoring…

Taxonomy

Topics: Multimodal Machine Learning Applications · Language, Metaphor, and Cognition · Social Robot Interaction and HRI