R^3-VQA: "Read the Room" by Video Social Reasoning
Lixing Niu, Jiapeng Li, Xingping Yu, Shu Wang, Ruining Feng, Bo Wu, Ping Wei, Yisen Wang, Lifeng Fan

TL;DR
This paper introduces R^3-VQA, a comprehensive video dataset for complex social reasoning tasks, and evaluates current vision-language models on it, revealing their limitations and showing that Theory of Mind prompting improves their social reasoning.
Contribution
The paper presents R^3-VQA, a detailed dataset with fine-grained annotations for social reasoning, and benchmarks state-of-the-art models, highlighting their gaps in human-like social understanding.
Findings
Large vision-language models (LVLMs) perform below human level in complex social scenarios.
Theory of Mind prompting improves LVLMs' social reasoning performance.
The dataset enables more realistic social reasoning evaluation.
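As a rough illustration of the Theory of Mind prompting finding above, the sketch below builds a text prompt that asks a model to reason about the four mental states named in the abstract (belief, intent, desire, emotion) before answering a question. The `VideoQuestion` class, the `build_tom_prompt` function, the multiple-choice format, and the prompt wording are all assumptions made for illustration; they are not taken from the paper.

```python
from dataclasses import dataclass

# The four mental-state variables named in the abstract.
MENTAL_STATES = ("belief", "intent", "desire", "emotion")


@dataclass
class VideoQuestion:
    """Hypothetical container for one multiple-choice question about a clip."""
    question: str
    options: list[str]


def build_tom_prompt(item: VideoQuestion) -> str:
    """Build a prompt that asks for mental-state inferences before the answer.

    The wording is an assumption, not the prompt used by the authors.
    """
    lines = ["Watch the video and reason about the people in it step by step."]
    # One reasoning step per mental state, numbered from 1.
    for i, state in enumerate(MENTAL_STATES, start=1):
        lines.append(f"{i}. Describe each person's {state}.")
    lines.append(f"Question: {item.question}")
    # Label options A-D; zip stops at four options in this sketch.
    for label, option in zip("ABCD", item.options):
        lines.append(f"{label}. {option}")
    lines.append("Answer with a single option letter.")
    return "\n".join(lines)


example = VideoQuestion(
    question="Why does she leave the table?",
    options=["She is bored", "She is upset"],
)
prompt = build_tom_prompt(example)
```

The resulting string would be sent to a model together with the video frames; how frames are supplied depends on the model and is left out here.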
Abstract
"Read the room" is a significant social reasoning capability in human daily life. Humans can infer others' mental states from subtle social cues. Previous social reasoning tasks and datasets lack complexity (e.g., simple scenes, basic interactions, incomplete mental state variables, single-step reasoning, etc.) and fall far short of the challenges present in real-life social interactions. In this paper, we contribute a valuable, high-quality, and comprehensive video dataset named R^3-VQA with precise and fine-grained annotations of social events and mental states (i.e., belief, intent, desire, and emotion) as well as corresponding social causal chains in complex social scenarios. Moreover, we include human-annotated and model-generated QAs. Our task R^3-VQA includes three aspects: Social Event Understanding, Mental State Estimation, and Social Causal Reasoning. As a benchmark, we…
Taxonomy
Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Social Robot Interaction and HRI
