MimeQA: Towards Socially-Intelligent Nonverbal Foundation Models
Hengzhi Li, Megan Tjandrasuwita, Yi R. Fung, Armando Solar-Lezama, Paul Pu Liang

TL;DR
This paper introduces MimeQA, a new dataset and benchmark for evaluating AI's ability to understand nonverbal social cues through mime videos, highlighting current models' limitations in nonverbal reasoning.
Contribution
The paper presents MimeQA, a novel dataset and benchmark for nonverbal social reasoning, and evaluates existing VideoLLMs, revealing their deficiencies in interpreting nonverbal cues.
Findings
VideoLLMs achieve 20-30% accuracy on MimeQA
Humans score 86% on the benchmark
Models struggle with grounding objects and understanding subtle interactions
Abstract
As AI becomes more closely integrated with people's daily activities, socially intelligent AI that can understand and interact seamlessly with humans in their daily lives is increasingly important. However, current work in AI social reasoning relies on language-only or language-dominant approaches to benchmark and train models, resulting in systems that are improving in verbal communication but struggle with nonverbal social understanding. To address this limitation, we tap into a novel data source rich in nonverbal social interactions -- mime videos. Mime refers to the art of expression through gesture and movement without spoken words, which presents unique challenges and opportunities in interpreting nonverbal social communication. We contribute a new dataset called MimeQA, obtained by sourcing ~8 hours of video clips from YouTube and developing a comprehensive video…
Taxonomy
Topics: Robotics and Automated Systems
