Can Vision Language Models Understand Mimed Actions?

Hyundong Cho; Spencer Lin; Tejas Srinivasan; Michael Saxon; Deuksin Kwon; Natali T. Chavez; Jonathan May

arXiv:2506.21586 · cs.CL · August 8, 2025

TL;DR

This paper introduces MIME, a benchmark for evaluating how well vision-language models understand mimed actions. Current models perform significantly worse than humans on it, which highlights the need for more robust gesture understanding.

Contribution

The paper presents MIME, a novel video-based question answering benchmark for mimed actions, with diverse variations and perturbations to assess recognition robustness in vision-language models.

Findings

01. Models perform worse than humans on MIME.

02. Both open-weight and API-based models show significant recognition gaps relative to humans.

03. MIME highlights the need for more robust gesture understanding in models.

Abstract

Nonverbal communication (NVC) plays an integral role in human language, but studying NVC in general is challenging because of its broad scope and high variance in interpretation among individuals and cultures. However, mime -- the theatrical technique of suggesting intent using only gesture, expression, and movement -- is a subset of NVC that consists of explicit and embodied actions with much lower human interpretation variance. We argue that a solid understanding of mimed actions is a crucial prerequisite for vision-language models capable of interpreting and commanding more subtle aspects of NVC. Hence, we propose Mime Identification Multimodal Evaluation (MIME), a novel video-based question answering benchmark comprising 86 mimed actions. Constructed with motion capture data, MIME consists of variations of each action with perturbations applied to the character, background, and…

Taxonomy

Topics: Hand Gesture Recognition Systems · Multimodal Machine Learning Applications · Action Observation and Synchronization