MOTOR-Bench: A Real-world Dataset and Multi-agent Framework for Zero-shot Human Mental State Understanding

Xiaoyu Yuan; Niklas Heikkala; Tiina Törmänen; Hanna Järvenoja; Guoying Zhao; Haoyu Chen

arXiv:2605.09703·cs.CV·May 12, 2026

TL;DR

MOTOR-Bench introduces a real-world multimodal dataset and a multi-agent framework for zero-shot understanding of human mental states in complex social interactions.

Contribution

The paper presents MOTOR-Bench, a new dataset with expert annotations and a multi-agent reasoning system, advancing structured mental state analysis from observable behavior.

Findings

01

The multi-agent framework outperforms single-model baselines by 15.93 Macro-F1 points.

02

Existing multimodal models show limited performance on structured mental state inference.

03

MOTOR-Bench captures real-world challenges like class imbalance and visual noise.
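
Macro-F1, the metric quoted in the first finding, averages per-class F1 with equal weight per class, so it penalizes a model that ignores rare classes even when accuracy looks high. The sketch below illustrates this on a toy example; the function name `macro_f1` and the labels are hypothetical and are not drawn from MOTOR-dataset or the paper's evaluation code.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        # F1 = 2TP / (2TP + FP + FN); define it as 0 when the class never appears
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical imbalanced toy labels: 8 samples of class "a", 2 of class "b".
y_true = ["a"] * 8 + ["b"] * 2
# A degenerate predictor that always outputs the majority class.
y_pred = ["a"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(f"accuracy={accuracy:.3f}  macro_f1={score:.3f}")
```

Here the majority-class predictor reaches 0.8 accuracy, but class "b" contributes an F1 of 0, which pulls Macro-F1 down to about 0.444. This is the reason Macro-F1 is a common choice for datasets with natural class imbalance.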

Abstract

Understanding human mental states from natural behavior is crucial for intelligent systems in the real world. However, most current research focuses on predicting isolated mental state labels and lacks structured annotations of complex interpersonal interactions. To support structured analysis, we introduce MOTOR-Bench, a carefully designed benchmark built on a real-world dataset, MOTOR-dataset, which contains 1,440 multimodal video clips from collaborative learning scenarios and reflects key real-world data challenges, including natural class imbalance, visual noise, and domain-specific language. Each sample is labeled by educational experts based on self-regulated learning theory. We further evaluate several state-of-the-art multimodal large language models and multi-agent systems in a zero-shot setting on MOTOR-Bench. Their performance on this task remains limited, suggesting that…
