TL;DR
EmoTrans is a comprehensive benchmark designed to evaluate multimodal large language models' ability to understand, reason about, and predict emotion transitions in dynamic social video scenarios.
Contribution
The paper introduces EmoTrans, a benchmark of annotated video clips and QA pairs that assesses how well multimodal models understand emotion dynamics across four progressive tasks.
Findings
Current models perform well on coarse emotion change detection but struggle with fine-grained dynamics.
Multi-person social scenarios are particularly challenging for existing models.
Reasoning tasks do not always improve model performance significantly.
Abstract
Recent multimodal large language models (MLLMs) have shown strong capabilities in perception, reasoning, and generation, and are increasingly used in applications such as social robots and human-computer interaction, where understanding human emotions is essential. However, existing benchmarks mainly formulate emotion understanding as a static recognition problem, leaving it largely unclear whether current MLLMs can understand emotion as a dynamic process that evolves, shifts between states, and unfolds across diverse social contexts. To bridge this gap, we present EmoTrans, a benchmark for evaluating emotion dynamics understanding in multimodal videos. EmoTrans contains 1,000 carefully collected and manually annotated video clips, covering 12 real-world scenarios, and further provides over 3,000 task-specific question-answer (QA) pairs for fine-grained evaluation. The benchmark…
