MME-Emotion: A Holistic Evaluation Benchmark for Emotional Intelligence in Multimodal Large Language Models
Fan Zhang, Zebang Cheng, Chong Deng, Haoxuan Li, Zheng Lian, Qian Chen, Huadai Liu, Wen Wang, Yi-Fan Zhang, Renrui Zhang, Ziyu Guo, Zhihong Zhu, Hao Wu, Haixin Wang, Yefeng Zheng, Xiaojiang Peng, Xian Wu, Kun Wang, Xiangang Li, Jieping Ye, Pheng-Ann Heng

TL;DR
This paper introduces MME-Emotion, the largest benchmark for evaluating emotional intelligence in multimodal large language models, assessing their understanding and reasoning across diverse scenarios to identify strengths and limitations.
Contribution
It presents a comprehensive, scalable benchmark with over 6,000 video clips and hybrid metrics, enabling systematic evaluation of MLLMs' emotional understanding and reasoning capabilities.
Findings
Current MLLMs show limited emotional recognition accuracy.
Generalist models rely on broad multimodal understanding.
Specialist models achieve similar performance via domain-specific training.
Abstract
Recent advances in multimodal large language models (MLLMs) have catalyzed transformative progress in affective computing, enabling models to exhibit emergent emotional intelligence. Despite substantial methodological progress, current emotional benchmarks remain limited, as it is still unknown: (a) the generalization abilities of MLLMs across distinct scenarios, and (b) their reasoning capabilities to identify the triggering factors behind emotional states. To bridge these gaps, we present MME-Emotion, a systematic benchmark that assesses both emotional understanding and reasoning capabilities of MLLMs, enjoying scalable capacity, diverse settings, and unified protocols. As the largest emotional intelligence benchmark for MLLMs, MME-Emotion contains over 6,000 curated video clips with task-specific questioning-answering (QA) pairs, spanning broad…
Peer Reviews
Decision: ICLR 2026 Poster
The paper presents a comprehensive testing suite and evaluates a wide variety of models, comparing reasoning and non-reasoning models. The authors' adaptation of video data for non-video models is interesting.
It is hard to know whether the MLLMs perform poorly on emotion tasks because they struggle with modality fusion or have a weaker perception system, or whether the problem truly lies in emotion recognition. To test this, the authors could have taken one of the unimodal datasets cited in Table 1 and evaluated each model on a text-only or image-only version of the task. That way the reader would be better informed whether the issue was modality/modality-fusion
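The unimodal comparison suggested in this review could be sketched roughly as follows. This is an illustrative sketch only, not the paper's evaluation protocol: the function names, the setting labels (`text-only`, `image-only`, `multimodal`), and the toy records are all assumptions made up for the example.

```python
from collections import defaultdict


def accuracy_by_setting(records):
    """Compute accuracy per input setting.

    records: iterable of (setting, predicted_label, gold_label) tuples.
    The setting names are hypothetical labels chosen for this sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for setting, predicted, gold in records:
        total[setting] += 1
        correct[setting] += int(predicted == gold)
    return {s: correct[s] / total[s] for s in total}


def fusion_gain(acc, fused="multimodal"):
    """Fused accuracy minus the best single-modality accuracy.

    A negative value hints that combining modalities hurts, which would
    point toward fusion rather than emotion recognition itself.
    """
    unimodal = [a for s, a in acc.items() if s != fused]
    return acc[fused] - max(unimodal)


# Toy, made-up predictions for illustration only (not results from the paper).
records = [
    ("text-only", "happy", "happy"),
    ("text-only", "sad", "angry"),
    ("image-only", "happy", "happy"),
    ("image-only", "angry", "angry"),
    ("multimodal", "happy", "happy"),
    ("multimodal", "sad", "angry"),
]

acc = accuracy_by_setting(records)
gain = fusion_gain(acc)
print(acc, gain)
```

Comparing the fused score against the best unimodal score is one simple way to separate the two explanations; a real study would need the same clips, prompts, and label set in every setting.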
1. MME-Emotion has the leading benchmark scale and scenario coverage, which includes 6,500 curated video clips with task-specific QA pairs and 8 emotional tasks. This benchmark could enable fine-grained evaluation of model generalization and address the insufficient scenario coverage of existing benchmarks. 2. An automated evaluation suite is proposed. A multi-agent system-based evaluation framework is designed, which could evaluate the performance of MLLMs without manual annotation of reasonin
1. Compared with existing emotional intelligence benchmarks, the main contributions of this paper lie in two aspects: first, it incorporates relevant evaluations for emotional reasoning capabilities; second, it designs a large model-based automated evaluation algorithm. 2. The paper only considers various emotion recognition tasks and does not include emotion generation tasks. Can recognition-only tasks sufficiently and comprehensively assess the emotional intelligence of models? 3. The paper
1) MME-Emotion is the first holistic benchmark to evaluate both the presence of an emotion and the reason for it, offering a novel method that goes beyond mere classification. The multi-agent automated evaluation is a creative solution to the absence of annotated reasoning chains. 2) The large-scale benchmark comprises 6,500 clips, 27 scenarios and eight tasks, and is balanced in terms of duration and question distribution. Human validation of the automated scoring adds credibility. 3) The paper
1) Although the article acknowledges that specialized models (e.g. Audio-Reasoner) outperform their multimodal counterparts, it does not provide a systematic analysis of the contribution of individual modalities. Ablation experiments (e.g. running the same MLLM in audio-only, video-only and audio+video modes) could reveal whether the problem lies in the modalities' integration being inefficient or in noise/conflict between the modalities. I would like to see more detail on this, either in the fo
