MME-Emotion: A Holistic Evaluation Benchmark for Emotional Intelligence in Multimodal Large Language Models

Fan Zhang; Zebang Cheng; Chong Deng; Haoxuan Li; Zheng Lian; Qian Chen; Huadai Liu; Wen Wang; Yi-Fan Zhang; Renrui Zhang; Ziyu Guo; Zhihong Zhu; Hao Wu; Haixin Wang; Yefeng Zheng; Xiaojiang Peng; Xian Wu; Kun Wang; Xiangang Li; Jieping Ye; Pheng-Ann Heng

arXiv:2508.09210·cs.CV·February 12, 2026

TL;DR

This paper introduces MME-Emotion, the largest benchmark for evaluating emotional intelligence in multimodal large language models, assessing their understanding and reasoning across diverse scenarios to identify strengths and limitations.

Contribution

It presents a comprehensive, scalable benchmark with over 6,000 video clips and hybrid metrics, enabling systematic evaluation of MLLMs' emotional understanding and reasoning capabilities.

Findings

01

Current MLLMs show limited emotional recognition accuracy.

02

Generalist models rely on broad multimodal understanding.

03

Specialist models achieve similar performance via domain-specific training.

Abstract

Recent advances in multimodal large language models (MLLMs) have catalyzed transformative progress in affective computing, enabling models to exhibit emergent emotional intelligence. Despite substantial methodological progress, current emotional benchmarks remain limited, as two questions are still open: (a) how well MLLMs generalize across distinct scenarios, and (b) how well they can reason about the triggering factors behind emotional states. To bridge these gaps, we present MME-Emotion, a systematic benchmark that assesses both emotional understanding and reasoning capabilities of MLLMs, featuring scalable capacity, diverse settings, and unified protocols. As the largest emotional intelligence benchmark for MLLMs, MME-Emotion contains over 6,000 curated video clips with task-specific question-answering (QA) pairs, spanning broad…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 4·Confidence 4

Strengths

The paper presents a comprehensive testing suite and evaluates a wide variety of models. It also tests reasoning vs. non-reasoning. It is interesting how the authors adapted video data for non-video models.

Weaknesses

It is hard to know whether the MLLMs perform poorly on emotion tasks because they struggle with modality fusion or have a weaker perception system, or because emotion recognition itself is the problem. To test this, the authors could have taken one of the unimodal datasets cited in Table 1 and evaluated each model on a text-only or image-only version of the task. That way the reader would be better informed whether the issue was modality/modality-fusion

Reviewer 02·Rating 8·Confidence 4

Strengths

1. MME-Emotion has the leading benchmark scale and scenario coverage, which includes 6,500 curated video clips with task-specific QA pairs and 8 emotional tasks. This benchmark could enable fine-grained evaluation of model generalization and address the insufficient scenario coverage of existing benchmarks.
2. An automated evaluation suite is proposed. A multi-agent system-based evaluation framework is designed, which could evaluate the performance of MLLMs without manual annotation of reasonin

Weaknesses

1. Compared with existing emotional intelligence benchmarks, the main contributions of this paper lie in two aspects: first, it incorporates relevant evaluations for emotional reasoning capabilities; second, it designs a large model-based automated evaluation algorithm.
2. The paper only considers various emotion recognition tasks and does not include emotion generation tasks. Can recognition-only tasks sufficiently and comprehensively assess the emotional intelligence of models?
3. The paper

Reviewer 03·Rating 6·Confidence 5

Strengths

1) MME-Emotion is the first holistic benchmark to evaluate both the presence of an emotion and the reason for it, offering a novel method that goes beyond mere classification. The multi-agent automated evaluation is a creative solution to the absence of annotated reasoning chains.
2) The large-scale benchmark comprises 6,500 clips, 27 scenarios and eight tasks, and is balanced in terms of duration and question distribution. Human validation of the automated scoring adds credibility.
3) The paper

Weaknesses

1) Although the article acknowledges that specialized models (e.g. Audio-Reasoner) outperform their multimodal counterparts, it does not provide a systematic analysis of the contribution of individual modalities. Ablation experiments (e.g. running the same MLLM in audio-only, video-only and audio+video modes) could reveal whether the problem lies in the modalities' integration being inefficient or in noise/conflict between the modalities. I would like to see more detail on this, either in the fo
