EmotionHallucer: Evaluating Emotion Hallucinations in Multimodal Large Language Models

Bohao Xing; Xin Liu; Guoying Zhao; Chengyu Liu; Xiaolan Fu; Heikki Kälviäinen

arXiv:2505.11405 · cs.CV · May 19, 2025

TL;DR

This paper introduces EmotionHallucer, a benchmark for evaluating emotion hallucinations in multimodal large language models (MLLMs). It finds that most evaluated models hallucinate substantially and proposes the PEP-MEK framework to improve hallucination detection.

Contribution

It is the first benchmark dedicated to detecting and analyzing emotion hallucinations in MLLMs, leveraging emotion psychology and adversarial QA for robust evaluation.

Findings

01. Most models show substantial emotion hallucination issues.

02. Closed-source models outperform open-source ones in hallucination detection.

03. Models perform better on emotion psychology knowledge than on multimodal emotion perception.

Abstract

Emotion understanding is a critical yet challenging task. Recent advances in Multimodal Large Language Models (MLLMs) have significantly enhanced their capabilities in this area. However, MLLMs often suffer from hallucinations, generating irrelevant or nonsensical content. To the best of our knowledge, despite the importance of this issue, there has been no dedicated effort to evaluate emotion-related hallucinations in MLLMs. In this work, we introduce EmotionHallucer, the first benchmark for detecting and analyzing emotion hallucinations in MLLMs. Unlike humans, whose emotion understanding stems from the interplay of biology and social learning, MLLMs rely solely on data-driven learning and lack innate emotional instincts. Fortunately, emotion psychology provides a solid foundation of knowledge about human emotions. Building on this, we assess emotion hallucinations from two…

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

This study fills a critical gap in the evaluation of emotional hallucinations, offering the first benchmark tailored for MLLM emotional understanding. The introduction of the PEP-MEK framework shows significant effectiveness, enhancing model performance in hallucination detection. The authors provide robust experimental data and statistical evidence to support their conclusions, increasing the paper's credibility. The research methodology integrates insights from emotion psychology, ensuring the

Weaknesses

Language Limitation: The study is restricted to English, failing to account for cross-linguistic and cross-cultural variations in emotional expression.
Complex Definitions: The definitions and classifications of emotional hallucinations may be overly intricate, potentially leading to ambiguity in the evaluation process.
Suboptimal Performance: Model performance in processing multimodal data, particularly in audio and video emotional understanding, remains inadequate.
Result Stability: The stabil

Reviewer 02 · Rating 6 · Confidence 4

Strengths

This is the first benchmark dedicated to emotion hallucinations, spanning both psychology knowledge and multimodal perception; prior hallucination suites are general-purpose. The seven subcategories (theory/definition/finding; category/intensity/reasoning cue/reasoning result) make the construct very concrete. The adversarial paired QA design (basic vs. hallucinated) is a neat, low-variance way to test hallucination detection, going beyond typical caption/LLM-judge setups.
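
To make the paired design concrete, here is a minimal sketch of how answers to basic/hallucinated pairs could be scored. It assumes a yes/no answer format in which the basic question expects "yes" and the hallucinated one expects "no"; that format, the `PairAnswer` class, the `score_pairs` function, and the three metric names are hypothetical illustrations, not the benchmark's actual protocol or its reported metrics.

```python
from dataclasses import dataclass


@dataclass
class PairAnswer:
    """A model's answers to one adversarial pair (hypothetical structure)."""
    basic: str         # answer to the basic (faithful) question
    hallucinated: str  # answer to the hallucinated question


def normalize(answer: str) -> str:
    """Lowercase and trim so 'Yes ' and 'yes' compare equal."""
    return answer.strip().lower()


def score_pairs(answers, basic_gold="yes", hallucinated_gold="no"):
    """Score paired answers under the ASSUMED yes/no gold labels.

    Returns accuracy on basic questions, accuracy on hallucinated
    questions, and the fraction of pairs where both were answered correctly.
    """
    if not answers:
        raise ValueError("need at least one answered pair")
    n = len(answers)
    basic_ok = [normalize(a.basic) == basic_gold for a in answers]
    halluc_ok = [normalize(a.hallucinated) == hallucinated_gold for a in answers]
    both_ok = [b and h for b, h in zip(basic_ok, halluc_ok)]
    return {
        "basic_acc": sum(basic_ok) / n,
        "hallucinated_acc": sum(halluc_ok) / n,
        "pair_acc": sum(both_ok) / n,
    }


if __name__ == "__main__":
    demo = [
        PairAnswer("yes", "no"),   # both correct
        PairAnswer("Yes", "yes"),  # accepts the hallucinated claim
        PairAnswer("no", "no"),    # rejects the faithful claim
        PairAnswer("yes", "yes"),  # accepts the hallucinated claim
    ]
    print(score_pairs(demo))
```

Scoring the pair jointly is what separates a model that actually checks the claim from one that answers "yes" (or "no") to everything: such a model can look strong on one half of the pairs but gets no credit on `pair_acc`.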

Weaknesses

1. Adversarial pair construction & QA artifacts. The process risks introducing superficial cues between the basic and “hallucinated” versions. Report inter-annotator agreement, pair-level quality controls, and checks against annotation artifacts (e.g., spurious lexical markers).
2. Latency/compute overhead and failure cases are not quantified. A wall-clock and token-cost comparison is needed here, along with ablations for each PEP-MEK component and per-subcategory gains.

Reviewer 03 · Rating 4 · Confidence 4

Strengths

### A novel benchmark (EmotionHallucer) specifically targeting emotion hallucinations in MLLMs:
- Covers multiple modalities and multiple diagnostic levels (perception, emotion knowledge, reasoning results).
- Uses adversarially constructed basic vs. hallucinated QA pairs to probe hallucination propensity.
### Large-scale empirical evaluation and analysis:
- Systematic evaluation of 41 MLLMs (both open- and closed-source) with detailed metrics (Pct. Diff, FP Ratio, separate Basic vs. Hallucinate

Weaknesses

### Annotation noise and scope limited to English and certain datasets:
- The benchmark relies on human annotation (e.g., creating hallucinated variants), admitting annotation noise.
- The dataset is English-only and does not address cross-lingual or cultural variability in emotional expression.
### Partial exploration of root causes:
- While the paper documents hallucination phenomena and correlates them with modality and model class, it does not deeply investigate underlying causes (e.g., pret

Code & Models

Repositories

xxtars/emotionhallucer (Official)

Taxonomy

Topics: Mental Health via Writing · Topic Modeling · Emotion and Mood Recognition