EmotionHallucer: Evaluating Emotion Hallucinations in Multimodal Large Language Models
Bohao Xing, Xin Liu, Guoying Zhao, Chengyu Liu, Xiaolan Fu, Heikki Kälviäinen

TL;DR
This paper introduces EmotionHallucer, a benchmark for evaluating emotion hallucinations in multimodal large language models. It finds that most evaluated models hallucinate substantially and proposes the PEP-MEK framework to improve hallucination detection.
Contribution
It is the first benchmark dedicated to detecting and analyzing emotion hallucinations in MLLMs, leveraging emotion psychology and adversarial QA for robust evaluation.
Findings
Most models show substantial emotion hallucination issues.
Closed-source models outperform open-source ones at detecting emotion hallucinations.
Models perform better on emotion psychology knowledge than on multimodal perception.
Abstract
Emotion understanding is a critical yet challenging task. Recent advances in Multimodal Large Language Models (MLLMs) have significantly enhanced their capabilities in this area. However, MLLMs often suffer from hallucinations, generating irrelevant or nonsensical content. To the best of our knowledge, despite the importance of this issue, there has been no dedicated effort to evaluate emotion-related hallucinations in MLLMs. In this work, we introduce EmotionHallucer, the first benchmark for detecting and analyzing emotion hallucinations in MLLMs. Unlike humans, whose emotion understanding stems from the interplay of biology and social learning, MLLMs rely solely on data-driven learning and lack innate emotional instincts. Fortunately, emotion psychology provides a solid foundation of knowledge about human emotions. Building on this, we assess emotion hallucinations from two…
Peer Reviews
Decision · ICLR 2026 Poster
This study fills a critical gap in the evaluation of emotional hallucinations, offering the first benchmark tailored for MLLM emotional understanding. The introduction of the PEP-MEK framework shows significant effectiveness, enhancing model performance in hallucination detection. The authors provide robust experimental data and statistical evidence to support their conclusions, increasing the paper's credibility. The research methodology integrates insights from emotion psychology, ensuring the
Language Limitation: The study is restricted to English, failing to account for cross-linguistic and cross-cultural variations in emotional expression.
Complex Definitions: The definitions and classifications of emotional hallucinations may be overly intricate, potentially leading to ambiguity in the evaluation process.
Suboptimal Performance: Model performance in processing multimodal data, particularly in audio and video emotional understanding, remains inadequate.
Result Stability: The stabil
This is the first benchmark dedicated to emotion hallucinations, spanning both psychology knowledge and multimodal perception; prior hallucination suites are general-purpose. The seven subcategories (theory/definition/finding; category/intensity/reasoning cue/reasoning result) make the construct very concrete. The adversarial paired QA design (basic vs. hallucinated) is a neat, low-variance way to test hallucination detection, going beyond typical caption/LLM-judge setups.
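To make the paired design mentioned above concrete, here is a minimal sketch of how basic/hallucinated pairs could be scored. It assumes each pair consists of a basic question whose correct answer is "yes" and a hallucinated question whose correct answer is "no", and that a pair counts as correct only when both are answered correctly. The `PairResult` class, the `score_pairs` function, and the sample answers are hypothetical illustrations, not the benchmark's actual code or data.

```python
from dataclasses import dataclass


@dataclass
class PairResult:
    """Model answers for one adversarial pair (hypothetical structure)."""
    basic_answer: str         # answer to the basic question, expected "yes"
    hallucinated_answer: str  # answer to the hallucinated question, expected "no"


def score_pairs(results):
    """Return per-question and pair-level accuracy over a list of PairResult.

    Assumption: "yes" is correct for basic questions and "no" is correct
    for hallucinated questions.
    """
    n = len(results)
    if n == 0:
        raise ValueError("results must not be empty")
    basic_ok = sum(r.basic_answer == "yes" for r in results)
    halluc_ok = sum(r.hallucinated_answer == "no" for r in results)
    # A pair is credited only if the model gets both questions right.
    both_ok = sum(
        r.basic_answer == "yes" and r.hallucinated_answer == "no"
        for r in results
    )
    return {
        "basic_acc": basic_ok / n,
        "hallucinated_acc": halluc_ok / n,
        "pair_acc": both_ok / n,
    }


# Made-up answers for four pairs, purely for illustration.
results = [
    PairResult("yes", "no"),   # both correct
    PairResult("yes", "yes"),  # accepts the hallucinated statement
    PairResult("no", "no"),    # rejects the true statement
    PairResult("yes", "no"),   # both correct
]
scores = score_pairs(results)
print(scores)
```

Under these assumptions, a model that answers "yes" to everything scores perfectly on basic questions but zero at the pair level, which is why pair-level scoring is harder to game than per-question accuracy.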
1. Adversarial pair construction & QA artifacts. The process risks introducing superficial cues between the basic and “hallucinated” versions. Report inter-annotator agreement, pair-level quality controls, and checks against annotation artifacts (e.g., spurious lexical markers).
2. Latency/compute overhead and failure cases are not quantified. A wall-clock and token-cost comparison is needed, along with ablations for each PEP-MEK component and per-subcategory gains.
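One common way to report the inter-annotator agreement requested above is Cohen's kappa for two annotators. The sketch below is an illustration only: the function name `cohens_kappa`, the label values, and the sample annotations are hypothetical, and the paper may use a different agreement statistic.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up judgments on whether four hallucinated variants are valid.
annotator_1 = ["valid", "valid", "invalid", "valid"]
annotator_2 = ["valid", "invalid", "invalid", "valid"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(kappa)
```

Kappa corrects raw agreement for the agreement expected by chance, so it is more informative than percent agreement when one label dominates.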
### A novel benchmark (EmotionHallucer) specifically targeting emotion hallucinations in MLLMs:
- Covers multiple modalities and multiple diagnostic levels (perception, emotion knowledge, reasoning results).
- Uses adversarially constructed basic vs. hallucinated QA pairs to probe hallucination propensity.

### Large-scale empirical evaluation and analysis:
- Systematic evaluation of 41 MLLMs (both open- and closed-source) with detailed metrics (Pct. Diff, FP Ratio, separate Basic vs. Hallucinate
### Annotation noise and scope limited to English and certain datasets:
- The benchmark relies on human annotation (e.g., creating hallucinated variants), admitting annotation noise.
- The dataset is English-only and does not address cross-lingual or cultural variability in emotional expression.

### Partial exploration of root causes:
- While the paper documents hallucination phenomena and correlates them with modality and model class, it does not deeply investigate underlying causes (e.g., pret
Taxonomy
Topics: Mental Health via Writing · Topic Modeling · Emotion and Mood Recognition
