The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio
Sicong Leng, Yun Xing, Zesen Cheng, Yang Zhou, Hang Zhang, Xin Li, Deli Zhao, Shijian Lu, Chunyan Miao, Lidong Bing

TL;DR
This paper systematically investigates hallucinations in large multimodal models (LMMs) across language, visual, and audio modalities, identifies overreliance on unimodal priors and spurious inter-modality correlations as key contributors, and introduces a benchmark, The Curse of Multi-Modalities (CMM), to evaluate them.
Contribution
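The paper presents what the authors describe as the first systematic investigation of hallucinations in LMMs spanning language, visual, and audio inputs. It introduces the CMM benchmark for evaluating these hallucinations and analyzes their main causes.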
Findings
Hallucinations stem from overreliance on unimodal priors.
Spurious inter-modality correlations contribute to hallucinations.
Imbalances in modality integration lead to increased hallucination risk.
Abstract
Recent advancements in large multimodal models (LMMs) have significantly enhanced performance across diverse tasks, with ongoing efforts to further integrate additional modalities such as video and audio. However, most existing LMMs remain vulnerable to hallucinations, the discrepancy between the factual multimodal input and the generated textual output, which has limited their applicability in various real-world scenarios. This paper presents the first systematic investigation of hallucinations in LMMs involving the three most common modalities: language, visual, and audio. Our study reveals two key contributors to hallucinations: overreliance on unimodal priors and spurious inter-modality correlations. To address these challenges, we introduce the benchmark The Curse of Multi-Modalities (CMM), which comprehensively evaluates hallucinations in LMMs, providing a detailed analysis of…
Taxonomy
Topics: Digital Media Forensic Detection
