Robust Multimodal Large Language Models Against Modality Conflict
Zongmeng Zhang, Wengang Zhou, Jie Zhao, Houqiang Li

TL;DR
This paper identifies modality conflict as a key cause of hallucinations in multimodal large language models and proposes methods to mitigate this issue, improving model robustness in vision-language tasks.
Contribution
The paper formally defines modality conflict, constructs the MMMC dataset, and evaluates three mitigation methods (prompt engineering, supervised fine-tuning, and reinforcement learning), finding reinforcement learning the most effective.
Findings
Reinforcement learning best reduces hallucinations caused by modality conflict.
Supervised fine-tuning offers stable and promising performance.
The study introduces the first dataset specifically for modality conflict in MLLMs.
Abstract
Despite the impressive capabilities of multimodal large language models (MLLMs) in vision-language tasks, they are prone to hallucinations in real-world scenarios. This paper investigates the hallucination phenomenon in MLLMs from the perspective of modality conflict. Unlike existing works focusing on the conflicts between model responses and inputs, we study the inherent conflicts in inputs from different modalities that place MLLMs in a dilemma and directly lead to hallucinations. We formally define the modality conflict and construct a dataset named Multimodal Modality Conflict (MMMC) to simulate this phenomenon in vision-language tasks. Three methods based on prompt engineering, supervised fine-tuning, and reinforcement learning are proposed to alleviate the hallucination caused by modality conflict. Extensive experiments are conducted on the MMMC dataset to analyze the merits and…
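To make the idea of a conflict between input modalities concrete, here is a minimal Python sketch. It is an illustration only, not the paper's definition, data format, or prompt: it assumes one simple kind of conflict, where the text question takes for granted an object that the image does not contain. The `ConflictSample` record, the `has_modality_conflict` check, and the wording in `build_guarded_prompt` are all hypothetical, and the image is stood in for by a set of object names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConflictSample:
    """Hypothetical record pairing an image with a question about it."""
    image_objects: frozenset  # objects visible in the image (stand-in for pixels)
    question: str             # the text-side input
    presupposed_object: str   # object the question takes for granted


def has_modality_conflict(sample: ConflictSample) -> bool:
    """Assumed notion of conflict: the question presupposes an absent object."""
    return sample.presupposed_object not in sample.image_objects


def build_guarded_prompt(question: str) -> str:
    """Hypothetical prompt-engineering wrapper asking the model to check the premise."""
    return (
        "Before answering, check whether the question is consistent with the image. "
        "If it assumes something the image does not show, say so instead of answering.\n"
        f"Question: {question}"
    )


sample = ConflictSample(
    image_objects=frozenset({"cat", "sofa"}),
    question="What color is the dog's collar?",
    presupposed_object="dog",
)
print(has_modality_conflict(sample))
print(build_guarded_prompt(sample.question))
```

In this toy case the question asks about a dog while the image contains only a cat and a sofa, so a model that answers with a collar color would be hallucinating. The guarded prompt is one plausible shape for a prompt-engineering mitigation; fine-tuning and reinforcement learning approaches would instead change the model's weights.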
Taxonomy
Topics: Multimodal Machine Learning Applications · Ferroelectric and Negative Capacitance Devices · Advanced Graph Neural Networks
