Robust Multimodal Large Language Models Against Modality Conflict

Zongmeng Zhang; Wengang Zhou; Jie Zhao; Houqiang Li

arXiv:2507.07151 · cs.CV · July 11, 2025

TL;DR

This paper identifies modality conflict as a key cause of hallucinations in multimodal large language models and proposes methods to mitigate this issue, improving model robustness in vision-language tasks.

Contribution

The paper formally defines modality conflict, constructs the MMMC dataset, and evaluates three mitigation methods (prompt engineering, supervised fine-tuning, and reinforcement learning), finding reinforcement learning the most effective.

Findings

01 Reinforcement learning best reduces hallucinations caused by modality conflict.

02 Supervised fine-tuning offers stable and promising performance.

03 The study introduces the first dataset specifically for modality conflict in MLLMs.

Abstract

Despite the impressive capabilities of multimodal large language models (MLLMs) in vision-language tasks, they are prone to hallucinations in real-world scenarios. This paper investigates the hallucination phenomenon in MLLMs from the perspective of modality conflict. Unlike existing works focusing on the conflicts between model responses and inputs, we study the inherent conflicts in inputs from different modalities that place MLLMs in a dilemma and directly lead to hallucinations. We formally define the modality conflict and construct a dataset named Multimodal Modality Conflict (MMMC) to simulate this phenomenon in vision-language tasks. Three methods based on prompt engineering, supervised fine-tuning, and reinforcement learning are proposed to alleviate the hallucination caused by modality conflict. Extensive experiments are conducted on the MMMC dataset to analyze the merits and…

Taxonomy

Topics: Multimodal Machine Learning Applications · Ferroelectric and Negative Capacitance Devices · Advanced Graph Neural Networks