Modality Bias in LVLMs: Analyzing and Mitigating Object Hallucination via Attention Lens
Haohan Zheng, Zhenguo Zhang

TL;DR
This paper investigates modality bias in large vision-language models (LVLMs), showing that during hallucination they may ignore textual cues as well as visual ones, and proposes a training-free attention adjustment method to mitigate object hallucination.
Contribution
The study uncovers modality bias as a key factor in LVLM hallucination and introduces a simple, training-free attention adjustment technique to improve multimodal understanding.
Findings
Modality bias is prevalent in LVLMs: during hallucination they struggle to attend to the visual and textual modalities simultaneously.
Adjusting attention weights reduces object hallucination effectively.
The proposed method generalizes across multiple LVLMs and benchmarks.
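The summary does not spell out how the attention adjustment works, so the following is only a minimal sketch of one way a training-free, inference-time adjustment could look, not the authors' method. It assumes a single row of post-softmax attention weights and known index sets for visual and textual tokens. The names `rebalance_attention`, `modality_mass`, `visual_scale`, and `text_scale` are hypothetical and chosen for illustration.

```python
from typing import List, Sequence


def rebalance_attention(
    weights: Sequence[float],
    visual_idx: Sequence[int],
    text_idx: Sequence[int],
    visual_scale: float = 1.0,
    text_scale: float = 1.0,
) -> List[float]:
    """Rescale one attention row per modality, then renormalize to sum to 1.

    Illustrative assumption: weights are post-softmax and each token
    belongs to at most one of the two modalities.
    """
    visual, text = set(visual_idx), set(text_idx)
    if visual & text:
        raise ValueError("a token cannot be both visual and textual")

    scaled = []
    for i, w in enumerate(weights):
        if i in visual:
            scaled.append(w * visual_scale)
        elif i in text:
            scaled.append(w * text_scale)
        else:
            scaled.append(w)  # other tokens are left unscaled

    total = sum(scaled)
    if total <= 0:
        raise ValueError("attention row has no positive mass after scaling")
    return [w / total for w in scaled]


def modality_mass(weights: Sequence[float], idx: Sequence[int]) -> float:
    """Total attention assigned to the tokens of one modality."""
    return sum(weights[i] for i in idx)


# Toy attention row: three visual tokens followed by two text tokens.
row = [0.05, 0.05, 0.10, 0.40, 0.40]
visual_idx = [0, 1, 2]
text_idx = [3, 4]

# Upweight the under-attended visual tokens without any retraining.
adjusted = rebalance_attention(row, visual_idx, text_idx, visual_scale=2.0)
```

In this toy row the visual tokens start with 0.2 of the attention mass, and doubling their weights before renormalizing raises that share to one third. No model parameters change, which is the sense in which such an adjustment is training-free.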
Abstract
Large vision-language models (LVLMs) have demonstrated remarkable multimodal comprehension and reasoning capabilities, but they still suffer from severe object hallucination. Previous studies primarily attribute this flaw to linguistic priors caused by the scale mismatch between visual encoders and large language models (LLMs) in LVLMs. Specifically, because current LVLMs are built upon LLMs, they tend to over-rely on textual prompts and the internal knowledge of LLMs, generating descriptions inconsistent with visual cues. However, through an in-depth investigation of the hallucination mechanisms, we empirically reveal a previously overlooked phenomenon: LVLMs may ignore not only visual information but also the textual modality during hallucination. We term this behavior modality bias; it indicates that LVLMs struggle to attend to the visual and textual modalities simultaneously, leading to…
Taxonomy
Topics: Multimodal Machine Learning Applications · Ferroelectric and Negative Capacitance Devices · Subtitles and Audiovisual Media
