Text-Guided Layer Fusion Mitigates Hallucination in Multimodal LLMs
Chenchen Lin, Sanbao Su, Rachel Luo, Yuxiao Chen, Yan Wang, Marco Pavone, Fei Miao

TL;DR
This paper introduces TGIF, a lightweight, query-dependent fusion module that enhances visual grounding in multimodal LLMs by leveraging hierarchical visual features, thereby reducing hallucinations and improving performance across multiple benchmarks.
Contribution
The paper proposes TGIF, a prompt-dependent inter-layer fusion method that exploits the vision encoder's feature hierarchy without updating the encoder, with the aim of reducing hallucination in MLLMs.
Findings
TGIF reduces hallucinations in MLLMs.
Improves performance on OCR and VQA benchmarks.
Maintains or enhances scores on ScienceQA, GQA, MMBench.
Abstract
Multimodal large language models (MLLMs) typically rely on a single late-layer feature from a frozen vision encoder, leaving the encoder's rich hierarchy of visual cues under-utilized. MLLMs still suffer from visually ungrounded hallucinations, often relying on language priors rather than image evidence. While many prior mitigation strategies operate on the text side, they leave the visual representation unchanged and do not exploit the rich hierarchy of features encoded across vision layers. Existing multi-layer fusion methods partially address this limitation but remain static, applying the same layer mixture regardless of the query. In this work, we introduce TGIF (Text-Guided Inter-layer Fusion), a lightweight module that treats encoder layers as depth-wise "experts" and predicts a prompt-dependent fusion of visual features. TGIF follows the principle of direct external fusion,…
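The idea of treating encoder layers as depth-wise "experts" with a prompt-dependent mixture can be illustrated with a minimal sketch. This is not the authors' implementation: the linear router, the softmax gate, the function names (`layer_weights`, `fuse_layers`), and the toy shapes are all assumptions made for illustration, and the abstract excerpt above does not specify how TGIF computes its weights.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def layer_weights(text_embedding, router):
    """Map a prompt embedding to one mixing weight per encoder layer.

    `router` holds one weight vector per layer. A linear gate is an
    assumption of this sketch, not a detail taken from the paper.
    """
    scores = [sum(w * t for w, t in zip(row, text_embedding)) for row in router]
    return softmax(scores)


def fuse_layers(layer_features, weights):
    """Weighted sum over layers.

    layer_features[l][p][d] is the feature of patch p at layer l.
    Returns a fused [patch][dim] feature map.
    """
    num_patches = len(layer_features[0])
    dim = len(layer_features[0][0])
    fused = [[0.0] * dim for _ in range(num_patches)]
    for feats, w in zip(layer_features, weights):
        for p in range(num_patches):
            for d in range(dim):
                fused[p][d] += w * feats[p][d]
    return fused


# Toy data: 3 encoder layers, 2 patches, 2 feature dimensions (all hypothetical).
layer_features = [
    [[1.0, 0.0], [1.0, 0.0]],  # layer 0
    [[0.0, 1.0], [0.0, 1.0]],  # layer 1
    [[1.0, 1.0], [1.0, 1.0]],  # layer 2
]
router = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # hypothetical gate parameters

prompt_a = [2.0, 0.0]  # stand-in embedding for one query
prompt_b = [0.0, 2.0]  # stand-in embedding for a different query

w_a = layer_weights(prompt_a, router)
w_b = layer_weights(prompt_b, router)
fused_a = fuse_layers(layer_features, w_a)
fused_b = fuse_layers(layer_features, w_b)
```

In this toy setup the two prompts produce different layer mixtures from the same frozen features, whereas a static fusion scheme would apply one fixed `weights` list to every query.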
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Multimodal Machine Learning Applications · Face Recognition and Perception
