Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance
Haozhe Zhao, Shuzheng Si, Liang Chen, Yichi Zhang, Maosong Sun, Mingjia Zhang, Baobao Chang

TL;DR
This paper introduces LACING, a framework that reduces language bias in large vision-language models by using multimodal dual-attention and soft-image guidance, improving visual understanding and decreasing hallucinations.
Contribution
The paper presents a systemic framework that combines a multimodal dual-attention mechanism with soft-image guidance to mitigate language bias in LVLMs without requiring additional data or training resources.
Findings
Significant reduction in hallucinations and improved visual comprehension.
Enhanced integration of visual inputs via dual-attention mechanism.
Effective debiasing demonstrated through comprehensive experiments.
Abstract
Large vision-language models (LVLMs) have achieved impressive results in various vision-language tasks. However, despite showing promising performance, LVLMs suffer from hallucinations caused by language bias, leading to diminished focus on images and ineffective visual comprehension. We identify two primary reasons for this bias: 1. Different scales of training data between the pretraining stage of LLM and multimodal alignment stage. 2. The learned inference bias due to short-term dependency of text data. Therefore, we propose LACING, a systemic framework designed to address the language bias of LVLMs with muLtimodal duAl-attention meChanIsm (MDA) aNd soft-image Guidance (IFG). Specifically, MDA introduces a parallel dual-attention mechanism that enhances the integration of visual inputs across the model. IFG introduces a learnable soft visual prompt during training and inference to…
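The abstract does not spell out how MDA combines its two attention paths, so the following is only an illustrative sketch under stated assumptions, not the paper's formulation. It contrasts ordinary joint attention, where a few image tokens compete with many text tokens inside one softmax, with a hypothetical parallel variant that normalizes attention over image tokens and text tokens separately and sums the two outputs. All function names, shapes, and the summation rule are assumptions made for illustration; causal masking and multi-head projections are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(q, k, v):
    # Plain scaled dot-product attention; returns (output, weights).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = softmax(scores, axis=-1)
    return w @ v, w


def joint_attention(q, k_img, v_img, k_txt, v_txt):
    # Baseline: image and text tokens share a single softmax.
    k = np.concatenate([k_img, k_txt], axis=0)
    v = np.concatenate([v_img, v_txt], axis=0)
    out, w = attend(q, k, v)
    img_mass = w[:, : k_img.shape[0]].sum(axis=-1)  # attention mass on image tokens
    return out, img_mass


def parallel_dual_attention(q, k_img, v_img, k_txt, v_txt):
    # Hypothetical parallel variant (an assumption, not the paper's method):
    # each modality gets its own softmax, so a long text context cannot
    # shrink the total weight assigned to image tokens.
    out_img, w_img = attend(q, k_img, v_img)
    out_txt, _ = attend(q, k_txt, v_txt)
    return out_img + out_txt, w_img.sum(axis=-1)
```

In this toy setup, the attention mass on image tokens in the joint version falls as the text context grows, while the parallel version keeps it at 1 for every query by construction. That is one way to read "enhances the integration of visual inputs", offered here only as intuition.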
Taxonomy
Topics: Multimodal Machine Learning Applications
Methods: Focus
