Looking Beyond Text: Reducing Language Bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance

Haozhe Zhao; Shuzheng Si; Liang Chen; Yichi Zhang; Maosong Sun; Mingjia Zhang; Baobao Chang

arXiv:2411.14279·cs.CV·November 22, 2024

TL;DR

This paper introduces LACING, a framework that reduces language bias in large vision-language models by using multimodal dual-attention and soft-image guidance, improving visual understanding and decreasing hallucinations.

Contribution

The paper presents LACING, a systemic framework that combines a multimodal dual-attention mechanism (MDA) with soft-image guidance (IFG) to mitigate language bias in LVLMs without requiring additional training resources or data.

Findings

1. Significant reduction in hallucinations and improved visual comprehension.
2. Enhanced integration of visual inputs via the dual-attention mechanism.
3. Effective debiasing demonstrated through comprehensive experiments.

Abstract

Large vision-language models (LVLMs) have achieved impressive results in various vision-language tasks. However, despite showing promising performance, LVLMs suffer from hallucinations caused by language bias, leading to diminished focus on images and ineffective visual comprehension. We identify two primary reasons for this bias: 1. Different scales of training data between the pretraining stage of LLM and multimodal alignment stage. 2. The learned inference bias due to short-term dependency of text data. Therefore, we propose LACING, a systemic framework designed to address the language bias of LVLMs with muLtimodal duAl-attention meChanIsm (MDA) aNd soft-image Guidance (IFG). Specifically, MDA introduces a parallel dual-attention mechanism that enhances the integration of visual inputs across the model. IFG introduces a learnable soft visual prompt during training and inference to…
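
The abstract describes MDA only as a "parallel dual-attention mechanism", so the sketch below is an illustration of the general idea, not the paper's formulation. It assumes, purely for illustration, that a query attends to visual tokens and text tokens in two separate softmaxes whose outputs are summed, and contrasts that with ordinary joint attention where both modalities compete inside one softmax. The function names `attend`, `dual_attention`, and `joint_attention` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention of one query over one set of tokens."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def joint_attention(query, visual_tokens, text_tokens):
    """Baseline: visual and text tokens share a single softmax."""
    tokens = visual_tokens + text_tokens
    return attend(query, tokens, tokens)


def dual_attention(query, visual_tokens, text_tokens):
    """Assumed parallel variant: one softmax per modality, outputs summed.

    This is a hypothetical reading of "parallel dual-attention", not the
    paper's MDA definition.
    """
    visual_out = attend(query, visual_tokens, visual_tokens)
    text_out = attend(query, text_tokens, text_tokens)
    return [v + t for v, t in zip(visual_out, text_out)]


if __name__ == "__main__":
    query = [0.0, 0.0]
    visual = [[2.0, 0.0]]
    text = [[0.0, 4.0]]
    print(joint_attention(query, visual, text))
    print(dual_attention(query, visual, text))
```

In the joint version, attention mass given to text tokens is taken away from visual tokens. In the assumed parallel version, the visual branch always distributes a full unit of attention over the image tokens, which is one way a model could be kept from ignoring the image.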

Taxonomy

Topics: Multimodal Machine Learning Applications

Methods: Focus