LanP: Rethinking the Impact of Language Priors in Large Vision-Language Models

Zongyu Wu; Yuwei Niu; Hongcheng Gao; Minhua Lin; Zhiwei Zhang; Zhifang Zhang; Qi Shi; Yilong Wang; Sike Fu; Junjie Xu; Junjie Ao; Enyan Dai; Lei Feng; Xiang Zhang; Suhang Wang

arXiv:2502.12359 · cs.CV · February 19, 2025



Open Access

TL;DR

This paper introduces LanP, a benchmark to evaluate the strength of language priors in large vision-language models, revealing many models struggle with visual scenarios involving partial object occlusion.

Contribution

The paper proposes LanP, a new benchmark for assessing language priors in LVLMs, and provides extensive experimental analysis across 25 models to understand their limitations.

Findings

01

Many LVLMs have weak language priors in challenging visual scenarios.

02

Models such as GPT-4 Turbo score below 0.5 in accuracy on questions about partially occluded objects.

03

LanP reveals the need to balance language priors to improve LVLM robustness.
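
To make the "accuracy below 0.5" finding concrete, here is a minimal sketch of how per-model accuracy on a question set might be scored. It is not the authors' evaluation code: the `Record` type, the exact-match scoring rule, the yes/no answer format, and the toy records are all assumptions made for illustration, and the numbers are not results from the paper.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One benchmark question with a model's answer (hypothetical schema)."""
    image_id: str
    question_id: str
    expected: str
    predicted: str


def accuracy(records):
    """Fraction of records whose prediction matches the expected answer.

    Matching is case-insensitive exact match after trimming whitespace,
    which is an assumed scoring rule, not one stated in the source.
    """
    if not records:
        raise ValueError("no records to score")
    correct = sum(
        r.expected.strip().lower() == r.predicted.strip().lower()
        for r in records
    )
    return correct / len(records)


# Toy, made-up records (placeholders only, not data from LanP).
toy = [
    Record("img-001", "q1", "yes", "yes"),
    Record("img-001", "q2", "no", "yes"),
    Record("img-002", "q1", "yes", "no"),
    Record("img-002", "q2", "no", "No "),
    Record("img-003", "q1", "yes", "no"),
]

score = accuracy(toy)
below_half = score < 0.5  # the threshold mentioned in the findings
print(f"accuracy={score:.2f}, below 0.5: {below_half}")
```

A real evaluation would likely need a more forgiving answer parser, since free-form model outputs rarely match a reference string exactly.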

Abstract

Large Vision-Language Models (LVLMs) have shown impressive performance in various tasks. However, LVLMs suffer from hallucination, which hinders their adoption in the real world. Existing studies emphasized that the strong language priors of LVLMs can overpower visual information, causing hallucinations. However, the positive role of language priors is the key to a powerful LVLM. If the language priors are too weak, LVLMs will struggle to leverage rich parameter knowledge and instruction understanding abilities to complete tasks in challenging visual scenarios where visual information alone is insufficient. Therefore, we propose a benchmark called LanP to rethink the impact of Language Priors in LVLMs. It is designed to investigate how strong language priors are in current LVLMs. LanP consists of 170 images and 340 corresponding well-designed questions. Extensive experiments on 25…


Taxonomy

Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications · Speech and dialogue systems