VLind-Bench: Measuring Language Priors in Large Vision-Language Models
Kang-il Lee, Minbeom Kim, Seunghyun Yoon, Minsung Kim, Dongryeol Lee, Hyukhun Koh, Kyomin Jung

TL;DR
This paper introduces VLind-Bench, a new benchmark designed to accurately measure language priors in large vision-language models, revealing widespread reliance on textual patterns over image content.
Contribution
The paper presents VLind-Bench, the first benchmark specifically targeting language priors in LVLMs, with comprehensive tests to disentangle priors from other factors.
Findings
Most LVLMs heavily rely on language priors.
Existing benchmarks inadequately measure language priors.
VLind-Bench effectively isolates language priors from other influences.
Abstract
Large Vision-Language Models (LVLMs) have demonstrated outstanding performance across various multimodal tasks. However, they suffer from a problem known as language prior, where responses are generated based solely on textual patterns while disregarding image information. Addressing the issue of language prior is crucial, as it can lead to undesirable biases or hallucinations when dealing with images that are out of training distribution. Despite its importance, current methods for accurately measuring language priors in LVLMs are poorly studied. Although existing benchmarks based on counterfactual or out-of-distribution images can partially be used to measure language priors, they fail to disentangle language priors from other confounding factors. To this end, we propose a new benchmark called VLind-Bench, which is the first benchmark specifically designed to measure the language…
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
