Probing Visual Language Priors in VLMs

Tiange Luo; Ang Cao; Gunhee Lee; Justin Johnson; Honglak Lee

arXiv:2501.00569·cs.CV·April 15, 2025

TL;DR

This paper introduces ViLP, a benchmark that evaluates visual reasoning in VLMs with out-of-distribution images and questions. It shows that these models over-rely on language priors and proposes a self-training method to improve their visual reasoning.

Contribution

The paper presents ViLP, a novel benchmark for testing visual reasoning in VLMs, and a self-improving training framework that enhances models' focus on visual inputs.

Findings

01

GPT-4 scores only 66.17% on ViLP, while humans achieve near-perfect accuracy.

02

Self-training with generated data boosts VLM performance.

03

Models trained with the proposed method outperform baseline models on ViLP.

Abstract

Despite recent advances in Vision-Language Models (VLMs), they may over-rely on visual language priors existing in their training data rather than true visual reasoning. To investigate this, we introduce ViLP, a benchmark featuring deliberately out-of-distribution images synthesized via image generation models and out-of-distribution Q&A pairs. Each question in ViLP is coupled with three potential answers and three corresponding images: one that can be resolved by text priors alone and two that demand visual reasoning. Although humans achieve near-perfect accuracy, modern VLMs falter; for instance, GPT-4 achieves only 66.17% on ViLP. To alleviate this, we propose a self-improving framework in which models generate new VQA data, then apply pixel-level and semantic corruptions to form "good-bad" image pairs for self-training. Our training objectives compel VLMs to focus more on the…
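
The benchmark layout described in the abstract (one question, three images, three answers, with one image answerable from text priors alone) can be sketched as a small scoring routine. This is only an illustration under stated assumptions: the names `Item`, `score`, and `text_only_model`, the convention that the first pair is the prior-answerable one, the exact-match comparison, and the toy question and image IDs are all hypothetical and are not taken from the paper or its released code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Item:
    """One benchmark question with its three (image_id, gold_answer) pairs.

    Assumption for this sketch: pairs[0] is the image that text priors
    alone can resolve; pairs[1:] are the images that need visual reasoning.
    """
    question: str
    pairs: List[Tuple[str, str]]


def score(items: List[Item], predict: Callable[[str, str], str]) -> Dict[str, float]:
    """Return exact-match accuracy separately for 'prior' and 'visual' images."""
    hits = {"prior": 0, "visual": 0}
    totals = {"prior": 0, "visual": 0}
    for item in items:
        for idx, (image_id, gold) in enumerate(item.pairs):
            split = "prior" if idx == 0 else "visual"
            totals[split] += 1
            answer = predict(image_id, item.question)
            if answer.strip().lower() == gold.strip().lower():
                hits[split] += 1
    return {k: (hits[k] / totals[k] if totals[k] else 0.0) for k in totals}


def text_only_model(image_id: str, question: str) -> str:
    """Hypothetical stand-in for a model that ignores the image entirely."""
    return "giraffe"


# Toy data, invented for illustration only.
items = [
    Item(
        question="Which animal in the image has the longest neck?",
        pairs=[("img_a", "giraffe"), ("img_b", "elephant"), ("img_c", "cat")],
    )
]

result = score(items, text_only_model)
print(result)
```

Reporting the two splits separately makes the failure mode visible: a model that answers from language priors alone can be fully correct on the prior-answerable image while missing both images that require looking at the picture.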

Videos

Probing Visual Language Priors in VLMs · SlidesLive

Taxonomy

Topics: Speech and dialogue systems · Text Readability and Simplification · Natural Language Processing Techniques

Methods: Attention Is All You Need · Byte Pair Encoding · Dense Connections · Absolute Position Encodings · Dropout · Linear Layer · Softmax · Adam · Residual Connection · Multi-Head Attention