Response Wide Shut: Surprising Observations in Basic Vision Language Model Capabilities

Shivam Chandhok; Wan-Cyuan Fan; Leonid Sigal

arXiv:2408.06721·cs.CV·August 14, 2024

Open Access

TL;DR

This paper investigates the fundamental limitations of state-of-the-art vision-language models in basic visual tasks, revealing surprising shortcomings and providing insights for future improvements beyond current benchmarks.

Contribution

The study introduces comprehensive tests probing core visual understanding skills and compares model components, uncovering unexpected weaknesses in VLMs not captured by standard benchmarks.

Findings

01

VLMs show limitations in object classification and spatial reasoning.

02

Intermediate features reveal more about model shortcomings than final outputs.

03

Probing different model components uncovers nascent response deficiencies.

Abstract

Vision-Language Models (VLMs) have emerged as general-purpose tools for addressing a variety of complex computer vision problems. Such models have been shown to be highly capable but, at the same time, to lack some basic visual understanding skills. In this paper, we set out to understand the limitations of SoTA VLMs on fundamental visual tasks: object classification, understanding spatial arrangement, and the ability to delineate individual object instances (through counting), by constructing a series of tests that probe which components of the design, specifically, may be lacking. Importantly, we go significantly beyond the current benchmarks, which simply measure the final performance of a VLM, by also comparing and contrasting it to the performance of probes trained directly on features obtained from the visual encoder (image embeddings), as well as the intermediate vision-language projection used to…
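The probing idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes a simple setup: a small linear classifier is trained on frozen features taken from two stages of a model, and the two held-out accuracies are compared. Everything here is hypothetical and not taken from the paper: the features are synthetic Gaussian vectors, the "projection" is a toy transform that discards one dimension, and the names `train_linear_probe`, `probe_accuracy`, and `evaluate` are illustrative only.

```python
import math
import random


def train_linear_probe(features, labels, epochs=200, lr=0.5):
    """Fit a binary logistic-regression probe on frozen features (batch gradient descent)."""
    dim = len(features[0])
    weights, bias = [0.0] * dim, 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            logit = sum(w * xi for w, xi in zip(weights, x)) + bias
            logit = max(-30.0, min(30.0, logit))  # clamp to keep exp() stable
            err = 1.0 / (1.0 + math.exp(-logit)) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def probe_accuracy(weights, bias, features, labels):
    """Fraction of examples where the sign of the probe logit matches the label."""
    correct = 0
    for x, y in zip(features, labels):
        logit = sum(w * xi for w, xi in zip(weights, x)) + bias
        correct += int((logit > 0) == (y == 1))
    return correct / len(labels)


# Synthetic stand-ins (ASSUMPTION: not the paper's data or models).
rng = random.Random(0)
encoder_feats = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(400)]
labels = [1 if x[0] + x[1] > 0 else 0 for x in encoder_feats]
# Hypothetical "projection" that discards the first feature dimension.
projected_feats = [[0.0] + x[1:] for x in encoder_feats]

split = 300  # first 300 examples train the probe, the rest evaluate it


def evaluate(feats):
    """Train a probe on the training split and report held-out accuracy."""
    w, b = train_linear_probe(feats[:split], labels[:split])
    return probe_accuracy(w, b, feats[split:], labels[split:])


acc_encoder = evaluate(encoder_feats)
acc_projected = evaluate(projected_feats)
print(f"probe on encoder features:   {acc_encoder:.2f}")
print(f"probe on projected features: {acc_projected:.2f}")
```

In this toy setting, a gap between the two accuracies indicates that information linearly available at the earlier stage is no longer linearly available at the later one. With a real model, the same comparison would use features extracted from the actual components, alongside the accuracy of the model's final responses.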


Taxonomy

Topics: Semantic Web and Ontologies

Methods: Sparse Evolutionary Training · BLIP: Bootstrapping Language-Image Pre-training