Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs

Shengbang Tong; Zhuang Liu; Yuexiang Zhai; Yi Ma; Yann LeCun; Saining Xie

arXiv:2401.06209 · cs.CV · April 26, 2024 · 6 cites

TL;DR

This paper investigates the visual shortcomings of multimodal large language models, introduces a benchmark to evaluate their visual reasoning, and proposes a method to improve their visual grounding by integrating self-supervised visual features.

Contribution

It identifies systematic visual shortcomings in current multimodal LLMs, creates the MMVP benchmark, and proposes a Mixture of Features approach to enhance visual grounding.

Findings

01. State-of-the-art models struggle with basic visual patterns.

02. The MMVP benchmark reveals specific visual reasoning errors.

03. Integrating self-supervised visual features improves visual grounding.
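As a rough illustration of what "integrating self-supervised visual features" could look like in practice, the sketch below combines per-token features from a CLIP-style encoder with features from a vision-only self-supervised encoder in two simple ways: a weighted sum and a token-wise interleave. The function names, the `alpha` parameter, and the assumption that both feature sets already share one shape are hypothetical choices made for this example; the summary above does not specify how the Mixture of Features approach combines the two sources.

```python
import numpy as np


def additive_mix(clip_tokens, ssl_tokens, alpha=0.5):
    """Weighted sum of two token-feature arrays of identical shape (n_tokens, dim).

    alpha is the share given to the self-supervised features (illustrative value).
    """
    if clip_tokens.shape != ssl_tokens.shape:
        raise ValueError("both feature arrays must have the same shape")
    return alpha * ssl_tokens + (1.0 - alpha) * clip_tokens


def interleave_mix(clip_tokens, ssl_tokens):
    """Alternate tokens from the two sources: clip[0], ssl[0], clip[1], ssl[1], ..."""
    if clip_tokens.shape != ssl_tokens.shape:
        raise ValueError("both feature arrays must have the same shape")
    n_tokens, dim = clip_tokens.shape
    mixed = np.empty((2 * n_tokens, dim), dtype=np.result_type(clip_tokens, ssl_tokens))
    mixed[0::2] = clip_tokens  # even positions hold the CLIP-style tokens
    mixed[1::2] = ssl_tokens   # odd positions hold the self-supervised tokens
    return mixed
```

The weighted sum keeps the token count unchanged and trades one feature source against the other, while the interleave keeps both sets intact at the cost of doubling the sequence length passed downstream.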

Abstract

Is vision good enough for language? Recent advancements in multimodal models primarily stem from the powerful reasoning abilities of large language models (LLMs). However, the visual component typically depends only on the instance-level contrastive language-image pre-training (CLIP). Our research reveals that the visual capabilities in recent multimodal LLMs (MLLMs) still exhibit systematic shortcomings. To understand the roots of these errors, we explore the gap between the visual embedding space of CLIP and vision-only self-supervised learning. We identify "CLIP-blind pairs": images that CLIP perceives as similar despite their clear visual differences. With these pairs, we construct the Multimodal Visual Patterns (MMVP) benchmark. MMVP exposes areas where state-of-the-art systems, including GPT-4V, struggle with straightforward questions across nine basic visual patterns, often…
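The idea of a CLIP-blind pair can be made concrete with a small sketch: given two sets of precomputed embeddings for the same images, flag the pairs whose CLIP cosine similarity is high while their similarity under a vision-only self-supervised encoder is low. The function names and the threshold values `clip_min` and `ssl_max` are assumptions chosen for illustration; the abstract does not state how similarity is measured or where the cutoffs lie.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def find_clip_blind_pairs(clip_emb, ssl_emb, clip_min=0.95, ssl_max=0.6):
    """Return index pairs (i, j) with i < j that look alike to CLIP but not to the SSL encoder.

    clip_emb, ssl_emb: arrays of shape (n_images, dim); row i in both describes image i.
    clip_min, ssl_max: illustrative thresholds (assumptions, not taken from the abstract).
    """
    clip_unit = l2_normalize(np.asarray(clip_emb, dtype=float))
    ssl_unit = l2_normalize(np.asarray(ssl_emb, dtype=float))
    clip_sim = clip_unit @ clip_unit.T  # pairwise cosine similarity under CLIP
    ssl_sim = ssl_unit @ ssl_unit.T     # pairwise cosine similarity under the SSL encoder
    mask = (clip_sim > clip_min) & (ssl_sim < ssl_max)
    # Keep only the strict upper triangle so each pair appears once and i != j.
    i_idx, j_idx = np.where(np.triu(mask, k=1))
    return [(int(i), int(j)) for i, j in zip(i_idx, j_idx)]
```

In this sketch the embeddings are taken as given; producing them would require running each image through both encoders first, and the full pairwise matrix limits it to modest collection sizes.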

Code & Models

Repositories

tsb0601/MMVP (PyTorch, official)

Datasets

lmms-lab-eval/MMVP (dataset, 675 downloads)

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques

Methods: Contrastive Language-Image Pre-training