Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, Saining, Xie

TL;DR
This paper investigates the visual shortcomings of multimodal large language models, introduces a benchmark to evaluate their visual reasoning, and proposes a method to improve their visual grounding by integrating self-supervised visual features.
Contribution
It identifies systematic visual shortcomings in current multimodal LLMs, creates the MMVP benchmark, and proposes a Mixture of Features approach to enhance visual grounding.
Findings
State-of-the-art models struggle with basic visual patterns.
MMVP benchmark reveals specific visual reasoning errors.
Integrating self-supervised visual features improves visual grounding.
Abstract
Is vision good enough for language? Recent advancements in multimodal models primarily stem from the powerful reasoning abilities of large language models (LLMs). However, the visual component typically depends only on the instance-level contrastive language-image pre-training (CLIP). Our research reveals that the visual capabilities in recent multimodal LLMs (MLLMs) still exhibit systematic shortcomings. To understand the roots of these errors, we explore the gap between the visual embedding space of CLIP and vision-only self-supervised learning. We identify ''CLIP-blind pairs'' - images that CLIP perceives as similar despite their clear visual differences. With these pairs, we construct the Multimodal Visual Patterns (MMVP) benchmark. MMVP exposes areas where state-of-the-art systems, including GPT-4V, struggle with straightforward questions across nine basic visual patterns, often…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
- 🤗google/paligemma-3b-pt-224model· 86k dl· ♡ 42686k dl♡ 426
- 🤗google/paligemma-3b-mix-448model· 2.9k dl· ♡ 1162.9k dl♡ 116
- 🤗StarCycle/llava-dinov2-internlm2-7b-v1model· ♡ 8♡ 8
- 🤗google/paligemma-3b-pt-224-jaxmodel· 205 dl· ♡ 3205 dl♡ 3
- 🤗google/paligemma-3b-pt-448-jaxmodel· 2 dl· ♡ 22 dl♡ 2
- 🤗google/paligemma-3b-pt-896-jaxmodel· ♡ 2♡ 2
- 🤗google/paligemma-3b-ft-aokvqa-mc-448-jaxmodel
- 🤗google/paligemma-3b-ft-textcaps-224-jaxmodel
- 🤗google/paligemma-3b-ft-widgetcap-448-jaxmodel· ♡ 2♡ 2
- 🤗google/paligemma-3b-ft-vqav2-448-jaxmodel· 1 dl· ♡ 21 dl♡ 2
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
MethodsContrastive Language-Image Pre-training
