LLMs Can Compensate for Deficiencies in Visual Representations

Sho Takishita; Jay Gala; Abdelrahman Mohamed; Kentaro Inui; Yova Kementchedjhieva

arXiv:2506.05439 · cs.CV · September 22, 2025

TL;DR

This paper explores how language components in vision-language models can compensate for weak visual features, revealing a dynamic division of labor that enhances multimodal task performance.

Contribution

It demonstrates that language backbones in VLMs can offset limitations in visual representations, suggesting new architectural directions.

Findings

01. CLIP visual features contain accessible semantic information

02. Language decoders can compensate for visual deficiencies

03. Reduced visual contextualization can be mitigated by language models

Abstract

Many vision-language models (VLMs) that prove very effective at a range of multimodal tasks build on CLIP-based vision encoders, which are known to have various limitations. We investigate the hypothesis that the strong language backbone in VLMs compensates for possibly weak visual features by contextualizing or enriching them. Using three CLIP-based VLMs, we perform controlled self-attention ablations on a carefully designed probing task. Our findings show that despite known limitations, CLIP visual representations offer ready-to-read semantic information to the language decoder. However, in scenarios of reduced contextualization in the visual representations, the language decoder can largely compensate for the deficiency and recover performance. This suggests a dynamic division of labor in VLMs and motivates future architectures that offload more visual processing to the language…
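The abstract does not spell out which attention connections the authors ablate, so the following is only a minimal sketch of what a self-attention ablation can look like in general, not the paper's actual setup. It assumes a single attention head, uses the inputs directly as queries, keys, and values, and blocks attention between distinct visual tokens; the function names `masked_self_attention` and `visual_ablation_mask` and the toy inputs are hypothetical.

```python
import math


def masked_self_attention(queries, keys, values, blocked):
    """Single-head scaled dot-product attention with an ablation mask.

    blocked[i][j] == True means position i may NOT attend to position j.
    Every position must keep at least one unblocked target.
    """
    dim = len(queries[0])
    outputs = []
    for i, q in enumerate(queries):
        # Blocked pairs get a score of -inf, so their softmax weight is 0.
        scores = [
            -math.inf if blocked[i][j]
            else sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for j, k in enumerate(keys)
        ]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs


def visual_ablation_mask(is_visual):
    """Hypothetical ablation: visual tokens cannot attend to other visual tokens."""
    n = len(is_visual)
    return [
        [is_visual[i] and is_visual[j] and i != j for j in range(n)]
        for i in range(n)
    ]


# Toy sequence (assumed): two visual tokens followed by one text token.
is_visual = [True, True, False]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

mask = visual_ablation_mask(is_visual)
ablated = masked_self_attention(tokens, tokens, tokens, mask)
```

Under this mask each visual token can still attend to itself and to the text token, so the ablation removes only visual-to-visual contextualization. Comparing probe accuracy with and without such a mask is one way to measure how much a model relies on that contextualization.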

Taxonomy

Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Domain Adaptation and Few-Shot Learning