Better Language Models Exhibit Higher Visual Alignment
Jona Ruthardt, Gertjan J. Burghouts, Serge Belongie, Yuki M. Asano

TL;DR
This paper investigates how well text-only large language models align with the visual world. It finds that decoder-based models show stronger visual alignment than encoders, and it proposes a lightweight method for fusing frozen vision and language backbones that needs little paired data and compute.
Contribution
The study systematically evaluates visual alignment in language models by measuring zero-shot generalization to novel concepts, finds that decoder-based models align more strongly than encoders, and introduces ShareLock, a lightweight method for fusing frozen vision and language backbones.
Findings
Decoder-based models have stronger visual alignment than encoders.
Language modeling performance correlates with visual generalization.
ShareLock achieves robust performance across tasks with little paired data and compute.
Abstract
How well do text-only large language models (LLMs) align with the visual world? We present a systematic evaluation of this question by incorporating frozen representations of various language models into a discriminative vision-language framework and measuring zero-shot generalization to novel concepts. We find that decoder-based models exhibit stronger visual alignment than encoders, even when controlling for model and dataset size. Moreover, language modeling performance correlates with visual generalization, suggesting that advances in unimodal LLMs can simultaneously improve vision models. Leveraging these insights, we propose ShareLock, a lightweight method for fusing frozen vision and language backbones. ShareLock achieves robust performance across tasks while drastically reducing the need for paired data and compute. With just 563k image-caption pairs and under one GPU-hour of…
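The setup described in the abstract (frozen vision and language features, a discriminative objective, zero-shot evaluation) can be illustrated with a rough sketch. This is not the authors' implementation. The single linear projection `W`, the CLIP-style symmetric contrastive loss, the temperature of 0.07, and the random arrays standing in for precomputed backbone features are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def project_text(text_feats, W):
    """Hypothetical trainable head: a linear map from language-feature
    space into vision-feature space. Both backbones stay frozen."""
    return text_feats @ W

def contrastive_loss(img_feats, txt_feats, temperature=0.07):
    """Symmetric cross-entropy over image-text similarities (assumed
    objective; matching pairs sit on the diagonal of the logit matrix)."""
    logits = l2_normalize(img_feats) @ l2_normalize(txt_feats).T / temperature

    def diagonal_cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))

def zero_shot_predict(img_feats, class_txt_feats):
    """Assign each image to the class whose text embedding is most similar."""
    sims = l2_normalize(img_feats) @ l2_normalize(class_txt_feats).T
    return sims.argmax(axis=1)

# Random stand-ins for features precomputed once from frozen backbones.
rng = np.random.default_rng(0)
image_features = rng.normal(size=(8, 16))  # 8 images, 16-dim vision features
text_features = rng.normal(size=(8, 32))   # 8 captions, 32-dim language features
W = rng.normal(scale=0.1, size=(32, 16))   # the only trainable parameters here

loss = contrastive_loss(image_features, project_text(text_features, W))
```

Because the backbones are frozen in this sketch, their features can be cached once and only the small projection would be optimized, which is one plausible reason such a setup needs little compute.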
Taxonomy
Topics: Natural Language Processing Techniques
