Better Language Models Exhibit Higher Visual Alignment

Jona Ruthardt; Gertjan J. Burghouts; Serge Belongie; Yuki M. Asano

arXiv:2410.07173 · cs.CL · January 19, 2026

TL;DR

This paper investigates how well text-only large language models align with the visual world. It finds that decoder-based models show stronger visual alignment than encoder-based ones, and it proposes a lightweight method for fusing frozen vision and language backbones that needs little paired data and compute.

Contribution

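The study systematically evaluates visual alignment in language models by plugging their frozen representations into a discriminative vision-language framework, finds that decoder-based models align better than encoders, and introduces ShareLock, a simple fusion approach that performs well on vision-language tasks at low data and compute cost.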

Findings

01 Decoder-based models have stronger visual alignment than encoders.

02 Language modeling performance correlates with visual generalization.

03 ShareLock achieves high accuracy with minimal data and compute.

Abstract

How well do text-only large language models (LLMs) align with the visual world? We present a systematic evaluation of this question by incorporating frozen representations of various language models into a discriminative vision-language framework and measuring zero-shot generalization to novel concepts. We find that decoder-based models exhibit stronger visual alignment than encoders, even when controlling for model and dataset size. Moreover, language modeling performance correlates with visual generalization, suggesting that advances in unimodal LLMs can simultaneously improve vision models. Leveraging these insights, we propose ShareLock, a lightweight method for fusing frozen vision and language backbones. ShareLock achieves robust performance across tasks while drastically reducing the need for paired data and compute. With just 563k image-caption pairs and under one GPU-hour of…
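
The abstract describes fusing frozen vision and language backbones in a discriminative framework and then measuring zero-shot generalization, but it does not spell out the architecture or the loss. The following is a minimal sketch under stated assumptions rather than the authors' implementation: it assumes features from both backbones are precomputed, that the only trainable part is a single linear projection `W` on the language side, and that training uses a CLIP-style symmetric contrastive loss. The function names, the temperature value, and the single-matrix projection are all hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def contrastive_loss(image_feats, text_feats, W, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired features.

    image_feats: (batch, d_img) frozen vision features, used as-is.
    text_feats:  (batch, d_txt) frozen language-model features.
    W:           (d_txt, d_img) projection, assumed here to be the only
                 trainable parameter (a hypothetical simplification).
    """
    img = l2_normalize(image_feats)
    txt = l2_normalize(text_feats @ W)
    logits = img @ txt.T / temperature  # (batch, batch) similarity matrix
    idx = np.arange(len(img))           # matching pairs sit on the diagonal
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)


def zero_shot_predict(image_feats, class_text_feats, W):
    """Assign each image to the class whose projected text feature is closest."""
    img = l2_normalize(image_feats)
    cls = l2_normalize(class_text_feats @ W)
    return (img @ cls.T).argmax(axis=1)
```

In this sketch, zero-shot evaluation on novel concepts amounts to embedding one text prompt per unseen class with the frozen language model, projecting it with `W`, and calling `zero_shot_predict`. Because both backbones stay frozen, their features could be computed once and cached, which is one plausible reason such a setup would be cheap to train.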

Code & Models

Models

🤗 FunAILab/ShareLock (model)

Taxonomy

Topics: Natural Language Processing Techniques