Seeing to Generalize: How Visual Data Corrects Binding Shortcuts

Nicolas Buzeta; Felipe del Rio; Cristian Hinostroza; Denis Parra; Hans Lobel; Rodrigo Toro Icarte

arXiv:2602.15183 · cs.LG · February 18, 2026

TL;DR

This paper shows that training a language model on visual data can improve its generalization on text-only tasks: visual training changes the model's internal binding strategy, which leads to more robust reasoning.

Contribution

The paper shows how visual training changes a model's binding strategy, and how this cross-modal learning improves out-of-distribution generalization in language models.

Findings

01

Training on an image-tokenized version of a synthetic retrieval task nearly doubles text-only out-of-distribution (OOD) performance.

02

Visual training disrupts positional shortcuts, promoting robust symbolic binding.

03

Cross-modal training improves reasoning even in single-modality tasks.

Abstract

Vision Language Models (VLMs) are designed to extend Large Language Models (LLMs) with visual capabilities, yet in this work we observe a surprising phenomenon: VLMs can outperform their underlying LLMs on purely text-only tasks, particularly in long-context information retrieval. To investigate this effect, we build a controlled synthetic retrieval task and find that a transformer trained only on text achieves perfect in-distribution accuracy but fails to generalize out of distribution, while subsequent training on an image-tokenized version of the same task nearly doubles text-only OOD performance. Mechanistic interpretability reveals that visual training changes the model's internal binding strategy: text-only training encourages positional shortcuts, whereas image-based training disrupts them through spatial translation invariance, forcing the model to adopt a more robust symbolic…
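To make the contrast between a positional shortcut and symbolic binding concrete, here is a minimal sketch in Python. It is not the paper's task or model: the key-value setup, the slot-0 training distribution, and the names `make_example`, `positional_shortcut`, and `symbolic_binding` are all assumptions made for illustration, and the two "solvers" are hand-written functions rather than trained transformers.

```python
import random

# Hypothetical toy vocabulary; not taken from the paper.
KEYS = list("ABCDEFGH")
VALUES = list(range(10, 18))


def make_example(rng, n_pairs, target_slot):
    """Build a context of distinct (key, value) pairs and query the key in target_slot."""
    keys = rng.sample(KEYS, n_pairs)      # distinct keys
    values = rng.sample(VALUES, n_pairs)  # distinct values, so wrong slots give wrong answers
    context = list(zip(keys, values))
    return context, keys[target_slot], values[target_slot]


def positional_shortcut(context, query):
    """Ignore the query and always read the value stored in slot 0."""
    return context[0][1]


def symbolic_binding(context, query):
    """Find the pair whose key matches the query and return its value."""
    return dict(context)[query]


def accuracy(solver, examples):
    """Fraction of examples the solver answers correctly."""
    return sum(solver(c, q) == a for c, q, a in examples) / len(examples)


rng = random.Random(0)
# Assumed in-distribution split: the queried key always sits in slot 0.
in_dist = [make_example(rng, 4, 0) for _ in range(200)]
# Assumed OOD split: the queried key sits in any slot except 0.
ood = [make_example(rng, 4, rng.randrange(1, 4)) for _ in range(200)]

for name, solver in [("positional", positional_shortcut), ("symbolic", symbolic_binding)]:
    print(name, accuracy(solver, in_dist), accuracy(solver, ood))
```

Under these assumptions both strategies are indistinguishable on the in-distribution split, and only the positional one breaks when the queried key moves. That is the kind of gap the abstract describes between perfect in-distribution accuracy and poor OOD generalization.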

Taxonomy

Topics: Multimodal Machine Learning Applications · Language and cultural evolution · Language, Metaphor, and Cognition