Seeing to Generalize: How Visual Data Corrects Binding Shortcuts
Nicolas Buzeta, Felipe del Rio, Cristian Hinostroza, Denis Parra, Hans Lobel, Rodrigo Toro Icarte

TL;DR
This paper demonstrates that training on visual data can improve a language model's generalization on text-only tasks by altering its internal binding strategy, leading to more robust reasoning.
Contribution
It shows that visual training changes a model's binding strategy, which improves out-of-distribution generalization in language models through cross-modal learning.
Findings
Visual training nearly doubles text-only out-of-distribution (OOD) performance on a synthetic retrieval task.
Visual training disrupts positional shortcuts, promoting robust symbolic binding.
Cross-modal training improves reasoning even in single-modality tasks.
Abstract
Vision Language Models (VLMs) are designed to extend Large Language Models (LLMs) with visual capabilities, yet in this work we observe a surprising phenomenon: VLMs can outperform their underlying LLMs on purely text-only tasks, particularly in long-context information retrieval. To investigate this effect, we build a controlled synthetic retrieval task and find that a transformer trained only on text achieves perfect in-distribution accuracy but fails to generalize out of distribution, while subsequent training on an image-tokenized version of the same task nearly doubles text-only OOD performance. Mechanistic interpretability reveals that visual training changes the model's internal binding strategy: text-only training encourages positional shortcuts, whereas image-based training disrupts them through spatial translation invariance, forcing the model to adopt a more robust symbolic…
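The contrast between a positional shortcut and symbolic binding can be illustrated with a toy sketch. This is not the paper's task or model: the key-value format, the slot numbers, and the names make_example, positional_solver, and symbolic_solver are all assumptions made for illustration. The sketch only shows why a strategy that memorizes where the answer sits can be perfect in distribution and still fail once the target moves, while a strategy that matches the query key is unaffected.

```python
import random

def make_example(rng, n_pairs, target_slot):
    """Build one toy retrieval example (hypothetical format, not the paper's)."""
    keys = rng.sample(range(1000), n_pairs)    # distinct keys
    values = rng.sample(range(1000), n_pairs)  # distinct values
    context = list(zip(keys, values))
    return context, keys[target_slot], values[target_slot]

def positional_solver(context, query, learned_slot):
    """Shortcut: ignore the query and read the value at a memorized slot."""
    return context[learned_slot][1]

def symbolic_solver(context, query):
    """Binding: return the value paired with the matching key, wherever it sits."""
    for key, value in context:
        if key == query:
            return value
    return None

def accuracy(solver, examples):
    hits = sum(solver(ctx, q) == ans for ctx, q, ans in examples)
    return hits / len(examples)

rng = random.Random(0)
# Assumed split: the target is always in slot 2 in distribution, slot 5 out of distribution.
in_dist = [make_example(rng, 8, target_slot=2) for _ in range(200)]
ood = [make_example(rng, 8, target_slot=5) for _ in range(200)]

def shortcut(ctx, q):
    return positional_solver(ctx, q, learned_slot=2)

pos_in, pos_ood = accuracy(shortcut, in_dist), accuracy(shortcut, ood)
sym_in, sym_ood = accuracy(symbolic_solver, in_dist), accuracy(symbolic_solver, ood)
print(pos_in, pos_ood, sym_in, sym_ood)
```

In this toy setting both strategies are indistinguishable on the in-distribution split, so in-distribution accuracy alone cannot reveal which one a model has adopted. Only the shifted split separates them, which is consistent with the abstract's use of OOD evaluation.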
Taxonomy
Topics: Multimodal Machine Learning Applications · Language and cultural evolution · Language, Metaphor, and Cognition
