Left-Right Symmetry Breaking in CLIP-style Vision-Language Models Trained on Synthetic Spatial-Relation Data

Takaki Yamamoto; Chihiro Noguchi; Toshihiro Tanizawa

arXiv:2601.12809·cs.CV·January 21, 2026

Open Access

TL;DR

This paper investigates how CLIP-style vision-language models learn spatial relations, specifically left-right understanding. It finds that label diversity drives generalization more than layout diversity, and that attention interactions induce symmetry breaking in the encoders.

Contribution

It introduces a controllable testbed for probing spatial understanding and uncovers the mechanisms behind how contrastive training enables relational learning in these models.

Findings

01. Contrastive training learns left-right relations.

02. Label diversity drives generalization more than layout diversity.

03. Attention interactions induce symmetry-breaking in encoders.

Abstract

Spatial understanding remains a key challenge in vision-language models. Yet it is still unclear whether such understanding is truly acquired, and if so, through what mechanisms. We present a controllable 1D image-text testbed to probe how left-right relational understanding emerges in Transformer-based vision and text encoders trained with a CLIP-style contrastive objective. We train lightweight Transformer-based vision and text encoders end-to-end on paired descriptions of one- and two-object scenes and evaluate generalization to unseen object pairs while systematically varying label and layout diversity. We find that contrastive training learns left-right relations and that label diversity, more than layout diversity, is the primary driver of generalization in this setting. To gain a mechanistic understanding, we perform an attention decomposition and show that interactions between…
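
For readers unfamiliar with the "CLIP-style contrastive objective" mentioned in the abstract, the following is a minimal NumPy sketch of a symmetric image-text contrastive loss. It is an illustration under stated assumptions, not the authors' implementation: the function names, the fixed temperature of 0.07, and the use of plain arrays in place of encoder outputs are all choices made here, and the paper may differ on each of them.

```python
import numpy as np


def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    image_emb, text_emb: arrays of shape (batch, dim); row i of each is
    assumed to come from the same image-caption pair.
    temperature: assumed fixed here (hypothetical choice for this sketch).
    """
    # L2-normalize so the dot product is a cosine similarity.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # logits[i, j] = scaled similarity between image i and caption j.
    logits = img @ txt.T / temperature
    idx = np.arange(logits.shape[0])

    # Image-to-text: each row should put its mass on the diagonal entry.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-image: each column should put its mass on the diagonal entry.
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(4, 8))
    aligned = clip_contrastive_loss(emb, emb)
    shuffled = clip_contrastive_loss(emb, np.roll(emb, 1, axis=0))
    print(f"aligned pairs: {aligned:.4f}  shuffled pairs: {shuffled:.4f}")
```

In a setting like the one the abstract describes, a caption naming the two objects in the wrong left-right order would act as a hard negative in the batch: the loss is only minimized if the two encoders produce embeddings that distinguish the two orderings.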

Taxonomy

Topics: Multimodal Machine Learning Applications · Neurobiology of Language and Bilingualism · Categorization, perception, and language