Left-Right Symmetry Breaking in CLIP-style Vision-Language Models Trained on Synthetic Spatial-Relation Data
Takaki Yamamoto, Chihiro Noguchi, Toshihiro Tanizawa

TL;DR
This paper investigates how CLIP-style vision-language models learn spatial relations, especially left-right understanding. It finds that label diversity drives generalization more than layout diversity does, and that attention interactions break left-right symmetry in the encoders.
Contribution
It introduces a controllable testbed for probing spatial understanding and uncovers the mechanisms behind how contrastive training enables relational learning in these models.
Findings
Contrastive training learns left-right relations.
Label diversity drives generalization more than layout diversity.
Attention interactions induce symmetry-breaking in encoders.
Abstract
Spatial understanding remains a key challenge in vision-language models. Yet it is still unclear whether such understanding is truly acquired, and if so, through what mechanisms. We present a controllable 1D image-text testbed to probe how left-right relational understanding emerges in Transformer-based vision and text encoders trained with a CLIP-style contrastive objective. We train lightweight Transformer-based vision and text encoders end-to-end on paired descriptions of one- and two-object scenes and evaluate generalization to unseen object pairs while systematically varying label and layout diversity. We find that contrastive training learns left-right relations and that label diversity, more than layout diversity, is the primary driver of generalization in this setting. To gain mechanistic understanding, we perform an attention decomposition and show that interactions between…
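For readers unfamiliar with the "CLIP-style contrastive objective" mentioned in the abstract, the following is a minimal sketch of the standard symmetric contrastive (InfoNCE) loss over a batch of matched image-text pairs. It is not taken from the paper: the function names, the temperature value of 0.07, and the toy embeddings are assumptions chosen for illustration, and the authors' actual implementation may differ.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over cosine-similarity logits.

    image_emb[k] and text_emb[k] are assumed to be a matched pair;
    every other combination in the batch acts as a negative.
    The temperature default is an illustrative assumption.
    """
    imgs = [normalize(v) for v in image_emb]
    txts = [normalize(v) for v in text_emb]
    n = len(imgs)
    # logits[i][j] = cosine similarity of image i and text j, scaled.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    logits_t = [list(col) for col in zip(*logits)]
    # Image-to-text: each image should pick out its own caption.
    loss_i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image: each caption should pick out its own image.
    loss_t2i = -sum(log_softmax(logits_t[k])[k] for k in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)

# Hypothetical 2-D embeddings for two mirrored scenes and their captions.
images = [[1.0, 0.0], [0.0, 1.0]]
captions_matched = [[1.0, 0.0], [0.0, 1.0]]
captions_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = clip_contrastive_loss(images, captions_matched)
loss_swapped = clip_contrastive_loss(images, captions_swapped)
```

In this toy setup the loss is low only when each scene embedding sits closest to its own caption. Swapping the captions, as would happen if a model confused a scene with its mirror image, raises the loss sharply, which is the pressure that can push encoders to distinguish left from right.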
Taxonomy
Topics: Multimodal Machine Learning Applications · Neurobiology of Language and Bilingualism · Categorization, perception, and language
