Understanding Hardness of Vision-Language Compositionality from A Token-level Causal Lens
Ziliang Chen, Tianang Xiao, Jusheng Zhang, Yongsen Zheng, Xipeng Chen

TL;DR
This paper introduces a token-level causal framework to explain CLIP's struggles with compositional reasoning, revealing fundamental limitations in current contrastive learning approaches and guiding future improvements.
Contribution
It develops a token-aware causal theory for CLIP, providing the first principled explanation of its compositional brittleness and nonidentifiability issues at token granularity.
Findings
Token-level analysis explains CLIP's failure on compositional tasks.
There exist pseudo-optimal encoders that ignore compositional differences.
Iterated composition operators increase hardness, affecting negative mining.
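To make the second finding concrete, here is a toy sketch of an order-insensitive text encoder. It is not the paper's construction: the `bag_of_words_embed` function, the example captions, and the word swap used as a composition operator are all hypothetical choices made for illustration. It shows only that an encoder built from token counts assigns the same embedding to a caption and to its swapped hard negative.

```python
import numpy as np

def bag_of_words_embed(caption, vocab):
    """Hypothetical order-insensitive encoder: L2-normalized token counts."""
    vec = np.zeros(len(vocab))
    for token in caption.lower().split():
        vec[vocab[token]] += 1.0
    return vec / np.linalg.norm(vec)

# Illustrative caption and a hard negative built by swapping two nouns.
caption = "a dog chasing a cat"
hard_negative = "a cat chasing a dog"

# Shared vocabulary over both captions.
tokens = sorted(set(caption.split()) | set(hard_negative.split()))
vocab = {tok: i for i, tok in enumerate(tokens)}

emb_pos = bag_of_words_embed(caption, vocab)
emb_neg = bag_of_words_embed(hard_negative, vocab)

# Both captions contain the same tokens, so the embeddings coincide.
cosine = float(emb_pos @ emb_neg)
print(f"cosine(caption, hard_negative) = {cosine:.6f}")
```

Because the two embeddings coincide, any image scores identically against both captions, so a similarity-based matcher of this kind cannot rank the correct caption above the swapped one.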
Abstract
Contrastive Language-Image Pre-training (CLIP) delivers strong cross-modal generalization by aligning images and texts in a shared embedding space, yet it persistently fails at compositional reasoning over objects, attributes, and relations, often behaving like a bag-of-words matcher. Prior causal accounts typically model text as a single vector, obscuring token-level structure and leaving core phenomena, such as prompt sensitivity and failures on hard negatives, unexplained. We address this gap with a token-aware causal representation learning (CRL) framework grounded in a sequential, language-token structural causal model (SCM). Our theory extends block identifiability to tokenized text, proving that CLIP's contrastive objective can recover the modal-invariant latent variable under both sentence-level and token-level SCMs. Crucially, token granularity yields the first principled explanation of CLIP's…
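For readers unfamiliar with the contrastive objective the abstract refers to, the following is a minimal sketch of the commonly used symmetric InfoNCE form of the CLIP loss. The abstract does not spell out the exact formulation analyzed in the paper, so treat this as an assumption: the function names, the fixed `temperature=0.07`, and the random toy embeddings are illustrative choices, not the authors' code.

```python
import numpy as np

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss; row i of each input is a matched pair.

    The temperature is fixed here for simplicity (an assumption of this sketch).
    """
    # Normalize so that dot products are cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature
    idx = np.arange(logits.shape[0])
    # Image-to-text: each row should pick out its own column.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-image: each column should pick out its own row.
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)

# Toy check: perfectly aligned pairs versus deliberately mismatched pairs.
rng = np.random.default_rng(0)
aligned = rng.normal(size=(4, 8))
loss_aligned = clip_contrastive_loss(aligned, aligned)
loss_shuffled = clip_contrastive_loss(aligned, aligned[::-1])
print(loss_aligned < loss_shuffled)
```

The loss depends on the text only through a single pooled embedding per caption, which is why sentence-level analyses can miss token-level effects such as the hard-negative failures described above.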
