Understanding Hardness of Vision-Language Compositionality from A Token-level Causal Lens

Ziliang Chen; Tianang Xiao; Jusheng Zhang; Yongsen Zheng; Xipeng Chen

arXiv:2510.26302·cs.LG·October 31, 2025

TL;DR

This paper introduces a token-level causal framework to explain CLIP's struggles with compositional reasoning, revealing fundamental limitations in current contrastive learning approaches and guiding future improvements.

Contribution

It develops a token-aware causal theory for CLIP, providing the first principled explanation of its compositional brittleness and nonidentifiability issues at token granularity.

Findings

01. Token-level analysis explains CLIP's failure on compositional tasks.

02. Existence of pseudo-optimal encoders that ignore compositional differences.

03. Iterated composition operators increase hardness, affecting negative mining.

Abstract

Contrastive Language-Image Pre-training (CLIP) delivers strong cross-modal generalization by aligning images and texts in a shared embedding space, yet it persistently fails at compositional reasoning over objects, attributes, and relations, often behaving like a bag-of-words matcher. Prior causal accounts typically model text as a single vector, obscuring token-level structure and leaving core phenomena, such as prompt sensitivity and failures on hard negatives, unexplained. We address this gap with a token-aware causal representation learning (CRL) framework grounded in a sequential, language-token structural causal model (SCM). Our theory extends block identifiability to tokenized text, proving that CLIP's contrastive objective can recover the modal-invariant latent variable under both sentence-level and token-level SCMs. Crucially, token granularity yields the first principled explanation of CLIP's…
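To make the "bag-of-words matcher" behavior and the contrastive objective concrete, here is a minimal sketch in plain Python. It is not the paper's method or CLIP's actual encoder: `bag_of_words_embed`, `vocab`, the two captions, and the temperature value are all hypothetical choices made for illustration. The sketch shows a symmetric contrastive (InfoNCE-style) loss over matched image-text pairs, and a toy order-insensitive text encoder that maps a caption and its word-swapped hard negative to the same embedding.

```python
import math
from collections import Counter


def bag_of_words_embed(caption, vocab):
    """Hypothetical order-insensitive text encoder: L2-normalized token counts."""
    counts = Counter(caption.lower().split())
    vec = [float(counts[word]) for word in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def dot(u, v):
    """Dot product; equals cosine similarity when both inputs are unit-norm."""
    return sum(a * b for a, b in zip(u, v))


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE-style loss; pair k of images and texts is the positive.

    The temperature is an illustrative value, not taken from the source.
    """
    n = len(image_embs)
    logits = [[dot(i, t) / temperature for t in text_embs] for i in image_embs]

    def mean_cross_entropy(rows):
        total = 0.0
        for k, row in enumerate(rows):
            m = max(row)  # subtract the max for a numerically stable log-sum-exp
            lse = m + math.log(sum(math.exp(x - m) for x in row))
            total += lse - row[k]
        return total / n

    text_to_image = [list(col) for col in zip(*logits)]  # transpose
    return 0.5 * (mean_cross_entropy(logits) + mean_cross_entropy(text_to_image))


# Toy hard negative: same tokens, swapped roles.
vocab = ["a", "dog", "chases", "cat"]
e_pos = bag_of_words_embed("a dog chases a cat", vocab)
e_neg = bag_of_words_embed("a cat chases a dog", vocab)
similarity = dot(e_pos, e_neg)

# Toy batch: two perfectly aligned unit-norm pairs, then the same batch mismatched.
aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Because the toy encoder only counts tokens, the caption and its hard negative receive identical embeddings, so no image embedding can score one above the other. The loss itself behaves as expected on the toy batch: it is near zero for aligned pairs and large for mismatched ones.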
