CLIP Behaves like a Bag-of-Words Model Cross-modally but not Uni-modally
Darina Koishigarina, Arnas Uselis, Seong Joon Oh

TL;DR
This paper reveals that CLIP encodes attribute-object bindings within its embeddings but fails to align them cross-modally, and demonstrates that a simple linear transformation can improve its binding performance without retraining the encoders.
Contribution
The study identifies cross-modal alignment as CLIP's weak point and proposes a linear transformation that exposes the binding information already encoded in the embeddings, improving cross-modal binding performance.
Findings
CLIP encodes attribute-object bindings unimodally.
Cross-modal alignment fails to preserve binding information.
A linear transformation improves cross-modal binding performance.
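The third finding can be illustrated with a toy sketch: if matched image and text embeddings carry the same information but sit in misaligned coordinate systems, a single linear map applied to one modality can restore retrieval while both encoders stay frozen. Everything below is an assumption made for illustration: the embeddings are synthetic Gaussian vectors rather than CLIP outputs, the names `fit_linear_map` and `retrieval_accuracy` are hypothetical, and the map is fit by ordinary least squares, which is a stand-in and not necessarily the training objective the paper uses.

```python
import numpy as np

def l2_normalize(x):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def retrieval_accuracy(image_emb, text_emb, W=None):
    # Fraction of images whose most similar caption is the matching one
    # (row i of image_emb is paired with row i of text_emb).
    if W is not None:
        text_emb = text_emb @ W  # linear map applied on top of frozen text embeddings
    sims = l2_normalize(image_emb) @ l2_normalize(text_emb).T
    return float(np.mean(np.argmax(sims, axis=1) == np.arange(len(image_emb))))

def fit_linear_map(text_emb, image_emb):
    # Least-squares W such that text_emb @ W approximates image_emb.
    # Assumption: a stand-in objective chosen for simplicity.
    W, *_ = np.linalg.lstsq(text_emb, image_emb, rcond=None)
    return W

# Synthetic stand-in for frozen embeddings: the "image" side is a hidden linear
# distortion of the "text" side plus a little noise, so the information is
# present in both modalities but not aligned.
rng = np.random.default_rng(0)
n, d = 200, 16
text = rng.normal(size=(n, d))
hidden_distortion = rng.normal(size=(d, d))
image = text @ hidden_distortion + 0.01 * rng.normal(size=(n, d))

# Fit on the first 150 pairs, evaluate on the 50 held-out pairs.
W = fit_linear_map(text[:150], image[:150])
acc_before = retrieval_accuracy(image[150:], text[150:])
acc_after = retrieval_accuracy(image[150:], text[150:], W)
print(f"held-out retrieval accuracy: before={acc_before:.2f}, after={acc_after:.2f}")
```

The toy only shows the mechanism (a linear map on frozen embeddings changing cross-modal scores); it says nothing about how much real CLIP embeddings improve.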
Abstract
CLIP (Contrastive Language-Image Pretraining) has become a popular choice for various downstream tasks. However, recent studies have questioned its ability to represent compositional concepts effectively. These works suggest that CLIP often acts like a bag-of-words (BoW) model, interpreting images and text as sets of individual concepts without grasping the structural relationships. In particular, CLIP struggles to correctly bind attributes to their corresponding objects when multiple objects are present in an image or text. In this work, we investigate why CLIP exhibits this BoW-like behavior. Our key finding is that CLIP does not lack binding information. Through linear probing, robustness tests with increasing object counts, and conjunctive search experiments, we show that attribute-object bindings are already encoded within CLIP's text and image embeddings. The weakness lies in the…
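As a rough illustration of the linear probing mentioned in the abstract, the sketch below trains a softmax classifier on top of fixed embeddings to predict which color is bound to one particular object. It rests on stated assumptions: the embeddings are synthetic (a random linear mix of per-object color slots plus noise) and not real CLIP features, the two-object "cube and sphere" scene is hypothetical, and `train_linear_probe` is an illustrative helper rather than the authors' code. Because each object gets its own slot, binding is linearly decodable here by construction, so the example demonstrates the probing procedure and not the paper's result.

```python
import numpy as np

def train_linear_probe(X, y, num_classes, lr=0.05, steps=1000):
    # Softmax regression trained by full-batch gradient descent.
    # The embeddings X are treated as fixed; only W and b are learned.
    n, d = X.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    Y = np.eye(num_classes)[y]  # one-hot targets
    for _ in range(steps):
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)
        G = (P - Y) / n  # gradient of mean cross-entropy w.r.t. logits
        W -= lr * (X.T @ G)
        b -= lr * G.sum(axis=0)
    return W, b

def probe_accuracy(X, y, W, b):
    # Share of examples whose highest-scoring class matches the label.
    return float(np.mean(np.argmax(X @ W + b, axis=1) == y))

# Hypothetical two-object scenes: a cube and a sphere, each with one of 4 colors.
rng = np.random.default_rng(0)
num_colors, n, d = 4, 600, 32
cube_color = rng.integers(num_colors, size=n)
sphere_color = rng.integers(num_colors, size=n)

# Synthetic "embedding": per-object color slots, linearly mixed, plus noise.
slots = np.concatenate(
    [np.eye(num_colors)[cube_color], np.eye(num_colors)[sphere_color]], axis=1
)
mix = rng.normal(size=(2 * num_colors, d))
X = slots @ mix + 0.1 * rng.normal(size=(n, d))

# Probe target: the color bound to the cube (not merely which colors appear).
W_probe, b_probe = train_linear_probe(X[:500], cube_color[:500], num_colors)
test_acc = probe_accuracy(X[500:], cube_color[500:], W_probe, b_probe)
print(f"held-out probe accuracy: {test_acc:.2f}")
```

With real encoders, `X` would be replaced by frozen image or text embeddings and the labels by annotated attribute-object pairs; high held-out accuracy would then indicate that the binding is linearly recoverable from that modality alone.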
Peer Reviews
Decision·ICLR 2026 Poster
- The use of multiple, complementary experimental paradigms (linear probing, robustness to object count, conjunctive search) provides multi-faceted evidence for the uni-modal binding claim.
- The paper is well-written and clear. The core insight is presented early and reinforced throughout. Figures 1, 3, and 5 are particularly effective in illustrating the key concepts.
1. The conclusion that “the problem lies in cross-modal alignment” remains descriptive. The paper lacks a deeper theoretical or mathematical explanation of why the alignment loss fails to preserve binding (e.g., analyzing the contrastive objective’s geometry or the modality gap quantitatively).
2. The paper strongly asserts that the problem is only alignment. However, linear probing showing high accuracy only proves the information is linearly recoverable, not that it is represented in a way th
The paper is well written. The introduction clearly motivates an important problem and explains how prior work does not fully investigate the source of bag-of-wordness. The related-work section is thorough and effectively positions the paper within the literature. Large-scale experiments: The paper shows that concept binding can be improved with a simple linear probe on large-scale datasets such as ARO and SugarCrepe.
**Differences in results in prior work.** While the paper’s results are encouraging, I would like to understand the differences from existing works that show that CLIP, even after fine-tuning, does not bind concepts [a, b]. In [a], the authors claim that the representations are not expressive enough to bind concepts. They introduce the Concept Binding Benchmark and show that fine-tuning CLIP and linear probes (the type-logical model in their experiments) do not generalize to held-out compositio
This paper investigates the reason behind CLIP models' BoW behavior, which is often left unexplored in the prior compositional reasoning literature. The controlled experiments and analyses could be a valuable reference for guiding better solutions in follow-up work. From the methodological perspective, LABCLIP, which appends a linear projection matrix to either the image or the text embedding, is effective at improving CLIP models' attribute-object binding capabilities, while it does not n
One major concern is the scope of the work with respect to compositional reasoning tasks, in several aspects. The paper mainly focuses on a relatively simple setting, attribute-object binding, and does not yet discuss whether its observations and solutions generalize to more complex and realistic compositional reasoning benchmarks such as Winoground. In addition, since the oracle method to the proposed solution is NegCLIP, which is trained on both origina
Taxonomy
Topics: linguistics and terminology studies
Methods: Contrastive Language-Image Pre-training
