Deciphering the Role of Representation Disentanglement: Investigating Compositional Generalization in CLIP Models
Reza Abbasi, Mohammad Hossein Rohban, Mahdieh Soleymani Baghshah

TL;DR
This paper investigates how representation disentanglement affects compositional generalization in CLIP models, using a carefully synthesized dataset to evaluate true out-of-distribution performance and identify key factors for improvement.
Contribution
The paper introduces a novel dataset for authentic compositional out-of-distribution (C-OoD) evaluation and demonstrates that disentangled representations are crucial for CLIP's compositional generalization.
Findings
Disentanglement correlates with better C-OoD performance
Varying C-OoD generalization observed across CLIP models
Disentanglement metrics can predict generalization capabilities
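One way to check a relationship like the one in these findings is a rank correlation between a per-model disentanglement score and per-model C-OoD accuracy. The sketch below is illustrative only: the function names are hypothetical, Spearman correlation is an assumed choice of statistic rather than the one the paper necessarily uses, and no values from the paper are reproduced.

```python
import math


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)


def spearman(disentanglement_scores, cood_accuracies):
    """Spearman rank correlation: Pearson correlation of the ranks.

    Both arguments are hypothetical per-model measurements, one entry
    per CLIP model, in the same model order.
    """
    return pearson(ranks(disentanglement_scores), ranks(cood_accuracies))
```

A value near 1 would indicate that models ranked as more disentangled also rank higher on C-OoD accuracy; with only a handful of models, such a coefficient should be read cautiously.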
Abstract
CLIP models have recently been shown to exhibit Out of Distribution (OoD) generalization capabilities. However, Compositional Out of Distribution (C-OoD) generalization, which is a crucial aspect of a model's ability to understand unseen compositions of known concepts, is relatively unexplored for CLIP models. Our goal is to address this problem and identify the factors that contribute to C-OoD generalization in CLIP models. We noted that previous studies of the compositional understanding of CLIP models frequently fail to ensure that test samples are genuinely novel relative to the CLIP training data. To this end, we carefully synthesized a large and diverse dataset in the single-object setting, comprising attributes for objects that are highly unlikely to be encountered in the combined training datasets of various CLIP models. This dataset enables an authentic evaluation of C-OoD generalization. Our…
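As a rough illustration of how an evaluation of this kind is commonly scored, the sketch below does CLIP-style zero-shot classification: each image embedding is assigned to the text prompt with the highest cosine similarity, and accuracy is measured over unseen attribute-object compositions. It assumes the embeddings have already been computed by some CLIP model. The function names and the input layout are hypothetical, and the paper's exact protocol may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_predict(image_emb, text_embs):
    """Return the index of the text prompt most similar to the image."""
    scores = [cosine(image_emb, t) for t in text_embs]
    return max(range(len(scores)), key=scores.__getitem__)


def zero_shot_accuracy(image_embs, labels, text_embs):
    """Fraction of images whose nearest prompt matches the true label.

    image_embs: one embedding per test image (assumed precomputed).
    labels:     index into text_embs of the correct composition prompt.
    text_embs:  one embedding per candidate attribute-object prompt.
    """
    correct = sum(
        zero_shot_predict(emb, text_embs) == label
        for emb, label in zip(image_embs, labels)
    )
    return correct / len(labels)
```

Here each entry of `text_embs` would correspond to a prompt naming one attribute-object composition, so a correct prediction requires matching both the attribute and the object.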
Taxonomy
Topics: Italy: Economic History and Contemporary Issues
Methods: Contrastive Language-Image Pre-training
