ABE-CLIP: Training-Free Attribute Binding Enhancement for Compositional Image-Text Matching
Qi Zhang, Yuxu Chen, Lei Deng, and Lili Shen

TL;DR
This paper introduces ABE-CLIP, a training-free method that enhances attribute-object binding in CLIP models by refining token embeddings and aligning local tokens with image patches, leading to improved compositional image-text matching.
Contribution
It proposes a training-free approach that combines semantic refinement of token embeddings with local token-to-patch alignment to improve attribute binding in CLIP.
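To make the local alignment idea concrete, here is a minimal sketch of token-to-patch matching in NumPy. It is an illustration only, not the paper's implementation. The function names (`l2_normalize`, `local_alignment_score`), the max-over-patches aggregation, and the toy embeddings are all assumptions made for this example. The summary does not specify how ABE-CLIP scores local matches or how the Semantic Refinement Mechanism produces the token embeddings.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def local_alignment_score(token_emb, patch_emb):
    """Hypothetical local score: match each text token to its most similar patch.

    token_emb: (num_tokens, dim) embeddings of attribute/object tokens.
    patch_emb: (num_patches, dim) embeddings of image patches.
    Returns the mean of the per-token best similarities and the index of the
    best-matching patch for each token. The max/mean aggregation is an
    assumption of this sketch, not something the source confirms.
    """
    t = l2_normalize(np.asarray(token_emb, dtype=np.float64))
    p = l2_normalize(np.asarray(patch_emb, dtype=np.float64))
    sim = t @ p.T                      # (num_tokens, num_patches) cosine similarities
    best_patch = sim.argmax(axis=1)    # most similar patch per token
    score = float(sim.max(axis=1).mean())
    return score, best_patch

# Toy data: three orthogonal "patches" and two "tokens" that point along
# patch 0 and patch 2 respectively (with different magnitudes).
patches = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0]])
tokens = np.array([[2.0, 0.0, 0.0],
                   [0.0, 0.0, 0.5]])

score, best_patch = local_alignment_score(tokens, patches)
print(score, best_patch.tolist())
```

Unlike a single global image-text similarity, a score of this shape keeps track of which patch supports each token, which is the kind of fine-grained evidence attribute binding depends on.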
Findings
Significantly improves attribute-object binding performance.
Outperforms training-based methods on multiple datasets.
Enhances semantic precision in image-text matching.
Abstract
Contrastive Language-Image Pretraining (CLIP) has achieved remarkable performance in various multimodal tasks. However, it still struggles with compositional image-text matching, particularly in accurately associating objects with their corresponding attributes, because its inherent global representation often overlooks fine-grained semantics for attribute binding. Existing methods often require additional training or extensive hard negative sampling, yet they frequently show limited generalization to novel compositional concepts and fail to fundamentally address the drawbacks of global representations. In this paper, we propose ABE-CLIP, a novel training-free Attribute Binding Enhancement method designed to strengthen attribute-object binding in CLIP-like models. Specifically, we employ a Semantic Refinement Mechanism to refine token embeddings for both object and attribute phrases in…
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling
