ABE-CLIP: Training-Free Attribute Binding Enhancement for Compositional Image-Text Matching

Qi Zhang, Yuxu Chen, Lei Deng, and Lili Shen

arXiv:2512.17178 · cs.CV · December 22, 2025

TL;DR

This paper introduces ABE-CLIP, a training-free method that enhances attribute-object binding in CLIP models by refining token embeddings and aligning local tokens with image patches, leading to improved compositional image-text matching.

Contribution

The paper proposes a training-free approach that combines semantic refinement of token embeddings with local token-patch alignment to improve attribute binding in CLIP.

Findings

01. Significantly improves attribute-object binding performance.

02. Outperforms training-based methods on multiple datasets.

03. Enhances semantic precision in image-text matching.

Abstract

Contrastive Language-Image Pretraining (CLIP) has achieved remarkable performance in various multimodal tasks. However, it still struggles with compositional image-text matching, particularly in accurately associating objects with their corresponding attributes, because its inherent global representation often overlooks fine-grained semantics for attribute binding. Existing methods often require additional training or extensive hard negative sampling, yet they frequently show limited generalization to novel compositional concepts and fail to fundamentally address the drawbacks of global representations. In this paper, we propose ABE-CLIP, a novel training-free Attribute Binding Enhancement method designed to strengthen attribute-object binding in CLIP-like models. Specifically, we employ a Semantic Refinement Mechanism to refine token embeddings for both object and attribute phrases in…
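
The abstract is cut off before the alignment step is spelled out, so the exact formulation is not available here. As a rough illustration of the general idea of matching local text tokens against image patches, the following is a minimal sketch, assuming a max-over-patches cosine similarity averaged over tokens and blended with a global score. The function names, the `alpha` weight, and the blending rule are all hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def local_alignment_score(token_embs, patch_embs):
    """Hypothetical local score: for each text token, take its best-matching
    image patch (max cosine similarity), then average over tokens."""
    best_per_token = [
        max(cosine(token, patch) for patch in patch_embs)
        for token in token_embs
    ]
    return sum(best_per_token) / len(best_per_token)


def combined_score(global_sim, token_embs, patch_embs, alpha=0.5):
    """Hypothetical blend of a global image-text similarity with the local score.
    `alpha` is an assumed mixing weight, not a value from the paper."""
    local = local_alignment_score(token_embs, patch_embs)
    return (1.0 - alpha) * global_sim + alpha * local


# Toy example: two token embeddings (say, an attribute and an object phrase)
# and three patch embeddings, all two-dimensional for readability.
tokens = [[1.0, 0.0], [0.0, 1.0]]
patches = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
score = combined_score(0.2, tokens, patches, alpha=0.5)
```

In this toy setup each token has a patch pointing in exactly the same direction, so the local term is at its maximum; a caption whose attribute token had no well-matching patch would receive a lower local term even if the global similarity were unchanged.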

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling