Not Just What's There: Enabling CLIP to Comprehend Negated Visual Descriptions Without Fine-tuning
Junhao Xiao, Zhiyu Wu, Hao Lin, Yi Chen, Yahui Liu, Xiaoran Zhao, Zixu Wang, Zejiang He

TL;DR
This paper introduces CLIPGlasses, a plug-and-play framework that improves CLIP's understanding of negated visual descriptions without fine-tuning, enhancing robustness and cross-domain performance.
Contribution
The paper proposes CLIPGlasses, a dual-stage framework that disentangles negated semantics from CLIP's text embeddings and predicts a context-aware repulsion strength, improving CLIP's negation comprehension without fine-tuning.
Findings
Outperforms state-of-the-art methods in cross-domain generalization
Remains robust in low-resource settings
Maintains competitive in-domain performance
Abstract
Vision-Language Models (VLMs) like CLIP struggle to understand negation, often embedding affirmatives and negatives similarly (e.g., matching "no dog" with dog images). Existing methods refine negation understanding via fine-tuning CLIP's text encoder, risking overfitting. In this work, we propose CLIPGlasses, a plug-and-play framework that enhances CLIP's ability to comprehend negated visual descriptions. CLIPGlasses adopts a dual-stage design: a Lens module disentangles negated semantics from text embeddings, and a Frame module predicts context-aware repulsion strength, which is integrated into a modified similarity computation to penalize alignment with negated semantics, thereby reducing false positive matches. Experiments show that CLIP equipped with CLIPGlasses achieves competitive in-domain performance and outperforms state-of-the-art methods in cross-domain generalization. Its…
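The repulsion idea in the abstract can be illustrated with a toy calculation. This is a minimal sketch under stated assumptions, not the paper's formulation: the subtractive form `cos(image, text) - strength * cos(image, negated)`, the function names, the fixed `strength` value, and the 2-D embeddings are all hypothetical. In CLIPGlasses the negated embedding would come from the Lens module and the strength from the Frame module, neither of which is specified here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def repulsed_similarity(image_emb, text_emb, negated_emb, strength):
    """Hypothetical modified similarity: plain image-text similarity
    minus a penalty for aligning with the negated concept."""
    return cosine(image_emb, text_emb) - strength * cosine(image_emb, negated_emb)

# Toy 2-D embeddings (assumed values, chosen only for illustration).
dog_image = [1.0, 0.0]    # image that contains a dog
park_image = [0.0, 1.0]   # image without a dog
caption = [0.8, 0.6]      # "a park with no dog", embedded close to "dog"
negated = [1.0, 0.0]      # stand-in for the disentangled negated concept "dog"
strength = 0.5            # stand-in for the predicted repulsion strength

plain_dog = cosine(dog_image, caption)
plain_park = cosine(park_image, caption)
repulsed_dog = repulsed_similarity(dog_image, caption, negated, strength)
repulsed_park = repulsed_similarity(park_image, caption, negated, strength)

print("plain:   ", plain_dog, plain_park)
print("repulsed:", repulsed_dog, repulsed_park)
```

With these toy values, plain cosine similarity ranks the dog image above the dog-free image for the "no dog" caption, which is the false-positive match the abstract describes. The penalized score reverses that ranking.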
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
