Not Just What's There: Enabling CLIP to Comprehend Negated Visual Descriptions Without Fine-tuning

Junhao Xiao; Zhiyu Wu; Hao Lin; Yi Chen; Yahui Liu; Xiaoran Zhao; Zixu Wang; Zejiang He

arXiv:2602.21035·cs.CV·February 25, 2026

TL;DR

This paper introduces CLIPGlasses, a plug-and-play framework that improves CLIP's understanding of negated visual descriptions without fine-tuning, enhancing robustness and cross-domain performance.

Contribution

The paper proposes CLIPGlasses, a novel dual-stage framework that disentangles negation semantics and predicts context-aware repulsion to improve CLIP's negation comprehension without fine-tuning.

Findings

01 Outperforms state-of-the-art methods in cross-domain generalization

02 Remains robust in low-resource settings

03 Maintains competitive in-domain performance

Abstract

Vision-Language Models (VLMs) like CLIP struggle to understand negation, often embedding affirmatives and negatives similarly (e.g., matching "no dog" with dog images). Existing methods refine negation understanding via fine-tuning CLIP's text encoder, risking overfitting. In this work, we propose CLIPGlasses, a plug-and-play framework that enhances CLIP's ability to comprehend negated visual descriptions. CLIPGlasses adopts a dual-stage design: a Lens module disentangles negated semantics from text embeddings, and a Frame module predicts context-aware repulsion strength, which is integrated into a modified similarity computation to penalize alignment with negated semantics, thereby reducing false positive matches. Experiments show that CLIP equipped with CLIPGlasses achieves competitive in-domain performance and outperforms state-of-the-art methods in cross-domain generalization. Its…
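The abstract does not give the modified similarity formula, so the following is only a minimal sketch of the general idea, assuming the penalty is a weighted cosine term subtracted from the usual CLIP score. The function names, the toy 3-d vectors, and the fixed `repulsion` value are all hypothetical stand-ins: in the paper the negated embedding would come from the Lens module and the repulsion strength from the Frame module.

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-d vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def repulsed_similarity(image_emb, text_emb, negated_emb, repulsion):
    """Hypothetical scoring rule (an assumption, not the paper's formula):
    plain image-text similarity minus a weighted penalty for how closely
    the image matches the negated concept."""
    base = cosine(image_emb, text_emb)
    # Only penalize positive alignment with the negated concept.
    penalty = repulsion * max(cosine(image_emb, negated_emb), 0.0)
    return base - penalty


# Toy embeddings: axis 0 stands for "dog" content, axis 1 for "cat" content.
dog_image = np.array([1.0, 0.0, 0.0])
cat_image = np.array([0.0, 1.0, 0.0])

# A caption such as "no dog" that the text encoder embeds close to dog
# content, which is the failure mode the abstract describes.
caption = np.array([0.8, 0.6, 0.0])

# Stand-in for the negated semantics a Lens-like module would extract.
negated = np.array([1.0, 0.0, 0.0])

# Stand-in for the strength a Frame-like module would predict.
repulsion = 0.5

plain_dog = cosine(dog_image, caption)
plain_cat = cosine(cat_image, caption)
fixed_dog = repulsed_similarity(dog_image, caption, negated, repulsion)
fixed_cat = repulsed_similarity(cat_image, caption, negated, repulsion)

print("plain:   ", plain_dog, plain_cat)
print("repulsed:", fixed_dog, fixed_cat)
```

In this toy setup the plain cosine score ranks the dog image above the cat image for the negated caption, while the penalized score reverses that order, which is the kind of false positive the abstract says the repulsion term is meant to reduce.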

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling