Helping CLIP See Both the Forest and the Trees: A Decomposition and Description Approach

Leyan Xue; Zongbo Han; Guangyu Wang; Qinghua Hu; Mingyue Cheng; Changqing Zhang

arXiv:2507.03458·cs.CV·July 8, 2025

TL;DR

This paper introduces a simple, effective method to enhance CLIP's ability to recognize both global and local visual details by using stochastic multi-crop augmentation, addressing its bias towards global image patterns.

Contribution

The paper proposes a plug-and-play multi-crop augmentation technique that enables CLIP to better process localized visual features, improving its performance in various settings.

Findings

01. Improves CLIP's recognition of local details without retraining.

02. Enhances zero-shot, few-shot, and test-time adaptation performance.

03. Addresses CLIP's bias towards global image patterns effectively.

Abstract

Vision-Language Models (VLMs) like CLIP achieve cross-modal semantic alignment through contrastive learning, exhibiting robust zero-shot generalization. Traditional prompt engineering, however, predominantly relies on coarse-grained category labels, neglecting fine-grained local semantics. Existing approaches assume that VLMs inherently recognize localized visual details and attempt to enhance classification by augmenting text prompts with attribute descriptors generated by large language models. However, our systematic experiments reveal critical limitations: CLIP's strong bias toward global image patterns hinders its ability to process localized visual descriptors. To address this fundamental constraint, we propose a simple, effective, and plug-and-play solution that enables CLIP to "See Both the Forest and the Trees." Specifically, we employ stochastic multi-crop augmentation to…
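As a rough illustration of what stochastic multi-crop augmentation could look like in a zero-shot setting, the sketch below samples random crop boxes and scores each class by cosine similarity averaged over the crops. It is not the authors' implementation: the function names (`sample_crops`, `classify`), the scale range, and the mean aggregation are all assumptions made for this example, and the toy vectors stand in for embeddings that would come from CLIP's image and text encoders.

```python
import math
import random


def sample_crops(width, height, n_crops, min_scale=0.3, max_scale=1.0, rng=None):
    """Return n_crops random boxes (left, top, right, bottom) inside the image.

    The scale range is an illustrative assumption, not a value from the paper.
    """
    rng = rng or random.Random()
    boxes = []
    for _ in range(n_crops):
        scale = rng.uniform(min_scale, max_scale)  # fraction of each side kept
        w = max(1, int(width * scale))
        h = max(1, int(height * scale))
        left = rng.randint(0, width - w)   # inclusive bounds keep the box inside
        top = rng.randint(0, height - h)
        boxes.append((left, top, left + w, top + h))
    return boxes


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify(crop_embeddings, text_embeddings):
    """Score each class by its mean cosine similarity over all crops.

    Mean aggregation is an assumption made for this sketch.
    Returns (index of best class, list of per-class scores).
    """
    scores = []
    for text_vec in text_embeddings:
        sims = [cosine(crop_vec, text_vec) for crop_vec in crop_embeddings]
        scores.append(sum(sims) / len(sims))
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy usage: two crop embeddings, two class-description embeddings.
boxes = sample_crops(224, 224, n_crops=8, rng=random.Random(0))
best, scores = classify([[1.0, 0.0], [0.9, 0.1]], [[0.0, 1.0], [1.0, 0.0]])
```

In a real pipeline each box would be cropped from the image, resized to the encoder's input resolution, and embedded before scoring; the crops give the model views in which local details occupy more of the frame.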

Taxonomy

Methods: Contrastive Language-Image Pre-training