Zero-Shot Visual Classification with Guided Cropping
Piyapat Saranrittichai, Mauricio Munoz, Volker Fischer, Chaithanya Kumar Mummadi

TL;DR
This paper introduces GC-CLIP, a method that enhances zero-shot image classification by using object detection to focus on relevant regions, improving accuracy especially for small objects.
Contribution
The paper proposes GC-CLIP, a novel approach combining zero-shot detection with CLIP to improve classification by focusing on objects of interest.
Findings
Improved zero-shot classification accuracy across datasets.
Enhanced performance on small objects.
Consistent gains across different architectures.
Abstract
Pretrained vision-language models, such as CLIP, show promising zero-shot performance across a wide variety of datasets. For closed-set classification tasks, however, there is an inherent limitation: CLIP image encoders are typically designed to extract generic image-level features that summarize superfluous or confounding information for the target tasks. This degrades classification performance, especially when objects of interest cover small areas of the input images. In this work, we propose CLIP with Guided Cropping (GC-CLIP), where we use an off-the-shelf zero-shot object detection model in a preprocessing step to increase the focus of the zero-shot classifier on the object of interest and minimize the influence of extraneous image regions. We empirically show that our approach improves zero-shot classification results across architectures and datasets, favorably for small…
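The pipeline described in the abstract (detect first, crop, then classify the crop zero-shot) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `detect`, `embed_image`, and `embed_text` callables are hypothetical stand-ins for a zero-shot detector such as OWL-ViT and the CLIP image and text encoders, and the `margin` parameter and the fallback to the full image when nothing is detected are assumptions added for the example.

```python
from math import sqrt
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]      # (x0, y0, x1, y1) in pixels
Image = List[List[float]]            # toy image: rows of pixel values


def expand_box(box: Box, width: int, height: int, margin: float = 0.1) -> Box:
    """Grow a box by a fraction of its size, clamped to the image bounds.

    The margin is an assumption of this sketch, not a value from the paper.
    """
    x0, y0, x1, y1 = box
    dx = int(round((x1 - x0) * margin))
    dy = int(round((y1 - y0) * margin))
    return (max(0, x0 - dx), max(0, y0 - dy),
            min(width, x1 + dx), min(height, y1 + dy))


def crop(image: Image, box: Box) -> Image:
    """Return the sub-image covered by the box."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def guided_crop_classify(
    image: Image,
    class_names: Sequence[str],
    detect: Callable[[Image, Sequence[str]], List[Tuple[Box, float]]],
    embed_image: Callable[[Image], Sequence[float]],
    embed_text: Callable[[str], Sequence[float]],
    margin: float = 0.1,
) -> str:
    """Classify an image zero-shot after cropping to the best detection."""
    height, width = len(image), len(image[0])
    detections = detect(image, class_names)      # hypothetical detector call
    if detections:
        # Keep the highest-scoring box and crop around it.
        best_box, _ = max(detections, key=lambda d: d[1])
        image = crop(image, expand_box(best_box, width, height, margin))
    # Otherwise fall back to the full image (an assumption of this sketch).
    image_vec = embed_image(image)
    scores = [cosine(image_vec, embed_text(name)) for name in class_names]
    best = max(range(len(scores)), key=scores.__getitem__)
    return class_names[best]
```

With real models, `detect` would wrap the detector's box predictions for the candidate class names, and the two embedding callables would wrap the CLIP encoders; the sketch only fixes the order of operations, so the crop removes background pixels before the image-level feature is computed.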
Peer Reviews
Decision · Submitted to ICLR 2024
1- The language is clearly presented. The authors use precise and concise language so that the reader can easily understand the background, methodology, and results of the study.
2- Ablation studies are comprehensive. The authors demonstrated the superiority of GC-CLIP over CLIP through many ablation studies and analysed various factors.
1- I suggest the authors report the computational cost of GC-CLIP in the paper, including the parameter count, FLOPs, or inference time, for a more comprehensive comparison with CLIP.
2- I am confused about the necessity of combining OWL-ViT and CLIP, because the authors’ results in the experimental section show that the difference between introducing OWL-ViT for guided cropping and using random cropping is slight. More results and analysis on different datasets should be provided to illustrate th
The approach is simple and easy to re-implement.
The novelty is limited. Many papers [1,2,...] have discussed the impact of cropping in image classification. The paper aims to find an optimal crop but there is no technical contribution since the heavy-lifting is done purely based on the pre-trained object detector. Perhaps the core contribution is to show that an object detector can be used for this purpose? I think it is incremental. The potential applicability is limited. The method is very specific to CLIP and the core method doesn't work
* Solid paper, clearly written and well-motivated.
* The method is pragmatic, and seemingly driven by practical considerations of actually using CLIP and OWL-ViT models in real-life applications.
* The proposed inference pipeline does not require any training and thus can be readily used for many applications.
* The observations of current limitations of CLIP and OWL-ViT models are somewhat surface-level and I believe well-known (although possibly not written down in a publication).
* The proposed solution to the observed limitations (i.e. the proposed inference pipeline) is as far as I know novel, but maybe better presented at a more computer vision focused conference.
* Although the focus of the proposed approach is to correct failure cases of CLIP and OWL-ViT (i.e. small object classification), and
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · COVID-19 diagnosis using AI
Methods: Focus · Contrastive Language-Image Pre-training
