Cross-Modal Retrieval Meets Inference: Improving Zero-Shot Classification with Cross-Modal Retrieval
Seongha Eom, Namgyu Ho, Jaehoon Oh, Se-Young Yun

TL;DR
This paper introduces X-MoRe, a novel inference method that enhances CLIP's zero-shot classification by leveraging cross-modal retrieval and confidence-based ensemble, improving performance without additional training.
Contribution
Proposes X-MoRe, a new inference technique combining cross-modal retrieval and confidence weighting to boost zero-shot classification performance of CLIP.
Findings
X-MoRe improves zero-shot classification accuracy across multiple tasks.
Utilizing external image-text pairs enhances CLIP's inference without retraining.
The method stays robust without any additional training; the external image-text pairs are used only at inference time.
Abstract
Contrastive language-image pre-training (CLIP) has demonstrated remarkable zero-shot classification ability, namely image classification using novel text labels. Existing works have attempted to enhance CLIP by fine-tuning on downstream tasks, but these have inadvertently led to performance degradation on unseen classes, thus harming zero-shot generalization. This paper aims to address this challenge by leveraging readily available image-text pairs from an external dataset for cross-modal guidance during inference. To this end, we propose X-MoRe, a novel inference method comprising two key steps: (1) cross-modal retrieval and (2) modal-confidence-based ensemble. Given a query image, we harness the power of CLIP's cross-modal representations to retrieve relevant textual information from an external image-text pair dataset. Then, we assign higher weights to the more reliable modality…
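The two steps named in the abstract can be illustrated with a minimal, hedged sketch. It is not the authors' implementation: the abstract is truncated and does not specify how captions are retrieved or how modality confidence is measured. The sketch assumes precomputed, unit-normalized embeddings in a shared space, direct image-to-caption cosine similarity for retrieval, mean pooling of the retrieved caption embeddings, and the maximum softmax probability as the confidence of each modality. All function names and the parameters `k` and `temperature` are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def class_probs(emb, class_txt_embs, temperature):
    # Zero-shot scores: temperature-scaled similarity to each class text embedding.
    return softmax([dot(emb, c) / temperature for c in class_txt_embs])

def retrieve_captions(query_img, caption_embs, k):
    # Step 1 (assumed form): keep the k captions most similar to the query image.
    ranked = sorted(caption_embs, key=lambda c: dot(query_img, c), reverse=True)
    return ranked[:k]

def xmore_sketch(query_img, class_txt_embs, caption_embs, k=2, temperature=0.1):
    retrieved = retrieve_captions(query_img, caption_embs, k)
    dim = len(query_img)
    # Assumption: summarize the retrieved text by its mean embedding.
    mean_caption = normalize(
        [sum(c[i] for c in retrieved) / len(retrieved) for i in range(dim)]
    )

    # Step 2 (assumed form): per-modality predictions over the same class set.
    p_img = class_probs(query_img, class_txt_embs, temperature)
    p_txt = class_probs(mean_caption, class_txt_embs, temperature)

    # Assumption: confidence of a modality is its maximum class probability.
    conf_img, conf_txt = max(p_img), max(p_txt)
    w_img = conf_img / (conf_img + conf_txt)
    w_txt = 1.0 - w_img

    # The more confident modality receives the larger weight in the ensemble.
    fused = [w_img * pi + w_txt * pt for pi, pt in zip(p_img, p_txt)]
    return fused, w_img, w_txt

# Toy 2-D embeddings standing in for CLIP features (illustrative values only).
classes = [[1.0, 0.0], [0.0, 1.0]]
query = normalize([1.0, 0.2])
captions = [[1.0, 0.0], normalize([0.9, 0.1]), [0.0, 1.0]]

probs, w_img, w_txt = xmore_sketch(query, classes, captions)
predicted = probs.index(max(probs))
```

Because the two weights sum to one and each modality yields a valid distribution, the fused scores are again a probability distribution over the classes. In practice the embeddings would come from CLIP's image and text encoders, and the retrieval pool would be the external image-text pair dataset.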
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
Methods: Contrastive Language-Image Pre-training
