Cross-Modal Retrieval Meets Inference: Improving Zero-Shot Classification with Cross-Modal Retrieval

Seongha Eom; Namgyu Ho; Jaehoon Oh; Se-Young Yun

arXiv:2308.15273 · cs.CV · August 30, 2023

TL;DR

This paper introduces X-MoRe, a novel inference method that enhances CLIP's zero-shot classification by leveraging cross-modal retrieval and confidence-based ensemble, improving performance without additional training.

Contribution

Proposes X-MoRe, a new inference technique combining cross-modal retrieval and confidence weighting to boost zero-shot classification performance of CLIP.

Findings

01 X-MoRe improves zero-shot classification accuracy across multiple tasks.

02 Utilizing external image-text pairs enhances CLIP's inference without retraining.

03 The method maintains robustness without additional training data.

Abstract

Contrastive language-image pre-training (CLIP) has demonstrated remarkable zero-shot classification ability, namely image classification using novel text labels. Existing works have attempted to enhance CLIP by fine-tuning on downstream tasks, but these have inadvertently led to performance degradation on unseen classes, thus harming zero-shot generalization. This paper aims to address this challenge by leveraging readily available image-text pairs from an external dataset for cross-modal guidance during inference. To this end, we propose X-MoRe, a novel inference method comprising two key steps: (1) cross-modal retrieval and (2) modal-confidence-based ensemble. Given a query image, we harness the power of CLIP's cross-modal representations to retrieve relevant textual information from an external image-text pair dataset. Then, we assign higher weights to the more reliable modality…
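
The two steps named in the abstract can be illustrated with a minimal NumPy sketch on toy embeddings. This is not the authors' implementation: the abstract is truncated before it specifies the weighting rule, so the choices here are assumptions made only for illustration. Specifically, retrieval by image-to-image cosine similarity, mean-pooling of the retrieved caption embeddings, the maximum softmax probability as the confidence score, the temperature value, and the names `xmore_sketch`, `ext_img`, `ext_txt`, and `class_txt` are all hypothetical. In practice the embeddings would come from CLIP's image and text encoders.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def xmore_sketch(query_img, ext_img, ext_txt, class_txt, k=2, temperature=0.1):
    """Hypothetical two-step inference: retrieval, then confidence-weighted ensemble."""
    # Step 1 (assumed form): find the k external pairs whose image embedding
    # is closest to the query image, and pool their caption embeddings.
    sims = ext_img @ query_img
    top = np.argsort(-sims)[:k]
    retrieved_txt = l2_normalize(ext_txt[top].mean(axis=0))

    # Class probabilities from each modality, scored against class-name embeddings.
    p_img = softmax(class_txt @ query_img / temperature)
    p_txt = softmax(class_txt @ retrieved_txt / temperature)

    # Step 2 (assumed form): weight each modality by its confidence,
    # taken here as the maximum class probability.
    c_img, c_txt = p_img.max(), p_txt.max()
    w_img = c_img / (c_img + c_txt)
    probs = w_img * p_img + (1.0 - w_img) * p_txt
    return probs, top

# Toy data: 3 classes in a 3-D embedding space (stand-ins for CLIP embeddings).
class_txt = np.eye(3)
query_img = l2_normalize(np.array([0.6, 0.5, 0.1]))
ext_img = l2_normalize(np.array([[1.0, 0.0, 0.0],
                                 [0.9, 0.1, 0.0],
                                 [0.0, 1.0, 0.0],
                                 [0.0, 0.0, 1.0]]))
ext_txt = np.array([[1.0, 0.0, 0.0],
                    [1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0]])

probs, top = xmore_sketch(query_img, ext_img, ext_txt, class_txt)
print("retrieved indices:", top)
print("predicted class:", int(np.argmax(probs)))
```

In this toy setup the image-only prediction is somewhat ambiguous between the first two classes, while the retrieved captions agree on one class, so the text branch receives the larger weight. No parameters are updated at any point, which matches the abstract's framing of X-MoRe as an inference-time method.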

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling

Methods: Contrastive Language-Image Pre-training