Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification

Chao Yi; Lu Ren; De-Chuan Zhan; Han-Jia Ye

arXiv:2404.17753·cs.CV·April 30, 2024

TL;DR

This paper introduces CODER, a cross-modal neighbor representation that describes each image by its distances to generated neighbor texts in CLIP's embedding space, improving CLIP's image classification in zero-shot and few-shot tasks.

Contribution

The paper proposes a new neighbor-based feature extraction method, CODER, and an auto text generator, ATG, to better leverage CLIP's cross-modal capabilities for classification.

Findings

01. CODER improves CLIP's classification accuracy across multiple datasets.

02. Auto Text Generator (ATG) produces diverse texts without additional training.

03. Enhanced feature alignment leads to better zero-shot and few-shot performance.

Abstract

CLIP showcases exceptional cross-modal matching capabilities due to its training on image-text contrastive learning tasks. However, without specific optimization for unimodal scenarios, its performance in single-modality feature extraction might be suboptimal. Despite this, some studies have directly used CLIP's image encoder for tasks like few-shot classification, introducing a misalignment between its pre-training objectives and feature extraction methods. This inconsistency can diminish the quality of the image's feature representation, adversely affecting CLIP's effectiveness in target tasks. In this paper, we view text features as precise neighbors of image features in CLIP's space and present a novel CrOss-moDal nEighbor Representation (CODER) based on the distance structure between images and their neighbor texts. This feature extraction method aligns better with CLIP's…
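
The abstract's core idea, describing an image by its distance structure to neighbor texts, can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes cosine similarity as the distance measure, uses random arrays as stand-ins for CLIP image and text embeddings, and the helper names `l2_normalize` and `neighbor_representation` are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def neighbor_representation(image_feats, text_feats):
    """Represent each image by its cosine similarity to every neighbor text.

    image_feats: (n_images, d) array of image embeddings (assumed input).
    text_feats:  (n_texts, d) array of text embeddings (assumed input).
    Returns an (n_images, n_texts) array; row i is the new feature of image i.
    """
    img = l2_normalize(np.asarray(image_feats, dtype=np.float64))
    txt = l2_normalize(np.asarray(text_feats, dtype=np.float64))
    return img @ txt.T


# Random stand-ins for embeddings; real use would take them from CLIP's encoders.
rng = np.random.default_rng(0)
image_feats = rng.normal(size=(4, 16))   # 4 images, 16-dim embeddings
text_feats = rng.normal(size=(10, 16))   # 10 neighbor texts
coder_like = neighbor_representation(image_feats, text_feats)
print(coder_like.shape)
```

Each image ends up with one feature per neighbor text, so the dimensionality of the representation is set by the number of texts rather than by the encoder's embedding size.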

Code & Models

Repositories

ycaigogogo/cvpr24-coder (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Text and Document Classification Technologies · Speech Recognition and Synthesis

Methods: Contrastive Learning