Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification
Chao Yi, Lu Ren, De-Chuan Zhan, Han-Jia Ye

TL;DR
This paper introduces CODER, a cross-modal neighbor representation that describes each image by its distances to generated neighbor texts in CLIP's feature space, improving CLIP's image classification in zero-shot and few-shot tasks.
Contribution
The paper proposes CODER, a new neighbor-based feature extraction method, and the Auto Text Generator (ATG), which together make better use of CLIP's cross-modal capabilities for classification.
Findings
CODER improves CLIP's classification accuracy across multiple datasets.
Auto Text Generator (ATG) produces diverse texts without additional training.
Enhanced feature alignment leads to better zero-shot and few-shot performance.
Abstract
CLIP showcases exceptional cross-modal matching capabilities due to its training on image-text contrastive learning tasks. However, without specific optimization for unimodal scenarios, its performance in single-modality feature extraction might be suboptimal. Despite this, some studies have directly used CLIP's image encoder for tasks like few-shot classification, introducing a misalignment between its pre-training objectives and feature extraction methods. This inconsistency can diminish the quality of the image's feature representation, adversely affecting CLIP's effectiveness in target tasks. In this paper, we view text features as precise neighbors of image features in CLIP's space and present a novel CrOss-moDal nEighbor Representation (CODER) based on the distance structure between images and their neighbor texts. This feature extraction method aligns better with CLIP's…
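The idea of representing an image by its distance structure to neighbor texts can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes image and text embeddings have already been extracted, uses random arrays as stand-ins for CLIP features, uses cosine similarity as the image-text distance measure, and the names `l2_normalize` and `neighbor_representation` are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis` (eps guards against zero norm)."""
    norms = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norms, eps)


def neighbor_representation(image_feats, text_feats):
    """Describe each image by its cosine similarity to each of K neighbor texts.

    image_feats: array of shape (N, D), one embedding per image (assumed given).
    text_feats:  array of shape (K, D), one embedding per neighbor text (assumed given).
    Returns an (N, K) array; row i is the neighbor-based feature of image i.
    """
    img = l2_normalize(np.asarray(image_feats, dtype=np.float64))
    txt = l2_normalize(np.asarray(text_feats, dtype=np.float64))
    return img @ txt.T


# Random stand-ins for CLIP embeddings (assumption: 512-dimensional features).
rng = np.random.default_rng(0)
image_feats = rng.normal(size=(4, 512))
text_feats = rng.normal(size=(10, 512))

features = neighbor_representation(image_feats, text_feats)
print(features.shape)
```

In this sketch the new feature has one dimension per neighbor text, so a larger and more diverse set of texts gives a more detailed description of where the image sits in the shared space.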
Taxonomy
Topics: Natural Language Processing Techniques · Text and Document Classification Technologies · Speech Recognition and Synthesis
Methods: Contrastive Learning
