Cross-Modal Mapping: Mitigating the Modality Gap for Few-Shot Image Classification
Xi Yang, Pai Peng, Wulin Xie, Xiaohuan Lu, Jie Wen

TL;DR
This paper introduces Cross-Modal Mapping (CMM), a novel method that aligns image features with text features to improve few-shot image classification by reducing the modality gap in pre-trained models.
Contribution
CMM globally aligns image features with text features using linear transformation and triplet loss, enhancing cross-modal consistency and classification performance.
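The two ingredients named above, a linear transformation and a triplet loss, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names, the cosine-distance metric, the margin value, and the choice of the closest wrong-class text feature as the negative are all assumptions made for illustration, and random vectors stand in for CLIP features.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Scale vectors to (approximately) unit length.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def map_images(image_feats, W, b):
    # Global alignment step: a linear map applied to image features,
    # followed by L2 normalization (the normalization is an assumption).
    return l2_normalize(image_feats @ W + b)

def triplet_loss(mapped, text_protos, labels, margin=0.2):
    # Hinge triplet loss over cosine distances (assumed formulation):
    #   anchor   = mapped image feature
    #   positive = text feature of the image's own class
    #   negative = closest text feature of any other class
    dists = 1.0 - mapped @ text_protos.T          # shape (N, C)
    idx = np.arange(len(labels))
    pos = dists[idx, labels]
    masked = dists.copy()
    masked[idx, labels] = np.inf                  # exclude the true class
    neg = masked.min(axis=1)
    return np.maximum(0.0, pos - neg + margin).mean()

def classify(mapped, text_protos):
    # Text features act as class prototypes: pick the most similar one.
    return (mapped @ text_protos.T).argmax(axis=1)

# Toy data: random stand-ins for text and image features (hypothetical).
rng = np.random.default_rng(0)
num_classes, dim, shots = 3, 8, 4
text_protos = l2_normalize(rng.normal(size=(num_classes, dim)))
labels = np.repeat(np.arange(num_classes), shots)
image_feats = text_protos[labels] + 0.1 * rng.normal(size=(len(labels), dim))

# Identity initialization of the linear map; in practice W and b would be trained.
W, b = np.eye(dim), np.zeros(dim)
mapped = map_images(image_feats, W, b)
print("triplet loss:", triplet_loss(mapped, text_protos, labels))
print("predictions:", classify(mapped, text_protos))
```

In a real few-shot setup, W and b would be optimized by gradient descent on this loss over the labeled support images, which requires an autodiff framework rather than plain NumPy.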
Findings
CMM improves Top-1 accuracy by 1.06% on 11 datasets.
Compared with other methods, CMM simplifies the training process and is more efficient.
CMM performs well under distribution shifts.
Abstract
Few-shot image classification remains a critical challenge in the field of computer vision, particularly in data-scarce environments. Existing methods typically rely on pre-trained visual-language models, such as CLIP. However, due to the modality gap, which is the inconsistent distribution of image and text features in the joint embedding space, directly using these features as class prototypes often leads to suboptimal performance. To address this issue, we propose a novel Cross-Modal Mapping (CMM) method. This method globally aligns image features with the text feature space through linear transformation and optimizes their local spatial relationships using triplet loss, thereby significantly enhancing cross-modal consistency. Experimental results show that compared to other methods, CMM simplifies the training process and demonstrates higher efficiency. Furthermore, CMM improves the…
Taxonomy
Topics: Image Processing Techniques and Applications
Methods: Contrastive Language-Image Pre-training · Triplet Loss
