Cross-Modal Mapping: Mitigating the Modality Gap for Few-Shot Image Classification

Xi Yang; Pai Peng; Wulin Xie; Xiaohuan Lu; Jie Wen

arXiv:2412.20110 · cs.CV · February 17, 2026

TL;DR

This paper introduces Cross-Modal Mapping (CMM), a novel method that aligns image features with text features to improve few-shot image classification by reducing the modality gap in pre-trained models.

Contribution

CMM globally aligns image features with text features using linear transformation and triplet loss, enhancing cross-modal consistency and classification performance.
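The two ingredients named above, a linear transformation and a triplet loss, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the cosine distance, the margin value, and the choice of the closest wrong-class text feature as the negative are all hypothetical, since the summary does not specify them.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Scale vectors to (approximately) unit length so dot products act as cosine similarity.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def map_image_features(image_feats, W, b):
    # Assumed form of the linear transformation: an affine map followed by normalization.
    return l2_normalize(image_feats @ W + b)

def triplet_loss(mapped, text_feats, labels, margin=0.2):
    # Anchor: mapped image feature. Positive: text feature of its own class.
    # Negative (assumption): the closest text feature of any other class.
    text_feats = l2_normalize(text_feats)
    dist = 1.0 - mapped @ text_feats.T           # cosine distance, shape (N, C)
    idx = np.arange(len(labels))
    pos = dist[idx, labels]                      # distance to the correct class
    masked = dist.copy()
    masked[idx, labels] = np.inf                 # exclude the correct class
    neg = masked.min(axis=1)                     # hardest wrong class
    return np.maximum(0.0, pos - neg + margin).mean()
```

In a training loop, `W` and `b` would be the only learned parameters in this sketch, updated to reduce the loss while the pre-trained encoders stay frozen; whether CMM freezes the encoders in exactly this way is not stated in the summary.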

Findings

01. CMM improves Top-1 accuracy by 1.06% on 11 datasets.

02. CMM simplifies training and increases efficiency.

03. CMM performs well under distribution shifts.

Abstract

Few-shot image classification remains a critical challenge in the field of computer vision, particularly in data-scarce environments. Existing methods typically rely on pre-trained visual-language models, such as CLIP. However, due to the modality gap, which is the inconsistent distribution of image and text features in the joint embedding space, directly using these features as class prototypes often leads to suboptimal performance. To address this issue, we propose a novel Cross-Modal Mapping (CMM) method. This method globally aligns image features with the text feature space through linear transformation and optimizes their local spatial relationships using triplet loss, thereby significantly enhancing cross-modal consistency. Experimental results show that compared to other methods, CMM simplifies the training process and demonstrates higher efficiency. Furthermore, CMM improves the…
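To make the notions of "class prototypes" and "modality gap" concrete, here is a small sketch. It assumes the gap can be approximated by the distance between the centroids of normalized image and text features, which is one common proxy and not necessarily the measure used in the paper. The function names are hypothetical.

```python
import numpy as np

def normalize(x):
    # Unit-normalize each row so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def centroid_gap(image_feats, text_feats):
    # Assumed proxy for the modality gap: Euclidean distance between
    # the mean normalized image feature and the mean normalized text feature.
    diff = normalize(image_feats).mean(axis=0) - normalize(text_feats).mean(axis=0)
    return float(np.linalg.norm(diff))

def classify_with_text_prototypes(image_feats, text_prototypes):
    # Treat each class's text feature as its prototype and assign every image
    # to the prototype with the highest cosine similarity.
    sims = normalize(image_feats) @ normalize(text_prototypes).T
    return sims.argmax(axis=1)
```

When the two modalities occupy different regions of the embedding space, `centroid_gap` is large and the similarities computed in `classify_with_text_prototypes` become less discriminative, which is the failure mode the abstract attributes to using these features directly.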

Taxonomy

Topics: Image Processing Techniques and Applications

Methods: Contrastive Language-Image Pre-training · Triplet Loss