FLEX-CLIP: Feature-Level GEneration Network Enhanced CLIP for X-shot Cross-modal Retrieval

Jingyou Xie; Jiayi Kuang; Zhenzhou Lin; Jiarui Ouyang; Zishuo Zhao; Ying Shen

arXiv:2411.17454·cs.CV·November 27, 2024

Open Access

TL;DR

FLEX-CLIP enhances few-shot cross-modal retrieval by generating pseudo features and fusing features to reduce degradation, significantly improving performance over existing methods.

Contribution

The paper introduces FLEX-CLIP, a novel framework combining feature generation and fusion techniques to address data imbalance and feature degradation in few-shot cross-modal retrieval.

Findings

01. Achieves 7%-15% performance improvement on benchmarks.

02. Effectively reduces feature degradation in X-shot scenarios.

03. Demonstrates the effectiveness of feature-level generation and fusion.

Abstract

Given a query from one modality, few-shot cross-modal retrieval (CMR) retrieves semantically similar instances in another modality, where the target domain includes classes that are disjoint from the source domain. Compared with classical few-shot CMR methods, vision-language pretraining methods like CLIP have shown strong few-shot and zero-shot learning performance. However, they still face challenges due to (1) the feature degradation encountered in the target domain and (2) the extreme data imbalance. To tackle these issues, we propose FLEX-CLIP, a novel Feature-level Generation Network Enhanced CLIP. FLEX-CLIP includes two training stages. In multimodal feature generation, we propose a composite multimodal VAE-GAN network to capture real feature distribution patterns and generate pseudo samples based on CLIP features, addressing data imbalance. For common space projection, we develop…
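The retrieval task described in the abstract can be illustrated with a minimal sketch. It assumes that queries and gallery items have already been embedded into a shared feature space (for example by an encoder such as CLIP) and ranks gallery items by cosine similarity. The function names `cosine` and `retrieve` and the toy three-dimensional vectors are hypothetical; this is a generic nearest-neighbor ranking, not the FLEX-CLIP generation or projection method itself.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def retrieve(query, gallery, top_k=3):
    """Return indices of the top_k gallery items most similar to the query."""
    scored = [(cosine(query, item), idx) for idx, item in enumerate(gallery)]
    # Sort by descending similarity; break ties by the lower index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:top_k]]


# Toy features (assumed): two text queries and three image features
# that already live in the same 3-dimensional space.
text_queries = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
image_gallery = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9], [0.1, 0.8, 0.1]]

ranked = [retrieve(q, image_gallery, top_k=2) for q in text_queries]
print(ranked)
```

In a few-shot setting the gallery would contain classes unseen or rarely seen during training, which is where the feature degradation and data imbalance named in the abstract would affect the quality of these rankings.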

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Multimodal Machine Learning Applications

Methods: Contrastive Language-Image Pre-training