GET: Unlocking the Multi-modal Potential of CLIP for Generalized Category Discovery

Enguang Wang; Zhimao Peng; Zhengyuan Xie; Fei Yang; Xialei Liu; Ming-Ming Cheng

arXiv:2403.09974 · cs.CV · March 24, 2025 · 1 citation

Open Access · 1 repository

TL;DR

This paper introduces a multi-modal approach to generalized category discovery (GCD) that uses both the visual and text features of CLIP. It synthesizes pseudo text embeddings for unlabelled images and fuses the two modalities to improve classification accuracy.

Contribution

It proposes a Text Embedding Synthesizer (TES) to generate pseudo text embeddings and a dual-branch framework for joint visual-text learning, unlocking CLIP's multi-modal potential for GCD.

Findings

01 Outperforms baseline methods on all GCD benchmarks

02 Achieves new state-of-the-art results

03 Effectively fuses visual and text modalities for better classification

Abstract

Given unlabelled datasets containing both old and new categories, generalized category discovery (GCD) aims to accurately discover new classes while correctly classifying old classes. Current GCD methods only use a single visual modality of information, resulting in a poor classification of visually similar classes. As a different modality, text information can provide complementary discriminative information, which motivates us to introduce it into the GCD task. However, the lack of class names for unlabelled data makes it impractical to utilize text information. To tackle this challenging problem, in this paper, we propose a Text Embedding Synthesizer (TES) to generate pseudo text embeddings for unlabelled samples. Specifically, our TES leverages the property that CLIP can generate aligned vision-language features, converting visual embeddings into tokens of the CLIP's text encoder to…
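
The conversion described in the abstract (visual embedding, to tokens for the text encoder, to a pseudo text embedding) can be pictured with a small sketch. This is not the authors' implementation: the dimensions, the single linear projection, and the `stand_in_text_encoder` function are all hypothetical placeholders chosen so the example runs on its own with NumPy, and the real CLIP encoders are replaced by random data and mean pooling.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes below are hypothetical and chosen only for illustration.
EMBED_DIM = 8    # size of a visual embedding
TOKEN_DIM = 8    # width of the text encoder's token space
NUM_TOKENS = 4   # number of pseudo tokens produced per image


def visual_to_pseudo_tokens(visual_emb, weight, bias):
    """Map one visual embedding to NUM_TOKENS pseudo tokens with one linear layer."""
    flat = visual_emb @ weight + bias           # shape: (NUM_TOKENS * TOKEN_DIM,)
    return flat.reshape(NUM_TOKENS, TOKEN_DIM)  # shape: (NUM_TOKENS, TOKEN_DIM)


def stand_in_text_encoder(tokens):
    """Placeholder for a frozen text encoder: mean-pool the tokens, then L2-normalise."""
    pooled = tokens.mean(axis=0)
    return pooled / np.linalg.norm(pooled)


# Randomly initialised projection; in practice this would be learned.
weight = rng.normal(size=(EMBED_DIM, NUM_TOKENS * TOKEN_DIM))
bias = np.zeros(NUM_TOKENS * TOKEN_DIM)

# A random vector stands in for the visual embedding of an unlabelled image.
visual_emb = rng.normal(size=EMBED_DIM)

pseudo_tokens = visual_to_pseudo_tokens(visual_emb, weight, bias)
pseudo_text_emb = stand_in_text_encoder(pseudo_tokens)

print(pseudo_tokens.shape, pseudo_text_emb.shape)
```

The point of the sketch is only the data flow: no class name is needed at any step, because the tokens fed to the text encoder are derived from the image itself rather than from a written prompt.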

Code & Models

Repositories

enguangw/get
PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Biomedical Text Mining and Ontologies

Methods: Attention Is All You Need · Softmax · Linear Layer · Contrastive Language-Image Pre-training · Multi-Head Attention · Synthesizer