GET: Unlocking the Multi-modal Potential of CLIP for Generalized Category Discovery
Enguang Wang, Zhimao Peng, Zhengyuan Xie, Fei Yang, Xialei Liu, Ming-Ming Cheng

TL;DR
This paper introduces a novel multi-modal approach using CLIP's visual and text features for generalized category discovery, significantly improving classification accuracy by synthesizing pseudo text embeddings and fusing modalities.
Contribution
It proposes a Text Embedding Synthesizer (TES) to generate pseudo text embeddings and a dual-branch framework for joint visual-text learning, unlocking CLIP's multi-modal potential for GCD.
Findings
Outperforms baseline methods on all GCD benchmarks
Achieves new state-of-the-art results
Effectively fuses visual and text modalities for better classification
Abstract
Given unlabelled datasets containing both old and new categories, generalized category discovery (GCD) aims to accurately discover new classes while correctly classifying old classes. Current GCD methods use only a single visual modality, resulting in poor classification of visually similar classes. As a different modality, text can provide complementary discriminative information, which motivates us to introduce it into the GCD task. However, the lack of class names for unlabelled data makes it impractical to utilize text information directly. To tackle this challenging problem, we propose a Text Embedding Synthesizer (TES) to generate pseudo text embeddings for unlabelled samples. Specifically, our TES leverages the property that CLIP can generate aligned vision-language features, converting visual embeddings into tokens for CLIP's text encoder to…
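The idea described in the abstract, mapping a visual embedding to input tokens for a text encoder and reading out a pseudo text embedding, can be illustrated with a minimal sketch. Everything here is an assumption for illustration rather than the authors' implementation: the dimensions are toy values, `ToyTextEncoder` is a random placeholder standing in for CLIP's frozen text encoder, `PseudoTextSynthesizer` is a hypothetical single linear map, and fusing the two modalities by concatenation is only one plausible choice.

```python
import numpy as np

rng = np.random.default_rng(0)

EMBED_DIM = 8    # toy size of the shared vision-language embedding space
TOKEN_DIM = 8    # toy size of one text-encoder input token
NUM_TOKENS = 4   # hypothetical number of pseudo tokens per image


def l2_normalize(x, axis=-1):
    """Scale vectors to unit length along the given axis."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


class ToyTextEncoder:
    """Placeholder for a frozen text encoder; NOT CLIP's real transformer."""

    def __init__(self):
        self.proj = rng.normal(size=(TOKEN_DIM, EMBED_DIM))

    def __call__(self, tokens):
        # tokens: (batch, NUM_TOKENS, TOKEN_DIM) -> (batch, EMBED_DIM)
        pooled = tokens.mean(axis=1)
        return l2_normalize(pooled @ self.proj)


class PseudoTextSynthesizer:
    """Hypothetical linear map from a visual embedding to pseudo tokens."""

    def __init__(self, text_encoder):
        self.weight = rng.normal(size=(EMBED_DIM, NUM_TOKENS * TOKEN_DIM))
        self.text_encoder = text_encoder

    def __call__(self, visual_emb):
        batch = visual_emb.shape[0]
        # Project each visual embedding to NUM_TOKENS pseudo tokens.
        tokens = (visual_emb @ self.weight).reshape(batch, NUM_TOKENS, TOKEN_DIM)
        # Feed the pseudo tokens to the text encoder in place of real words.
        return self.text_encoder(tokens)


# Stand-in for image features from a visual encoder (3 unlabelled samples).
visual_emb = l2_normalize(rng.normal(size=(3, EMBED_DIM)))

synthesizer = PseudoTextSynthesizer(ToyTextEncoder())
pseudo_text_emb = synthesizer(visual_emb)

# Assumed fusion step: concatenate the visual and pseudo text embeddings.
fused = np.concatenate([visual_emb, pseudo_text_emb], axis=-1)
print(pseudo_text_emb.shape, fused.shape)
```

In a real setting the linear map would be trained while the text encoder stays frozen, so that the pseudo text embeddings land in the same space as genuine class-name embeddings; the random weights above only demonstrate the data flow and tensor shapes.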
Taxonomy
Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Biomedical Text Mining and Ontologies
Methods: Attention Is All You Need · Softmax · Linear Layer · Contrastive Language-Image Pre-training · Multi-Head Attention · Synthesizer
