CLIP-GCD: Simple Language Guided Generalized Category Discovery
Rabah Ouldnoughi, Chia-Wen Kuo, Zsolt Kira

TL;DR
This paper introduces CLIP-GCD, a novel approach for generalized category discovery that leverages vision-language models and retrieval-based techniques to improve clustering of known and unknown categories, especially out-of-distribution ones.
Contribution
The paper proposes a new method combining CLIP's vision-language features with a retrieval mechanism for enhanced semi-supervised clustering in GCD tasks, outperforming prior approaches.
Findings
State-of-the-art results on multiple datasets
Effective handling of out-of-distribution categories
Improved clustering accuracy through retrieval-based feature augmentation
Abstract
Generalized Category Discovery (GCD) requires a model to both classify known categories and cluster unknown categories in unlabeled data. Prior methods leveraged self-supervised pre-training combined with supervised fine-tuning on the labeled data, followed by simple clustering methods. In this paper, we posit that such methods are still prone to poor performance on out-of-distribution categories, and do not leverage a key ingredient: Semantic relationships between object categories. We therefore propose to leverage multi-modal (vision and language) models, in two complementary ways. First, we establish a strong baseline by replacing uni-modal features with CLIP, inspired by its zero-shot performance. Second, we propose a novel retrieval-based mechanism that leverages CLIP's aligned vision-language representations by mining text descriptions from a text corpus for the labeled and…
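The retrieval step described in the abstract can be illustrated with a minimal sketch. The abstract is truncated here, so the specifics below are assumptions rather than the authors' implementation: each image embedding retrieves its top-k most similar text embeddings by cosine similarity, the retrieved embeddings are mean-pooled, and the pooled vector is concatenated with the image embedding before clustering. The function name retrieve_and_augment, the value of k, and the random vectors standing in for CLIP embeddings are all hypothetical.

```python
import math
import random

def normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def retrieve_and_augment(image_emb, text_embs, k=4):
    """Hypothetical helper: augment one image embedding with retrieved text.

    Assumes image and text embeddings live in a shared space (as CLIP's do),
    so cosine similarity between them is meaningful.
    """
    img = normalize(image_emb)
    texts = [normalize(t) for t in text_embs]
    # Cosine similarity reduces to a dot product on unit vectors.
    sims = [dot(img, t) for t in texts]
    # Indices of the k most similar text descriptions, best first.
    top_idx = sorted(range(len(texts)), key=lambda i: sims[i], reverse=True)[:k]
    # Mean-pool the retrieved text embeddings (an assumed pooling choice).
    pooled = [sum(texts[i][d] for i in top_idx) / len(top_idx)
              for d in range(len(img))]
    # Concatenate the visual and pooled textual views into one feature vector.
    return img + normalize(pooled), top_idx

# Random vectors stand in for real CLIP image and text embeddings.
rng = random.Random(0)
image_emb = [rng.gauss(0, 1) for _ in range(16)]
text_embs = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(50)]
features, top_idx = retrieve_and_augment(image_emb, text_embs, k=4)
```

Under these assumptions, the concatenated features for labeled and unlabeled images would then be passed to a clustering step such as semi-supervised k-means.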
Taxonomy
Topics: Multimodal Machine Learning Applications · Text and Document Classification Technologies · Domain Adaptation and Few-Shot Learning
Methods: Contrastive Language-Image Pre-training
