CLIP-GCD: Simple Language Guided Generalized Category Discovery

Rabah Ouldnoughi; Chia-Wen Kuo; Zsolt Kira

arXiv:2305.10420 · cs.CV · May 18, 2023 · 1 citation

Open Access

TL;DR

This paper introduces CLIP-GCD, a novel approach for generalized category discovery that leverages vision-language models and retrieval-based techniques to improve clustering of known and unknown categories, especially out-of-distribution ones.

Contribution

The paper proposes a new method combining CLIP's vision-language features with a retrieval mechanism for enhanced semi-supervised clustering in GCD tasks, outperforming prior approaches.

Findings

01 State-of-the-art results on multiple datasets

02 Effective handling of out-of-distribution categories

03 Improved clustering accuracy through retrieval-based feature augmentation

Abstract

Generalized Category Discovery (GCD) requires a model to both classify known categories and cluster unknown categories in unlabeled data. Prior methods leveraged self-supervised pre-training combined with supervised fine-tuning on the labeled data, followed by simple clustering methods. In this paper, we posit that such methods are still prone to poor performance on out-of-distribution categories, and do not leverage a key ingredient: Semantic relationships between object categories. We therefore propose to leverage multi-modal (vision and language) models, in two complementary ways. First, we establish a strong baseline by replacing uni-modal features with CLIP, inspired by its zero-shot performance. Second, we propose a novel retrieval-based mechanism that leverages CLIP's aligned vision-language representations by mining text descriptions from a text corpus for the labeled and…
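The retrieval idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes image and text embeddings already live in a shared space (as CLIP's do) and uses toy vectors in place of real CLIP outputs. The function names `retrieve_top_k` and `augment` are hypothetical. The excerpt above does not say how retrieved text is combined with the image feature, so the mean pooling and concatenation here are assumptions made only for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    """Dot product; equals cosine similarity when both inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))


def retrieve_top_k(image_emb, text_embs, k):
    """Return indices of the k corpus texts most similar to the image embedding."""
    query = l2_normalize(image_emb)
    scored = [(dot(query, l2_normalize(t)), i) for i, t in enumerate(text_embs)]
    # Sort by descending similarity; break ties by corpus index for determinism.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scored[:k]]


def augment(image_emb, text_embs, k=2):
    """Concatenate the image feature with the mean of its retrieved text features.

    Assumption: mean pooling plus concatenation is one simple way to fuse the
    two modalities; the source excerpt does not specify the fusion step.
    """
    idx = retrieve_top_k(image_emb, text_embs, k)
    dim = len(text_embs[0])
    normalized = [l2_normalize(text_embs[i]) for i in idx]
    pooled = [sum(t[d] for t in normalized) / len(normalized) for d in range(dim)]
    return l2_normalize(image_emb) + l2_normalize(pooled)


# Toy stand-ins for CLIP embeddings (hypothetical values).
image_emb = [1.0, 0.0]
text_embs = [[0.9, 0.1], [0.0, 1.0], [1.0, 0.0]]
top = retrieve_top_k(image_emb, text_embs, k=2)
feature = augment(image_emb, text_embs, k=2)
```

In a real pipeline the augmented features for all labeled and unlabeled images would then be passed to a clustering step; that step is omitted here because the excerpt does not describe it in detail.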

Taxonomy

Topics: Multimodal Machine Learning Applications · Text and Document Classification Technologies · Domain Adaptation and Few-Shot Learning

Methods: Contrastive Language-Image Pre-training