SpectralGCD: Spectral Concept Selection and Cross-modal Representation Learning for Generalized Category Discovery

Lorenzo Caselli; Marco Mistretta; Simone Magistri; Andrew D. Bagdanov

arXiv:2602.17395·cs.CV·February 20, 2026

Open Access · 3 Reviews

TL;DR

SpectralGCD introduces a multimodal approach for generalized category discovery that leverages CLIP-based cross-modal similarities and spectral filtering to improve accuracy and efficiency across multiple benchmarks.

Contribution

It proposes SpectralGCD, a novel method combining spectral filtering and cross-modal representations for improved generalized category discovery.

Findings

01. Achieves state-of-the-art accuracy on six benchmarks.

02. Reduces computational cost compared to existing methods.

03. Maintains semantic quality through spectral filtering and knowledge distillation.

Abstract

Generalized Category Discovery (GCD) aims to identify novel categories in unlabeled data while leveraging a small labeled subset of known classes. Training a parametric classifier solely on image features often leads to overfitting to old classes, and recent multimodal approaches improve performance by incorporating textual information. However, they treat modalities independently and incur high computational cost. We propose SpectralGCD, an efficient and effective multimodal approach to GCD that uses CLIP cross-modal image-concept similarities as a unified cross-modal representation. Each image is expressed as a mixture over semantic concepts from a large task-agnostic dictionary, which anchors learning to explicit semantics and reduces reliance on spurious visual cues. To maintain the semantic quality of representations learned by an efficient student, we introduce Spectral Filtering…
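
The cross-modal representation described in the abstract can be illustrated with a toy sketch. It assumes the image and concept embeddings already come from a CLIP-style encoder, and it assumes the "mixture over semantic concepts" is a temperature-scaled softmax over cosine similarities; the abstract does not state the exact normalization. The function name, the temperature value, and the three-dimensional embeddings are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_modal_representation(image_emb, concept_embs, temperature=0.1):
    """Return a probability vector over concepts for one image.

    Assumption: the mixture is a softmax over image-concept cosine
    similarities. This is an illustrative choice, not the paper's
    confirmed formulation.
    """
    img = l2_normalize(image_emb)
    sims = [
        sum(a * b for a, b in zip(img, l2_normalize(concept)))
        for concept in concept_embs
    ]
    logits = [s / temperature for s in sims]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy stand-ins for encoder outputs (hypothetical values).
concepts = {
    "feathers": [1.0, 0.0, 0.0],
    "wheels": [0.0, 1.0, 0.0],
    "grass": [0.0, 0.0, 1.0],
}
image = [0.9, 0.1, 0.3]

mixture = cross_modal_representation(image, list(concepts.values()))
for name, weight in zip(concepts, mixture):
    print(f"{name}: {weight:.3f}")
```

In this toy setup the image embedding lies closest to "feathers", so that concept receives most of the mixture weight. A real dictionary would hold many more concepts, each embedded by the text encoder.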

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

- The idea of spectral filtering on a dictionary of concepts is cute.
- The paper is quite complex. There are multiple stages to the proposed method. Yet the proposed method is still efficient.

Weaknesses

- The authors claim that the dictionary of concepts is task-agnostic. But this isn't really true, right? There must be some overlap between the dictionary and the concepts of the target dataset. Otherwise it would not work.
- GCD is useful for discovering concepts in a dataset that do not fit neatly into the existing label set. However, I would argue that the new assumptions that the authors are introducing render the task of GCD meaningless.
- In particular, the authors use a Teacher model CLI

Reviewer 02 · Rating 4 · Confidence 5

Strengths

1. The paper addresses the Generalized Category Discovery task, focusing on the common problem where models overfit to the labeled "Old" classes and perform poorly on unlabeled "New" classes.
2. Proposed Core Idea: It introduces a novel "cross-modal representation" for each image. Instead of using raw image features, it represents an image as a vector of similarity scores against a large, task-agnostic dictionary of semantic concepts, computed using a pre-trained CLIP model.
3. To refine this

Weaknesses

1. The primary weakness lies in the justification for "Spectral Filtering". The motivation is to select "task-relevant" concepts. However, the mechanism (performing PCA on the global cross-modal covariance matrix) selects concepts that explain the most variance across the dataset. High variance does not necessarily equate to high discriminative power or task relevance. For example, a common background (e.g., 'sky', 'grass') present across many different classes could easily form a principal comp
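
The reviewer's concern can be made concrete with a toy sketch. It is not the authors' Spectral Filtering procedure: it only estimates the leading principal component of a small, hypothetical image-by-concept similarity matrix via power iteration, and ranks concepts by their squared loading on that component. The concept names, the scores, and the scoring rule are all assumptions made for illustration.

```python
import math


def covariance(rows):
    """Sample covariance matrix of the columns (concepts) of `rows`."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def top_eigenpair(matrix, iters=500):
    """Leading eigenvalue and eigenvector of a symmetric matrix (power iteration)."""
    d = len(matrix)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigval = sum(
        v[i] * sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)
    )
    return eigval, v


# Hypothetical similarity scores: two bird images, then two car images.
# "sky" varies strongly but independently of class; "beak" and "wheel"
# vary little but separate the two classes.
concept_names = ["sky", "beak", "wheel"]
scores = [
    [0.9, 0.30, 0.10],
    [0.1, 0.32, 0.12],
    [0.8, 0.10, 0.30],
    [0.2, 0.12, 0.28],
]

eigval, component = top_eigenpair(covariance(scores))
# Assumed scoring rule: squared loading on the leading component.
ranked = sorted(
    ((name, load * load) for name, load in zip(concept_names, component)),
    key=lambda item: item[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.4f}")
```

In this toy data the class-independent background concept "sky" dominates the leading component, while the class-discriminative concepts "beak" and "wheel" receive near-zero weight. This shows only that variance and discriminative power can diverge; it says nothing about how the actual method behaves on real data.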

Reviewer 03 · Rating 4 · Confidence 5

Strengths

(1) The idea of using the cross-modal representations is interesting. (2) The paper is clearly written and easy to follow. (3) The performance is promising.

Weaknesses

(1) Using VLMs (e.g., CLIP) for GCD risks data leakage, as these models may have been exposed to images or names of the “unknown” classes. Prior work (e.g., GET) evaluates on splits unseen by CLIP to mitigate this. Please discuss this issue and, if possible, include experiments on CLIP-unseen splits or provide a robustness analysis addressing the leakage problem.
(2) What is the performance when using ViT-B/16 as the teacher or using ViT-H/14 as student? To what extent do the gains stem from di

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling