Multi-Modal Representation Learning via Semi-Supervised Rate Reduction for Generalized Category Discovery

Wei He; Xianghan Meng; Zhiyuan Huang; Xianbiao Qi; Rong Xiao; Chun-Guang Li

arXiv:2602.19910·cs.CV·April 1, 2026

TL;DR

This paper introduces SSR$^2$-GCD, a multi-modal learning framework for generalized category discovery that emphasizes intra-modality alignment and leverages vision-language models for improved recognition of known and unknown categories.

Contribution

The paper proposes a novel semi-supervised rate reduction method for multi-modal GCD, emphasizing intra-modality alignment and integrating vision-language prompts for enhanced knowledge transfer.

Findings

01

Outperforms existing methods on benchmark datasets.

02

Effectively learns cross-modality representations with desired structural properties.

03

Leverages vision-language models to improve recognition of unknown categories.

Abstract

Generalized Category Discovery (GCD) aims to identify both known and unknown categories, with only partial labels given for the known categories, posing a challenging open-set recognition problem. State-of-the-art approaches to the GCD task are usually built on multi-modality representation learning, which depends heavily on inter-modality alignment. However, few of them impose a proper intra-modality alignment to produce the desired underlying structure in the representation distributions. In this paper, we propose a novel and effective multi-modal representation learning framework for GCD via Semi-Supervised Rate Reduction, called SSR$^{2}$-GCD, which learns cross-modality representations with desired structural properties by emphasizing proper alignment of intra-modality relationships. Moreover, to boost knowledge transfer, we integrate prompt candidates by leveraging the inter-modal…
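
For readers unfamiliar with the "rate reduction" in the title, the sketch below computes the standard coding rate reduction quantity from the maximal coding rate reduction (MCR$^2$) literature: the coding rate of all features minus the size-weighted coding rates of each group. It is background only and an assumption about what the paper builds on. The abstract shown here is truncated and does not specify the semi-supervised objective, so this is not the SSR$^{2}$-GCD loss. The function names `coding_rate` and `rate_reduction`, the distortion parameter `eps`, and the `(d, n)` feature layout are illustrative choices.

```python
import numpy as np


def coding_rate(Z, eps=0.5):
    """Coding rate R(Z, eps) = 1/2 * logdet(I + d / (n * eps^2) * Z Z^T).

    Z has shape (d, n): n feature vectors of dimension d stored as columns.
    """
    d, n = Z.shape
    scaled_gram = (d / (n * eps**2)) * (Z @ Z.T)
    # slogdet is numerically safer than log(det(...)); the matrix is
    # positive definite, so the sign is +1 and only the log-magnitude matters.
    _, logdet = np.linalg.slogdet(np.eye(d) + scaled_gram)
    return 0.5 * logdet


def rate_reduction(Z, labels, eps=0.5):
    """Rate of the whole set minus the size-weighted rates of each group.

    labels is a length-n integer array assigning each column of Z to a group.
    Larger values mean groups are individually compact but jointly spread out.
    """
    _, n = Z.shape
    total = coding_rate(Z, eps)
    within = 0.0
    for k in np.unique(labels):
        Zk = Z[:, labels == k]
        within += (Zk.shape[1] / n) * coding_rate(Zk, eps)
    return total - within
```

With unit-norm features drawn from two orthogonal subspaces, `rate_reduction` is larger for the true grouping than for a shuffled one, and it is zero when every sample is placed in a single group. That behaviour is why the quantity is used as a measure of how well-structured a set of representations is.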
