Delving into Spectral Clustering with Vision-Language Representations

Bo Peng; Yuanwei Hu; Bo Liu; Ling Chen; Jie Lu; Zhen Fang

arXiv:2602.09586·cs.CV·March 17, 2026


Open Access · 3 Reviews

TL;DR

This paper introduces a multi-modal spectral clustering approach leveraging vision-language models and neural tangent kernels, significantly improving clustering performance across diverse datasets.

Contribution

It extends spectral clustering to a multi-modal setting using vision-language pre-training and neural tangent kernels, enhancing clustering accuracy and robustness.

Findings

1. Outperforms state-of-the-art on 16 benchmarks
2. Effectively leverages cross-modal alignment
3. Improves within-cluster connectivity

Abstract

Spectral clustering is known as a powerful technique in unsupervised data analysis. The vast majority of approaches to spectral clustering are driven by a single modality, leaving the rich information in multi-modal representations untapped. Inspired by the recent success of vision-language pre-training, this paper enriches the landscape of spectral clustering from a single-modal to a multi-modal regime. Particularly, we propose Neural Tangent Kernel Spectral Clustering that leverages cross-modal alignment in pre-trained vision-language models. By anchoring the neural tangent kernel with positive nouns, i.e., those semantically close to the images of interest, we arrive at formulating the affinity between images as a coupling of their visual proximity and semantic overlap. We show that this formulation amplifies within-cluster connections while suppressing spurious ones across clusters,…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1) The problem of incorporating multiple data modalities into spectral clustering is interesting and timely, given current advances in multi-modal foundation models.
2) Using the neural tangent kernel to anchor the data with multi-modal information is interesting and sound.
3) Better clustering performance is observed on various image datasets.

Weaknesses

1) To bring both vision and language modalities into spectral clustering of images, this work extends existing techniques and strategies such as "positive nouns" and the "neural tangent kernel", which makes its technical contribution moderate.
2) Many of the evaluation datasets used in this work have appeared in the training of the vision-language foundation models, which can raise a data-leakage problem, since the representations of interest have already been well obtained. Therefo

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- The paper is easy to follow.
- Eq. (8) decomposes the affinity into visual proximity and a text-induced overlap. It is easy to compute once noun logits are available.
- The RAD update and its fixed-point interpretation are derived and come with a convergence argument for the linearized step.
- Results cover classic, fine-grained, and domain-shift benchmarks. Ablations on $\tau, q, \mu, \lambda$ are provided.
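The decomposition this reviewer attributes to Eq. (8) can be sketched in a few lines of Python. This is an illustration only, not the authors' implementation: the kernel form is the one quoted in this review's weaknesses, while the definition of the noun scores `s_i` as a temperature-scaled softmax over image-noun similarities, the helper names, and the toy 2-D features are all assumptions made here.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    # Project a vector onto the unit sphere, as with CLIP-style features.
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def noun_scores(z, nouns, tau):
    # ASSUMPTION: s_i[k] is a softmax over image-noun similarities scaled by 1/tau.
    return softmax([dot(z, t) / tau for t in nouns])

def ntk_affinity(z_i, z_j, s_i, s_j, tau):
    visual = dot(z_i, z_j)      # visual proximity <z_i, z_j>
    overlap = dot(s_i, s_j)     # text-induced overlap sum_k s_i[k] s_j[k]
    return visual * overlap / tau ** 2

# Hypothetical toy data: two similar images, one dissimilar image, two nouns.
images = [normalize(v) for v in ([1.0, 0.1], [0.9, 0.2], [0.2, 1.0])]
nouns = [normalize(v) for v in ([1.0, 0.0], [0.0, 1.0])]
tau = 0.5

scores = [noun_scores(z, nouns, tau) for z in images]
K = [[ntk_affinity(images[i], images[j], scores[i], scores[j], tau)
      for j in range(len(images))] for i in range(len(images))]
```

In this toy setup the two images that are close both visually and in their noun scores receive a larger affinity than the dissimilar pair, which is the amplification effect the abstract describes.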

Weaknesses

1. I wonder if negative weights violate standard SC assumptions. Since CLIP features lie on the unit sphere, $\langle z_i,z_j\rangle\in[-1,1]$ (Eq. (1)). With Eq. (8), the NTK inherits negative values whenever the cosine is negative:

$$ K_{\theta_0}(z_i,z_j)=\frac{1}{\tau^2}\langle z_i,z_j\rangle \sum_k s_i[k]\,s_j[k]. $$

Yet the method sets $A_{ij}=K_{\theta_0}$ on mutual $q$-NN (Eq. (6)) without rectification or signed-graph treatment. Normalized-cut SC (Eq. (2)) typically pres
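The concern about negative weights can be made concrete with a small sketch. It assumes one common reading of a mutual $q$-NN graph (each node keeps its $q$ largest kernel values, and an edge survives only if both endpoints select each other); the paper's Eq. (6) may differ in detail, and the kernel matrix below is hypothetical, chosen only so that one point has negative kernel values with the others.

```python
def mutual_qnn_affinity(K, q):
    n = len(K)
    # Top-q neighbours of each node by kernel value, excluding the node itself.
    nbrs = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: K[i][j], reverse=True)
        nbrs.append(set(order[:q]))
    # Keep K[i][j] only where i and j select each other (no rectification).
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in nbrs[i]:
            if i in nbrs[j]:
                A[i][j] = K[i][j]
    return A

def degrees(A):
    # Row sums of the affinity matrix, i.e. the diagonal of D.
    return [sum(row) for row in A]

# Hypothetical kernel values: node 2 has a negative cosine with nodes 0 and 1.
K_toy = [[0.0, 2.0, -1.5],
         [2.0, 0.0, -1.0],
         [-1.5, -1.0, 0.0]]

deg_q2 = degrees(mutual_qnn_affinity(K_toy, q=2))  # negative edges are kept
deg_q1 = degrees(mutual_qnn_affinity(K_toy, q=1))  # only the (0, 1) edge is kept
```

With `q=2` every pair is mutually selected, so node 2 ends up with a negative degree and the $D^{-1/2}$ factor in a normalized Laplacian is no longer real-valued. With `q=1` the negative entries are pruned, which suggests the issue depends on how `q` compares with the number of positively related neighbours.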

Reviewer 03 · Rating 8 · Confidence 3

Strengths

+ The integration of neural tangent kernel theory with vision-language representations for spectral clustering is novel. The idea of anchoring the NTK with positive nouns to create semantically aware affinity matrices represents a way to incorporate linguistic priors into clustering.
+ The paper includes a comprehensive experimental validation, with testing performed on 16 benchmarks. The consistent improvements suggest that the method is generalizable.
+ The presentation is excellent, the paper

Weaknesses

- The method relies on positive nouns that are semantically close to the images of interest. This could limit practical applicability and introduce some bias. The paper should provide some analysis of this.
- A minor weakness is that the main paper is condensed and its related-work discussion is limited.


Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis