Vision Transformer for Contrastive Clustering
Hua-Bao Ling, Bowen Zhu, Dong Huang, Ding-Hua Chen, Chang-Dong Wang, Jian-Huang Lai

TL;DR
This paper introduces VTCC, a novel deep clustering method that combines Vision Transformer and contrastive learning to improve image clustering performance and stability, addressing limitations of previous CNN-based approaches.
Contribution
To the authors' knowledge, VTCC is the first method to unify Vision Transformer and contrastive learning for image clustering. It adds a convolutional stem to stabilize training and uses two projectors, one for instance-level learning and one for cluster-level learning.
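The dual-projector idea can be illustrated with a minimal sketch. It assumes the backbone (a ViT with a convolutional stem) has already produced one feature vector per image. Every name here (`dual_projectors`, `w_instance`, `w_cluster`) and all the dimensions are hypothetical, and single linear maps stand in for whatever projector architecture the paper actually uses. The instance head is assumed to produce unit-length embeddings for instance-level contrast, and the cluster head is assumed to produce a softmax over clusters from which assignments can be read.

```python
import math
import random


def linear(batch, weights):
    """Multiply a batch (list of rows) by a weight matrix of shape (in_dim, out_dim)."""
    columns = list(zip(*weights))
    return [[sum(x * w for x, w in zip(row, col)) for col in columns] for row in batch]


def softmax(row):
    """Numerically stable softmax over one row."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def l2_normalize(row):
    """Scale a row to unit length (all-zero rows are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in row)) or 1.0
    return [v / norm for v in row]


def dual_projectors(features, w_instance, w_cluster):
    """Hypothetical heads applied to backbone features (assumed precomputed)."""
    # Instance head: one unit-length embedding per image.
    z = [l2_normalize(r) for r in linear(features, w_instance)]
    # Cluster head: one soft cluster-assignment distribution per image.
    y = [softmax(r) for r in linear(features, w_cluster)]
    return z, y


# Toy stand-in for backbone output: 4 images, 8-dimensional features.
rng = random.Random(0)
batch_size, feat_dim, inst_dim, n_clusters = 4, 8, 3, 2
features = [[rng.gauss(0, 1) for _ in range(feat_dim)] for _ in range(batch_size)]
w_instance = [[rng.gauss(0, 1) for _ in range(inst_dim)] for _ in range(feat_dim)]
w_cluster = [[rng.gauss(0, 1) for _ in range(n_clusters)] for _ in range(feat_dim)]

z, y = dual_projectors(features, w_instance, w_cluster)
# Hard cluster labels are the argmax of each soft assignment.
labels = [max(range(n_clusters), key=row.__getitem__) for row in y]
```

Under these assumptions, cluster labels come directly from the cluster head's output, so no separate clustering step such as k-means is needed afterwards.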
Findings
VTCC outperforms state-of-the-art methods on eight datasets.
The approach demonstrates stable training from scratch.
It achieves superior clustering accuracy and robustness.
Abstract
Vision Transformer (ViT) has shown its advantages over the convolutional neural network (CNN) with its ability to capture global long-range dependencies for visual representation learning. Contrastive learning is another popular recent research topic. While previous contrastive learning works are mostly based on CNNs, some recent studies have attempted to combine ViT and contrastive learning for enhanced self-supervised learning. Despite the considerable progress, these combinations of ViT and contrastive learning mostly focus on instance-level contrastiveness, often overlooking global contrastiveness and lacking the ability to directly learn the clustering result (e.g., for images). In view of this, this paper presents a novel deep clustering approach termed Vision Transformer for Contrastive Clustering (VTCC), which for the first time, to our knowledge,…
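As a rough illustration of the instance-level contrastiveness mentioned in the abstract, the sketch below implements an NT-Xent loss of the kind popularized by SimCLR. The abstract is truncated here and does not state which loss VTCC uses, so this choice is an assumption. The function name `nt_xent`, the default temperature, and the toy vectors are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def nt_xent(view_a, view_b, temperature=0.5):
    """NT-Xent loss over two augmented views of the same batch.

    view_a[i] and view_b[i] are embeddings of two augmentations of image i
    (the positive pair); every other embedding in the batch is a negative.
    """
    reps = view_a + view_b
    n = len(view_a)
    total = 0.0
    for i in range(2 * n):
        positive = (i + n) % (2 * n)
        # Similarities to every other embedding, including the positive.
        logits = [cosine(reps[i], reps[j]) / temperature
                  for j in range(2 * n) if j != i]
        pos_logit = cosine(reps[i], reps[positive]) / temperature
        # Log-sum-exp of the logits, computed in a numerically stable way.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(v - peak) for v in logits))
        total += log_denominator - pos_logit
    return total / (2 * n)


# Toy check: matching views should give a lower loss than swapped views.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.1], [0.1, 1.0]]
swapped = [[0.1, 1.0], [1.0, 0.1]]
loss_aligned = nt_xent(anchors, aligned)
loss_swapped = nt_xent(anchors, swapped)
```

In this sketch the loss falls as the two views of an image move closer together relative to the other images in the batch. That is the instance-level signal. A cluster-level counterpart would contrast cluster representations rather than individual images.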
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Video Surveillance and Tracking Methods · Advanced Neural Network Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Label Smoothing · Dropout · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Dense Connections
