Vision Transformer for Contrastive Clustering

Hua-Bao Ling; Bowen Zhu; Dong Huang; Ding-Hua Chen; Chang-Dong Wang; Jian-Huang Lai

arXiv:2206.12925 · cs.CV · July 12, 2022 · 6 cites

TL;DR

This paper introduces VTCC, a novel deep clustering method that combines Vision Transformer and contrastive learning to improve image clustering performance and stability, addressing limitations of previous CNN-based approaches.

Contribution

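VTCC unifies Vision Transformer and contrastive learning for image clustering, which the authors describe as the first such combination to their knowledge. It adds a convolutional stem for training stability and uses dual projectors, one for instance-level learning and one for cluster-level learning.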
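The dual-projector idea can be illustrated with a minimal, standard-library sketch. It is not the authors' implementation: it assumes an NT-Xent-style loss, a temperature of 0.5, an unweighted sum of the two losses, and a cluster-level loss that contrasts the columns of the soft-assignment matrix, as in earlier contrastive clustering work. None of these details are stated on this page. The names nt_xent, columns, and hard_assignments are hypothetical, and the toy vectors stand in for the outputs of the two projectors on two augmented views of the same three images.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def nt_xent(view_a, view_b, temperature=0.5):
    """NT-Xent-style loss (assumed form). view_a[i] and view_b[i] are a
    positive pair; every other vector in either view is a negative."""
    vectors = view_a + view_b
    n = len(view_a)
    total = 0.0
    for i, anchor in enumerate(vectors):
        positive = (i + n) % (2 * n)
        pos_logit = cosine(anchor, vectors[positive]) / temperature
        # The denominator covers all vectors except the anchor itself.
        denom = sum(math.exp(cosine(anchor, other) / temperature)
                    for j, other in enumerate(vectors) if j != i)
        total += math.log(denom) - pos_logit
    return total / (2 * n)


def columns(matrix):
    """Transpose a row-major matrix so each cluster becomes one vector."""
    return [list(col) for col in zip(*matrix)]


def hard_assignments(soft):
    """Pick the most probable cluster for each sample."""
    return [row.index(max(row)) for row in soft]


# Toy instance-projector outputs for two views (3 images, 3 dimensions).
z_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
z_b = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]

# Toy cluster-projector outputs: soft assignments over 2 clusters.
p_a = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
p_b = [[0.8, 0.2], [0.9, 0.1], [0.2, 0.8]]

instance_loss = nt_xent(z_a, z_b)                    # rows are instances
cluster_loss = nt_xent(columns(p_a), columns(p_b))   # columns are clusters
total_loss = instance_loss + cluster_loss            # unweighted sum (assumption)
labels = hard_assignments(p_a)
```

In this sketch the same loss function serves both heads; only the axis being contrasted changes. The cluster head's soft assignments also give the clustering result directly through an argmax, which is one way to read the abstract's point about learning the clustering result directly.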

Findings

01  VTCC outperforms state-of-the-art methods on eight datasets.

02  The approach demonstrates stable training from scratch.

03  It achieves superior clustering accuracy and robustness.

Abstract

Vision Transformer (ViT) has shown advantages over the convolutional neural network (CNN) through its ability to capture global long-range dependencies for visual representation learning. Contrastive learning is another popular recent research topic. While previous contrastive learning works are mostly based on CNNs, some recent studies have attempted to combine ViT and contrastive learning for enhanced self-supervised learning. Despite considerable progress, these combinations of ViT and contrastive learning mostly focus on instance-level contrastiveness; they often overlook global contrastiveness and lack the ability to directly learn the clustering result (e.g., for images). In view of this, this paper presents a novel deep clustering approach termed Vision Transformer for Contrastive Clustering (VTCC), which for the first time, to our knowledge,…

Code & Models

Repositories

jackkoling/vtcc (PyTorch, official)

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Video Surveillance and Tracking Methods · Advanced Neural Network Applications

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Label Smoothing · Dropout · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Dense Connections