ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers
Yutong Xie, Jianpeng Zhang, Yong Xia, Anton van den Hengel, and Qi Wu

TL;DR
ClusTR introduces a clustering-based sparse self-attention mechanism for vision transformers, significantly reducing computational complexity while maintaining high accuracy on vision tasks.
Contribution
The paper proposes a novel clustering-guided sparse attention method and extends it from single-scale to multi-scale clustering, achieving state-of-the-art results with fewer parameters and lower computational cost.
Findings
Achieves 83.2% Top-1 accuracy on ImageNet with fewer parameters.
Reduces computational cost compared to dense self-attention.
Effective for dense prediction tasks with multi-scale clustering.
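As a rough illustration of the cost reduction listed above, the comparison below uses notation that is assumed here rather than taken from the paper: N is the number of tokens, d the embedding dimension, and K the number of clustered key/value tokens. It ignores the cost of the clustering step itself.

```latex
\underbrace{\mathcal{O}(N^{2} d)}_{\text{dense self-attention}}
\quad\longrightarrow\quad
\underbrace{\mathcal{O}(N K d)}_{\text{attention over } K \text{ clustered tokens}},
\qquad K \ll N
```

Under these assumptions the saving grows with N, which is consistent with the emphasis on dense prediction, where token counts are large.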
Abstract
Although Transformers have successfully transitioned from their language modelling origins to image-based applications, their quadratic computational complexity remains a challenge, particularly for dense prediction. In this paper we propose a content-based sparse attention method, as an alternative to dense self-attention, aiming to reduce the computational complexity while retaining the ability to model long-range dependencies. Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count. The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost. We further extend the clustering-guided attention from single-scale to multi-scale, which is conducive to dense prediction tasks. We label the proposed Transformer architecture ClusTR,…
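The cluster-then-aggregate idea described in the abstract can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the authors' implementation: plain k-means on the key tokens and mean-pooling within each cluster are assumptions made here, and the function names (`kmeans_assign`, `aggregate_by_cluster`, `clustered_attention`) are hypothetical.

```python
import numpy as np

def kmeans_assign(x, num_clusters, num_iters=10, seed=0):
    """Plain k-means on the rows of x (assumed clustering choice).
    Returns one cluster index per row. Requires num_iters >= 1."""
    rng = np.random.default_rng(seed)
    # Initialise centres from randomly chosen tokens (fancy indexing copies).
    centers = x[rng.choice(x.shape[0], size=num_clusters, replace=False)]
    for _ in range(num_iters):
        # Squared Euclidean distance of every token to every centre: (N, K).
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        for c in range(num_clusters):
            members = x[assign == c]
            if len(members) > 0:
                centers[c] = members.mean(axis=0)
    return assign

def aggregate_by_cluster(tokens, assign, num_clusters):
    """Mean-pool the tokens of each cluster; empty clusters are dropped."""
    sums = np.zeros((num_clusters, tokens.shape[1]))
    np.add.at(sums, assign, tokens)  # unbuffered scatter-add per cluster
    counts = np.bincount(assign, minlength=num_clusters)
    keep = counts > 0
    return sums[keep] / counts[keep, None]

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def clustered_attention(q, k, v, num_clusters):
    """Queries attend to clustered keys/values instead of all N tokens."""
    assign = kmeans_assign(k, num_clusters)               # cluster the keys
    k_c = aggregate_by_cluster(k, assign, num_clusters)   # (M, d), M <= K
    v_c = aggregate_by_cluster(v, assign, num_clusters)   # same grouping
    scores = q @ k_c.T / np.sqrt(q.shape[-1])             # (N, M), not (N, N)
    weights = softmax(scores, axis=-1)
    return weights @ v_c, weights
```

Calling `clustered_attention(q, k, v, num_clusters=16)` on arrays of shape `(196, 32)` yields an output of shape `(196, 32)` while the attention matrix has at most 16 columns instead of 196. Values are pooled using the key-based assignment so that each clustered key stays paired with the values of the same tokens.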
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Label Smoothing · Softmax · Layer Normalization · Dropout · Dense Connections · Adam · Position-Wise Feed-Forward Layer
