ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers

Yutong Xie, Jianpeng Zhang, Yong Xia, Anton van den Hengel, and Qi Wu

arXiv:2208.13138·cs.CV·August 30, 2022·6 cites

Open Access

TL;DR

ClusTR replaces dense self-attention in vision transformers with a content-based sparse attention: key and value tokens are clustered and aggregated, which lowers computational cost while retaining the ability to model long-range dependencies.

Contribution

The paper proposes a novel clustering-guided sparse attention method and extends it to multi-scale, achieving state-of-the-art results with fewer parameters and lower computational costs.

Findings

01  Achieves 83.2% Top-1 accuracy on ImageNet with fewer parameters.

02  Reduces computational cost compared to dense self-attention.

03  Effective for dense prediction tasks with multi-scale clustering.

Abstract

Although Transformers have successfully transitioned from their language modelling origins to image-based applications, their quadratic computational complexity remains a challenge, particularly for dense prediction. In this paper we propose a content-based sparse attention method, as an alternative to dense self-attention, aiming to reduce the computational complexity while retaining the ability to model long-range dependencies. Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count. The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost. We further extend the clustering-guided attention from single-scale to multi-scale, which is conducive to dense prediction tasks. We label the proposed Transformer architecture ClusTR,…
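The idea of clustering and aggregating key and value tokens can be illustrated with a minimal single-head NumPy sketch. This is not the authors' implementation: the abstract does not specify the clustering algorithm, so plain k-means on the key features, mean aggregation within each cluster, and all function names (`kmeans_assign`, `aggregate`, `clustered_attention`) and sizes below are assumptions chosen for illustration only.

```python
import numpy as np

def kmeans_assign(x, num_clusters, iters=5, seed=0):
    """Assumed clustering step: a few plain k-means iterations over token features.
    Returns one cluster index per token."""
    rng = np.random.default_rng(seed)
    # Initialise centres from randomly chosen tokens (fancy indexing copies).
    centers = x[rng.choice(x.shape[0], size=num_clusters, replace=False)]
    for _ in range(iters):
        # Squared Euclidean distance from every token to every centre: (N, M)
        d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        assign = d2.argmin(axis=1)
        for c in range(num_clusters):
            members = x[assign == c]
            if len(members) > 0:
                centers[c] = members.mean(axis=0)
    return assign

def aggregate(tokens, assign, num_clusters):
    """Average the tokens of each non-empty cluster (assumed aggregation rule)."""
    onehot = np.eye(num_clusters)[assign]   # (N, M) membership matrix
    counts = onehot.sum(axis=0)             # tokens per cluster
    keep = counts > 0                       # drop clusters that ended up empty
    sums = onehot.T @ tokens                # (M, d) per-cluster sums
    return sums[keep] / counts[keep, None]

def clustered_attention(q, k, v, num_clusters):
    """Queries attend to clustered keys/values instead of all N tokens."""
    assign = kmeans_assign(k, num_clusters)          # cluster on key features
    k_c = aggregate(k, assign, num_clusters)         # (M', d), M' <= M
    v_c = aggregate(v, assign, num_clusters)         # same grouping for values
    scores = q @ k_c.T / np.sqrt(q.shape[-1])        # (N, M') instead of (N, N)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v_c, weights

# Hypothetical sizes: 196 tokens (a 14x14 patch grid), 64 channels, 16 clusters.
rng = np.random.default_rng(0)
n, d, m = 196, 64, 16
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
out, weights = clustered_attention(q, k, v, num_clusters=m)
```

In this sketch every query still attends to a summary of every region of the input, but the score matrix has at most `m` columns rather than `n`, so the attention product scales with `n * m` instead of `n * n`. The clustering step adds its own cost, which this toy version does not try to minimise.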

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Label Smoothing · Softmax · Layer Normalization · Dropout · Dense Connections · Adam · Position-Wise Feed-Forward Layer