Sparsifiner: Learning Sparse Instance-Dependent Attention for Efficient Vision Transformers
Cong Wei, Brendan Duke, Ruowei Jiang, Parham Aarabi, Graham W. Taylor, Florian Shkurti

TL;DR
Sparsifiner introduces a novel method for learning instance-dependent, sparse attention patterns in Vision Transformers, significantly reducing computational costs while maintaining high accuracy by leveraging a lightweight connectivity predictor.
Contribution
The paper proposes a new approach to learn unstructured, instance-dependent attention masks in ViT, enabling efficient sparse attention with minimal accuracy loss.
Findings
Reduces FLOPs by 48% to 69% with an accuracy drop of less than 0.4%.
Achieves over 60% FLOPs reduction by combining attention and token sparsity.
Outperforms fixed-pattern sparsity methods in Pareto efficiency.
Abstract
Vision Transformers (ViT) have shown their competitive advantages performance-wise compared to convolutional neural networks (CNNs) though they often come with high computational costs. To this end, previous methods explore different attention patterns by limiting a fixed number of spatially nearby tokens to accelerate the ViT's multi-head self-attention (MHSA) operations. However, such structured attention patterns limit the token-to-token connections to their spatial relevance, which disregards learned semantic connections from a full attention mask. In this work, we propose a novel approach to learn instance-dependent attention patterns, by devising a lightweight connectivity predictor module to estimate the connectivity score of each pair of tokens. Intuitively, two tokens have high connectivity scores if the features are considered relevant either spatially or semantically. As each…
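The idea in the abstract (score each pair of tokens, then attend only over the highest-scoring pairs) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract excerpt does not say how the connectivity predictor is built, so the low-dimensional dot-product predictor, the per-query top-k rule, and every name here (`predict_connectivity_mask`, `sparse_attention`, `keep`, `w_q_low`, `w_k_low`) are assumptions made for illustration. It also covers a single head with random weights, and it computes dense scores before masking, so it shows the masking logic rather than the FLOPs savings.

```python
import numpy as np

def predict_connectivity_mask(x, w_q_low, w_k_low, keep):
    """Hypothetical predictor: score token pairs in a small projected space."""
    q_low = x @ w_q_low                      # (n, r) with r much smaller than d
    k_low = x @ w_k_low                      # (n, r)
    conn = q_low @ k_low.T                   # (n, n) connectivity scores
    # Assumed sparsification rule: keep the `keep` best keys for each query.
    top = np.argpartition(-conn, keep - 1, axis=-1)[:, :keep]
    mask = np.zeros(conn.shape, dtype=bool)
    np.put_along_axis(mask, top, True, axis=-1)
    return mask

def sparse_attention(x, w_query, w_key, w_value, mask):
    """Single-head attention restricted to the pairs allowed by `mask`."""
    q, k, v = x @ w_query, x @ w_key, x @ w_value
    scores = (q @ k.T) / np.sqrt(q.shape[-1])
    # Dense scores are computed here for clarity only; an efficient version
    # would evaluate just the entries selected by the mask.
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Toy input: 8 tokens of width 16, predictor rank 4, 3 connections per token.
rng = np.random.default_rng(0)
n, d, r, keep = 8, 16, 4, 3
x = rng.standard_normal((n, d))
w_q_low, w_k_low = rng.standard_normal((d, r)), rng.standard_normal((d, r))
w_query, w_key, w_value = (rng.standard_normal((d, d)) for _ in range(3))

mask = predict_connectivity_mask(x, w_q_low, w_k_low, keep)
out, weights = sparse_attention(x, w_query, w_key, w_value, mask)
```

Because the mask is computed from `x`, a different input produces a different set of connections, which is what distinguishes an instance-dependent pattern from a fixed local window.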
Taxonomy
Topics: Advanced Neural Network Applications · Visual Attention and Saliency Detection · Domain Adaptation and Few-Shot Learning
