SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications
Abdelrahman Shaker, Muhammad Maaz, Hanoona Rasheed, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan

TL;DR
SwiftFormer introduces an efficient additive attention mechanism that replaces quadratic matrix operations with linear element-wise multiplications, enabling high-accuracy, real-time vision models suitable for mobile devices.
Contribution
The paper proposes a novel additive attention mechanism that reduces computational complexity and can be used throughout the network, improving speed and accuracy for mobile vision applications.
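To make the complexity claim concrete, the two formulations can be written side by side. This is a sketch under stated assumptions, not the paper's exact notation: $n$ is the number of tokens, $d$ the embedding width, $w_a$ a learned vector, $\hat{Q}$ the (normalized) query matrix, and $T$ a linear layer; the symbol names are chosen here for illustration. Costs exclude the input and output linear projections, which are $\mathcal{O}(n d^{2})$ in both cases.

```latex
% Standard self-attention: the n x n score matrix makes the cost quadratic in n.
\[
\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V,
\qquad \text{cost } \mathcal{O}(n^{2} d)
\]

% Additive attention (sketch): w_a scores each query row, the scores pool the
% queries into one global query q, and q meets each key row element-wise.
\[
\alpha = \frac{\hat{Q}\, w_a}{\sqrt{d}} \in \mathbb{R}^{n}, \qquad
q = \sum_{i=1}^{n} \alpha_i \, \hat{Q}_i \in \mathbb{R}^{d}, \qquad
\hat{x} = \hat{Q} + T\!\left(K \odot q\right),
\qquad \text{cost } \mathcal{O}(n d)
\]
```

Here $K \odot q$ multiplies every row of $K$ element-wise by the same vector $q$, so no token-by-token score matrix is ever formed.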
Findings
- The small variant achieves 78.5% top-1 accuracy on ImageNet-1K.
- It runs with 0.8 ms latency on an iPhone 14.
- It is more accurate and 2× faster than MobileViT-v2.
Abstract
Self-attention has become a de facto choice for capturing global context in various vision applications. However, its quadratic computational complexity with respect to image resolution limits its use in real-time applications, especially for deployment on resource-constrained mobile devices. Although hybrid approaches have been proposed to combine the advantages of convolutions and self-attention for a better speed-accuracy trade-off, the expensive matrix multiplication operations in self-attention remain a bottleneck. In this work, we introduce a novel efficient additive attention mechanism that effectively replaces the quadratic matrix multiplication operations with linear element-wise multiplications. Our design shows that the key-value interaction can be replaced with a linear layer without sacrificing any accuracy. Unlike previous state-of-the-art methods, our efficient formulation…
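The mechanism described in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function and parameter names (`additive_attention`, `w_q`, `w_k`, `w_a`, `w_proj`, `w_out`) are hypothetical, and the normalization steps and the query residual are assumptions made for illustration. The sketch shows the two points the abstract makes: tokens interact through element-wise products with a single pooled query rather than an n × n score matrix, and a linear layer takes the place of the key-value interaction.

```python
import math
import random


def linear(rows, weight):
    """Apply a bias-free linear layer: (n x d_in) rows times a (d_in x d_out) weight."""
    columns = list(zip(*weight))
    return [[sum(a * b for a, b in zip(row, col)) for col in columns] for row in rows]


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def additive_attention(x, w_q, w_k, w_a, w_proj, w_out):
    """Sketch of additive attention over n tokens of width d (all names hypothetical)."""
    d = len(w_a)
    q = [l2_normalize(row) for row in linear(x, w_q)]  # queries, n x d
    k = [l2_normalize(row) for row in linear(x, w_k)]  # keys, n x d

    # One scalar score per token: n dot products of length d, no n x n matrix.
    scores = [sum(a * b for a, b in zip(row, w_a)) / math.sqrt(d) for row in q]
    alpha = l2_normalize(scores)  # assumption: normalize scores across tokens

    # Pool all queries into a single global query vector of width d.
    global_q = [sum(a * row[j] for a, row in zip(alpha, q)) for j in range(d)]

    # Element-wise product of the global query with every key: O(n * d).
    mixed = [[g * kv for g, kv in zip(global_q, row)] for row in k]

    # A linear layer stands in for the key-value interaction; add a query residual.
    projected = linear(mixed, w_proj)
    residual = [[p + qv for p, qv in zip(p_row, q_row)]
                for p_row, q_row in zip(projected, q)]
    return linear(residual, w_out)


def rand_matrix(rows, cols):
    """Random Gaussian matrix as a list of rows (illustrative weights only)."""
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


# Usage with random weights: 4 tokens, width 3.
random.seed(0)
n, d = 4, 3
x = rand_matrix(n, d)
out = additive_attention(x, rand_matrix(d, d), rand_matrix(d, d),
                         [random.gauss(0, 1) for _ in range(d)],
                         rand_matrix(d, d), rand_matrix(d, d))
print(len(out), len(out[0]))  # prints: 4 3
```

Every step after the two input projections touches each token once, so the token-mixing cost grows linearly with the number of tokens; a practical implementation would use batched tensor operations rather than Python lists.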
Code & Models
- 🤗 MBZUAI/swiftformer-s (model) · 518 downloads · ♡ 1
- 🤗 MBZUAI/swiftformer-l1 (model) · 9 downloads · ♡ 1
- 🤗 MBZUAI/swiftformer-l3 (model) · 12 downloads · ♡ 3
- 🤗 timm/swiftformer_l1.dist_in1k (model) · 322 downloads
- 🤗 timm/swiftformer_l3.dist_in1k (model) · 50 downloads
- 🤗 timm/swiftformer_s.dist_in1k (model) · 138 downloads
- 🤗 timm/swiftformer_xs.dist_in1k (model) · 612 downloads
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Brain Tumor Detection and Classification
Methods: Tanh Activation · Linear Layer
