ViG: Linear-complexity Visual Sequence Learning with Gated Linear Attention
Bencheng Liao, Xinggang Wang, Lianghui Zhu, Qian Zhang, Chang Huang

TL;DR
ViG adapts Gated Linear Attention (GLA) to vision models, matching Vision Transformer accuracy with fewer parameters and FLOPs and running faster across a range of image resolutions.
Contribution
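The paper adapts Gated Linear Attention to vision through direction-wise gating and 2D gating locality injection, and provides a hardware-aware implementation that merges forward and backward scanning into a single kernel, reducing latency and memory cost while maintaining accuracy.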
Findings
ViG-S matches DeiT-B accuracy with 73% fewer parameters.
At 1024x1024 resolution, ViG-T surpasses DeiT-T by 20.7% top-1 accuracy.
ViG-S runs 2x faster than DeiT-B on 224x224 images, and ViG-T runs 4.8x faster than DeiT-T at 1024x1024.
Abstract
Recently, linear complexity sequence modeling networks have achieved modeling capabilities similar to Vision Transformers on a variety of computer vision tasks, while using fewer FLOPs and less memory. However, their advantage in terms of actual runtime speed is not significant. To address this issue, we introduce Gated Linear Attention (GLA) for vision, leveraging its superior hardware-awareness and efficiency. We propose direction-wise gating to capture 1D global context through bidirectional modeling and a 2D gating locality injection to adaptively inject 2D local details into 1D global context. Our hardware-aware implementation further merges forward and backward scanning into a single kernel, enhancing parallelism and reducing memory cost and latency. The proposed model, ViG, offers a favorable trade-off in accuracy, parameters, and FLOPs on ImageNet and downstream tasks,…
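To make the bidirectional gated recurrence described in the abstract concrete, here is a minimal NumPy sketch. It assumes the standard GLA recurrence S_t = diag(alpha_t) S_{t-1} + k_t^T v_t with output o_t = q_t S_t, uses a separate gate per scan direction to mimic direction-wise gating, and combines the two directions by simple summation. The function names `gla_scan` and `bidirectional_gla`, the summation, and the single-head layout are illustrative assumptions. This is not the paper's fused hardware-aware kernel, and it omits the 2D gating locality injection.

```python
import numpy as np

def gla_scan(q, k, v, alpha):
    """Naive recurrent gated linear attention over a 1D token sequence.

    q, k, alpha: arrays of shape (T, d_k); v: array of shape (T, d_v).
    alpha holds per-token, per-channel forget gates in (0, 1].
    Cost is linear in T because only a (d_k, d_v) state is carried.
    """
    T, d_k = k.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v))           # recurrent state
    out = np.empty((T, d_v))
    for t in range(T):
        # Decay the state channel-wise, then add the current key-value outer product.
        S = alpha[t][:, None] * S + np.outer(k[t], v[t])
        out[t] = q[t] @ S
    return out

def bidirectional_gla(q, k, v, alpha_fwd, alpha_bwd):
    """Hypothetical bidirectional variant: one gate per scan direction.

    The backward pass scans the reversed sequence and is flipped back.
    Summing the two outputs is an assumption made for illustration;
    note that the current token then contributes to both directions.
    """
    fwd = gla_scan(q, k, v, alpha_fwd)
    bwd = gla_scan(q[::-1], k[::-1], v[::-1], alpha_bwd[::-1])[::-1]
    return fwd + bwd

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, d_k, d_v = 8, 4, 4              # e.g. 8 flattened image patches
    q, k = rng.normal(size=(T, d_k)), rng.normal(size=(T, d_k))
    v = rng.normal(size=(T, d_v))
    # Sigmoid keeps the gates strictly between 0 and 1.
    gate_f = 1.0 / (1.0 + np.exp(-rng.normal(size=(T, d_k))))
    gate_b = 1.0 / (1.0 + np.exp(-rng.normal(size=(T, d_k))))
    print(bidirectional_gla(q, k, v, gate_f, gate_b).shape)
```

In this sketch the two scans run as separate Python loops. The abstract's point is that the actual implementation merges both directions into a single kernel, which is where the reported latency and memory savings come from.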
Taxonomy
Topics: Advanced Vision and Imaging · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Residual Connection · Position-Wise Feed-Forward Layer · Multi-Head Attention · Dropout · Dense Connections
