Fast Vision Transformers with HiLo Attention
Zizheng Pan, Jianfei Cai, Bohan Zhuang

TL;DR
This paper introduces LITv2, a fast Vision Transformer built on HiLo, a new attention mechanism that separates high- and low-frequency information to improve both speed and accuracy across vision tasks.
Contribution
The paper proposes HiLo, a new self-attention mechanism that disentangles high- and low-frequency patterns, and LITv2, a ViT designed around speed measured directly on the target platform rather than around FLOPs.
Findings
HiLo attention outperforms existing attention mechanisms in speed and memory efficiency.
LITv2 runs inference faster than comparable models on both GPUs and CPUs.
LITv2 performs well across vision tasks including classification, detection, and segmentation.
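Speed findings like these are usually reported as throughput, in images per second, measured on the device of interest. The following is a minimal sketch of such a measurement, not the authors' benchmark script: `measure_throughput` and `dummy_model` are hypothetical names, and the stand-in "model" is plain Python arithmetic so that the example runs without any deep learning library.

```python
import time


def measure_throughput(model_fn, batch, warmup=3, iters=10):
    """Return images per second for model_fn on one fixed batch.

    Hypothetical helper for illustration; model_fn is any callable
    that takes a batch (a list of images) and returns its outputs.
    """
    # Warm-up runs so one-off costs (caches, lazy setup) are not timed.
    for _ in range(warmup):
        model_fn(batch)

    start = time.perf_counter()
    for _ in range(iters):
        model_fn(batch)
    elapsed = time.perf_counter() - start

    # Total images processed divided by wall-clock seconds.
    return len(batch) * iters / elapsed


def dummy_model(batch):
    # Stand-in for a real forward pass (assumption: not an actual ViT).
    return [sum(x * x for x in image) for image in batch]


batch = [[0.5] * 1024 for _ in range(8)]  # 8 fake "images"
images_per_second = measure_throughput(dummy_model, batch)
print(f"{images_per_second:.1f} images/s")
```

With a real model on a GPU, kernels run asynchronously, so the device would need to be synchronized before each timer read for the number to be meaningful. Two models with equal FLOPs can still differ on this metric, which is the gap the abstract refers to.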
Abstract
Vision Transformers (ViTs) have triggered the most recent and significant breakthroughs in computer vision. Their efficient designs are mostly guided by the indirect metric of computational complexity, i.e., FLOPs, which however has a clear gap with the direct metric such as throughput. Thus, we propose to use the direct speed evaluation on the target platform as the design principle for efficient ViTs. Particularly, we introduce LITv2, a simple and effective ViT which performs favourably against the existing state-of-the-art methods across a spectrum of different model sizes with faster speed. At the core of LITv2 is a novel self-attention mechanism, which we dub HiLo. HiLo is inspired by the insight that high frequencies in an image capture local fine details and low frequencies focus on global structures, whereas a multi-head self-attention layer neglects the characteristic of…
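The frequency intuition in the abstract can be illustrated on a tiny grid of numbers. The sketch below is not HiLo attention itself. It only shows one simple way to split an image into a low-frequency part (the mean of each local window, which keeps coarse structure) and a high-frequency part (what remains, which keeps fine local detail). The function name `split_frequencies`, the 2x2 window size, and the use of window averaging as the low-pass filter are assumptions made for illustration.

```python
def split_frequencies(image, window=2):
    """Split a 2D grid into low- and high-frequency parts.

    low  : each value replaced by the mean of its window (coarse structure)
    high : image minus low (fine detail inside each window)
    Illustrative helper only; the names and window size are assumptions.
    """
    height, width = len(image), len(image[0])
    assert height % window == 0 and width % window == 0

    low = [[0.0] * width for _ in range(height)]
    for top in range(0, height, window):
        for left in range(0, width, window):
            rows = range(top, top + window)
            cols = range(left, left + window)
            # Average pooling over one non-overlapping window.
            mean = sum(image[r][c] for r in rows for c in cols) / (window * window)
            for r in rows:
                for c in cols:
                    low[r][c] = mean

    # The residual holds the detail that averaging removed.
    high = [[image[r][c] - low[r][c] for c in range(width)] for r in range(height)]
    return low, high


image = [
    [1.0, 3.0, 10.0, 10.0],
    [1.0, 3.0, 10.0, 10.0],
    [5.0, 5.0, 0.0, 4.0],
    [5.0, 5.0, 4.0, 0.0],
]
low, high = split_frequencies(image)
print(low[0])   # [2.0, 2.0, 10.0, 10.0]
print(high[0])  # [-1.0, 1.0, 0.0, 0.0]
```

Flat windows, such as the top-right block of tens, have an all-zero high-frequency part, while windows containing an edge or a checker pattern carry their detail in `high`. Adding `low` and `high` element by element recovers the original grid.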
Taxonomy
Topics: Advanced Neural Network Applications · CCD and CMOS Imaging Sensors · Advanced Image and Video Retrieval Techniques
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
