QuadTree Attention for Vision Transformers
Shitao Tang, Jiahui Zhang, Siyu Zhu, Ping Tan

TL;DR
QuadTree Attention is a linear-complexity attention mechanism for vision transformers. It makes dense prediction tasks efficient and achieves state-of-the-art results across multiple vision tasks.
Contribution
The paper proposes a quadtree-based attention method that reduces the computational complexity of attention from quadratic to linear in the number of tokens, improving both the efficiency and the performance of vision transformers.
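A rough cost count shows where the linear bound can come from. This is a sketch under stated assumptions, not the paper's own derivation: the symbols N (tokens at the finest level), L (number of pyramid levels), K (patches kept per query at each level) and N_c (tokens at the coarsest level) are introduced here for illustration. Each level is assumed to have a quarter of the tokens of the level below it, and each query is assumed to score only the 4K children of its parent's K selected patches.

```latex
% Assumed setup: level l has N / 4^l tokens (l = 0 is the finest level),
% and the coarsest level l = L-1 has N_c = N / 4^{L-1} tokens.
\begin{align*}
\text{cost}
  &= \underbrace{N_c^{2}}_{\text{full attention at the coarsest level}}
   + \sum_{l=0}^{L-2} \underbrace{\frac{N}{4^{l}}}_{\text{queries}} \cdot
     \underbrace{4K}_{\text{keys per query}} \\
  &< N_c^{2} + 4KN \sum_{l=0}^{\infty} \frac{1}{4^{l}}
   = N_c^{2} + \frac{16}{3}\,KN .
\end{align*}
% If the pyramid depth is chosen so that N_c stays bounded by a constant,
% the total is O(KN), compared with N^2 score evaluations for dense attention.
```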
Findings
Achieves 4.0% improvement in feature matching on ScanNet
Reduces FLOPs by about 50% in stereo matching
Improves top-1 accuracy on ImageNet by 0.4-1.5%
Abstract
Transformers have been successful in many vision tasks, thanks to their capability of capturing long-range dependency. However, their quadratic computational complexity poses a major obstacle for applying them to vision tasks requiring dense predictions, such as object detection, feature matching, stereo, etc. We introduce QuadTree Attention, which reduces the computational complexity from quadratic to linear. Our quadtree transformer builds token pyramids and computes attention in a coarse-to-fine manner. At each level, the top K patches with the highest attention scores are selected, such that at the next level, attention is only evaluated within the relevant regions corresponding to these top K patches. We demonstrate that quadtree attention achieves state-of-the-art performance in various vision tasks, e.g. with 4.0% improvement in feature matching on ScanNet, about 50% flops…
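The coarse-to-fine selection described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions, not the authors' implementation: the function names (`pool2x2`, `children`, `quadtree_attention`), the use of 2x2 average pooling to build the pyramids, and the choice to return only the finest-level output (with no aggregation of messages across levels) are all choices made here for brevity.

```python
import numpy as np

def pool2x2(x):
    """Average-pool an (H, W, C) map over 2x2 blocks; H and W must be even."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def children(idx, w_coarse):
    """Map flat coarse indices (..., K) to their 4 fine-level children (..., 4K)."""
    r, c = idx // w_coarse, idx % w_coarse
    w_fine = 2 * w_coarse
    kids = [(2 * r + dr) * w_fine + (2 * c + dc) for dr in (0, 1) for dc in (0, 1)]
    return np.concatenate(kids, axis=-1)

def quadtree_attention(q, k, v, levels=3, top_k=2):
    """Simplified quadtree attention on (H, W, C) query/key/value maps."""
    c_dim = q.shape[-1]
    # Token pyramids; index 0 is the finest level.
    qs, ks = [q], [k]
    for _ in range(levels - 1):
        qs.append(pool2x2(qs[-1]))
        ks.append(pool2x2(ks[-1]))

    # Coarsest level: every query scores every key.
    q_flat = qs[-1].reshape(-1, c_dim)
    k_flat = ks[-1].reshape(-1, c_dim)
    scores = q_flat @ k_flat.T / np.sqrt(c_dim)
    cand = np.broadcast_to(np.arange(k_flat.shape[0]), scores.shape)

    for lvl in range(levels - 2, -1, -1):
        # Keep the top-K key patches per query at the coarser level.
        kk = min(top_k, scores.shape[1])
        top = np.argsort(-scores, axis=1)[:, :kk]
        selected = np.take_along_axis(cand, top, axis=1)
        # Expand the selected patches to their children at this level.
        key_cand = children(selected, ks[lvl + 1].shape[1])
        # Each query at this level inherits the candidates of its parent query.
        h, w, _ = qs[lvl].shape
        rows, cols = np.divmod(np.arange(h * w), w)
        parent = (rows // 2) * (w // 2) + (cols // 2)
        cand = key_cand[parent]
        # Score only the candidate keys.
        q_flat = qs[lvl].reshape(-1, c_dim)
        k_flat = ks[lvl].reshape(-1, c_dim)
        scores = np.einsum("nc,nmc->nm", q_flat, k_flat[cand]) / np.sqrt(c_dim)

    # Finest level: softmax over the surviving candidates, then mix the values.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    v_flat = v.reshape(-1, v.shape[-1])
    out = np.einsum("nm,nmc->nc", weights, v_flat[cand])
    return out.reshape(q.shape[0], q.shape[1], -1)
```

With `levels=1` the sketch reduces to ordinary dense attention, and the same holds whenever `top_k` is large enough to keep every coarse patch. Smaller values of `top_k` restrict each query to 4 x `top_k` keys per level, which is where the savings come from.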
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications
