QuadTree Attention for Vision Transformers

Shitao Tang; Jiahui Zhang; Siyu Zhu; Ping Tan

arXiv:2201.02767 · cs.CV · March 25, 2022 · 70 citations

TL;DR

QuadTree Attention introduces a linear-complexity attention mechanism for vision transformers, enabling efficient dense predictions and achieving state-of-the-art results across multiple vision tasks.

Contribution

The paper proposes a quadtree-based attention mechanism that reduces the computational complexity of attention from quadratic to linear in the number of tokens, improving both the efficiency and the accuracy of vision transformers.
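To make the quadratic-versus-linear claim concrete, the following Python sketch counts query-key score evaluations under a simplified cost model. The model is an assumption, not the paper's exact accounting: tokens form a square grid, each pyramid level halves the grid side, the coarsest level uses full attention, and every finer level scores each query against the four children of `top_k` patches kept from the level above. The function and parameter names are hypothetical.

```python
def full_attention_pairs(side):
    """Score evaluations for full attention on a side x side token grid."""
    n_tokens = side * side
    return n_tokens * n_tokens


def quadtree_attention_pairs(side, levels, top_k):
    """Score evaluations under the assumed coarse-to-fine cost model.

    Assumes `side` is divisible by 2 ** (levels - 1) and that `top_k`
    does not exceed the number of tokens at the coarsest level.
    """
    coarse_side = side // 2 ** (levels - 1)
    n_coarse = coarse_side * coarse_side
    # Coarsest level: every query is scored against every key.
    total = n_coarse * n_coarse
    # Finer levels: each query sees only the 4 children of top_k patches.
    for level in range(1, levels):
        level_side = coarse_side * 2 ** level
        n_queries = level_side * level_side
        total += n_queries * 4 * top_k
    return total


if __name__ == "__main__":
    # Keep the coarsest grid at 8 x 8 while the finest grid grows.
    for side, levels in [(32, 3), (64, 4)]:
        print(side * side,
              full_attention_pairs(side),
              quadtree_attention_pairs(side, levels, top_k=8))
```

Under these assumptions, quadrupling the token count from 1,024 to 4,096 (with one extra pyramid level so the coarsest grid stays 8 x 8) multiplies the full-attention count by 16 but the quadtree count by roughly 4, which is the behaviour expected of a cost that grows linearly in the number of tokens.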

Findings

01. Achieves a 4.0% improvement in feature matching on ScanNet

02. Reduces FLOPs by about 50% in stereo matching

03. Improves top-1 accuracy on ImageNet by 0.4-1.5%

Abstract

Transformers have been successful in many vision tasks, thanks to their capability of capturing long-range dependency. However, their quadratic computational complexity poses a major obstacle for applying them to vision tasks requiring dense predictions, such as object detection, feature matching, stereo, etc. We introduce QuadTree Attention, which reduces the computational complexity from quadratic to linear. Our quadtree transformer builds token pyramids and computes attention in a coarse-to-fine manner. At each level, the top K patches with the highest attention scores are selected, such that at the next level, attention is only evaluated within the relevant regions corresponding to these top K patches. We demonstrate that quadtree attention achieves state-of-the-art performance in various vision tasks, e.g. with 4.0% improvement in feature matching on ScanNet, about 50% flops…
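The coarse-to-fine selection described in the abstract can be illustrated with a minimal pure-Python sketch for a single query. It is not the authors' implementation: the 2x2 average pooling used to build the token pyramid, the plain dot-product score, the softmax applied only at the finest level, and all function names are assumptions made for illustration. How the paper combines messages across levels is not covered here.

```python
import math


def pool2x2(grid):
    """Average-pool a square grid of feature vectors by a factor of 2."""
    half = len(grid) // 2
    dim = len(grid[0][0])
    pooled = []
    for r in range(half):
        row = []
        for c in range(half):
            cells = [grid[2 * r + dr][2 * c + dc]
                     for dr in (0, 1) for dc in (0, 1)]
            row.append([sum(v[i] for v in cells) / 4.0 for i in range(dim)])
        pooled.append(row)
    return pooled


def build_pyramid(grid, levels):
    """Return token grids ordered from coarsest to finest."""
    pyramid = [grid]
    for _ in range(levels - 1):
        pyramid.append(pool2x2(pyramid[-1]))
    return pyramid[::-1]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def quadtree_attention_weights(queries, keys, query_pos, levels, top_k):
    """Attention weights of one finest-level query over its selected keys.

    At every level except the finest, only the top_k highest-scoring
    patches are kept, and their four children become the candidates
    at the next level.
    """
    q_pyr = build_pyramid(queries, levels)
    k_pyr = build_pyramid(keys, levels)
    coarse_side = len(k_pyr[0])
    # Coarsest level: every key patch is a candidate.
    candidates = [(r, c) for r in range(coarse_side)
                  for c in range(coarse_side)]
    for level in range(levels):
        # Ancestor of the query token at this pyramid level.
        shift = levels - 1 - level
        q = q_pyr[level][query_pos[0] >> shift][query_pos[1] >> shift]
        scores = {rc: dot(q, k_pyr[level][rc[0]][rc[1]]) for rc in candidates}
        if level == levels - 1:
            # Finest level: softmax over the surviving candidates only.
            peak = max(scores.values())
            exps = {rc: math.exp(s - peak) for rc, s in scores.items()}
            norm = sum(exps.values())
            return {rc: e / norm for rc, e in exps.items()}
        top = sorted(candidates, key=scores.get, reverse=True)[:top_k]
        candidates = [(2 * r + dr, 2 * c + dc)
                      for r, c in top for dr in (0, 1) for dc in (0, 1)]


if __name__ == "__main__":
    # 4 x 4 grid of 1-D features; one key stands out in the lower-right.
    keys = [[[0.0] for _ in range(4)] for _ in range(4)]
    keys[3][2] = [4.0]
    queries = [[[1.0] for _ in range(4)] for _ in range(4)]
    print(quadtree_attention_weights(queries, keys, (0, 0),
                                     levels=2, top_k=1))
```

In this toy input, the query at position (0, 0) first picks the lower-right coarse patch, then scores only that patch's four children, so 4 + 4 keys are scored instead of all 16.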

Code & Models

Repositories

tangshitao/quadtreeattention
PyTorch · Official

Videos

Quadtree Attention for Vision Transformers · SlidesLive

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications