FasterViT: Fast Vision Transformers with Hierarchical Attention

Ali Hatamizadeh; Greg Heinrich; Hongxu Yin; Andrew Tao; Jose M. Alvarez; Jan Kautz; Pavlo Molchanov

arXiv:2306.06189·cs.CV·April 3, 2024·35 cites

TL;DR

FasterViT introduces a hierarchical attention mechanism in hybrid CNN-ViT models, significantly improving image processing speed and accuracy across various computer vision tasks by reducing global self-attention complexity.

Contribution

The paper presents a novel Hierarchical Attention (HAT) approach that enhances global self-attention efficiency and can be integrated into existing networks for improved performance.

Findings

01. Achieves state-of-the-art accuracy and throughput in CV tasks

02. Enables faster processing of high-resolution images

03. HAT improves existing models as a plug-and-play module

Abstract

We design a new family of hybrid CNN-ViT neural networks, named FasterViT, with a focus on high image throughput for computer vision (CV) applications. FasterViT combines the benefits of fast local representation learning in CNNs and global modeling properties in ViT. Our newly introduced Hierarchical Attention (HAT) approach decomposes global self-attention with quadratic complexity into a multi-level attention with reduced computational costs. We benefit from efficient window-based self-attention. Each window has access to dedicated carrier tokens that participate in local and global representation learning. At a high level, global self-attentions enable the efficient cross-window communication at lower costs. FasterViT achieves a SOTA Pareto-front in terms of accuracy and image throughput. We have extensively validated its effectiveness on various CV tasks including classification,…
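The cost argument in the abstract can be made concrete with a rough count of query-key pairs. The sketch below is an illustrative accounting only, not the paper's complexity formula: the function names, the choice of one carrier token per window, and the 7x7 window are assumptions made for the example. It compares full self-attention over all tokens with a two-level scheme in which carrier tokens attend to each other globally and each window attends locally over its own tokens plus its carrier tokens.

```python
def attention_pairs_global(height, width):
    """Query-key pairs for full self-attention over all H*W tokens."""
    n = height * width
    return n * n


def attention_pairs_hierarchical(height, width, window, carriers_per_window=1):
    """Query-key pairs for a two-level scheme (illustrative accounting only).

    Level 1: carrier tokens from all windows attend to each other.
    Level 2: inside each window, the local tokens plus that window's
             carrier tokens attend to each other.
    """
    if height % window or width % window:
        raise ValueError("feature map must be divisible by the window size")
    num_windows = (height // window) * (width // window)

    # Global attention restricted to the (few) carrier tokens.
    total_carriers = num_windows * carriers_per_window
    carrier_level = total_carriers ** 2

    # Local attention inside every window, carrier tokens included.
    tokens_per_window = window * window + carriers_per_window
    window_level = num_windows * tokens_per_window ** 2

    return carrier_level + window_level


if __name__ == "__main__":
    # Hypothetical square feature maps of growing resolution.
    for side in (14, 28, 56):
        full = attention_pairs_global(side, side)
        two_level = attention_pairs_hierarchical(side, side, window=7)
        print(side, full, two_level, round(full / two_level, 1))
```

Under these assumptions the gap widens as resolution grows, because the window-level term scales linearly with the number of windows while the full-attention term scales quadratically with the number of tokens.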

Peer Reviews

Decision · ICLR 2024 poster

Reviewer 01 · Rating 6 · marginally above the acceptance threshold · Confidence 4

Strengths

1. Intuitive idea: combining conv + attention is not new, and is often considered more efficient than pure conv or pure transformer for image processing. Window attention is also not new, but the hierarchical window attention is interesting.

2. Impressive results: the results in Figure 1 are pretty impressive. FasterViT outperforms other models by a pretty good margin on ImageNet.

3. Well-written paper that is easy to follow.

Weaknesses

1. It is unclear how significant the proposed hierarchical attention (HAT) is. Table 5 shows that HAT is better than Twins and EdgeViT; however, Table 7 shows only marginal gains over the vanilla Swin Transformer when HAT is treated as a plug-and-play module.

2. The idea of HAT is not well motivated. Though it shows good empirical results, it is unclear why simply adding a per-window CT token can significantly improve quality. Would be nice to add a few more ablation studies or insight

Reviewer 02 · Rating 5 · marginally below the acceptance threshold · Confidence 5

Strengths

- The paper is overall well written and organized. The motivation for proposing the HAT module is technically sound: it is good practice to do more memory-intensive operations in the early stages while putting the compute-intensive operations in the later stages. This is also verified in the experiments, where the proposed model is more GPU-friendly than existing models.

- The HAT module is not unnecessarily complex and is intuitively easy to implement. It is also shown that it can be a p

Weaknesses

The major weakness of this work is its experimental evaluation. The paper claims that, with HAT and all the optimizations to the model architecture, the model has much lower complexity than conventional attention, and it compares the proposed model to a few efficient ViTs. However, the comparison is on an A100 GPU only, and there is no comparison on any other platform. Some recent works, such as EfficientFormer, FastViT and NextViT compared in Figure 1, all benchmarked their models on different mobile platform

Reviewer 03 · Rating 6 · marginally above the acceptance threshold · Confidence 4

Strengths

- FasterViT is tailored for high-resolution input images and demonstrates faster image throughput compared to competitive models, particularly in handling images with higher resolutions.

- The proposed HAT approach efficiently decomposes global self-attention into a multi-level attention mechanism, reducing computational complexity and enabling effective local and global representation learning. Overall, the idea of carrier tokens is novel and interesting.

- The paper extensively validates Faste

Weaknesses

- The comparisons in Tables 5 and 7 demonstrate minor improvements in accuracy, while the throughput in Table 5 is reduced when using HAT.

- Performance is compared on A100 GPUs only. More platforms should be used to see whether the throughput results are consistent. At present, the results are not conclusive.

Videos

FasterViT: Fast Vision Transformers with Hierarchical Attention · slideslive

Taxonomy

Topics: Advanced Neural Network Applications · Brain Tumor Detection and Classification · Domain Adaptation and Few-Shot Learning

Methods: Focus