FilterViT and DropoutViT
Bohang Sun (School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu, China)

TL;DR
This paper introduces FilterViT and DropoutViT, which enhance Vision Transformers by using attention-based downsampling with salient masks to improve efficiency, interpretability, and accuracy.
Contribution
The paper proposes a novel attention-based downsampling method using a filter block to select important pixels, reducing complexity and increasing interpretability.
Findings
Attending only to the top K scored pixels reduces computational complexity and speeds up processing.
Improves parameter efficiency and accuracy.
Maintains high performance with fewer resources.
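To make the complexity claim concrete, the sketch below counts the multiply-accumulate operations in the two matrix products of standard self-attention (the query-key product and the attention-weighted sum over values), which scale as n² · d for n tokens of width d. The 56×56 feature map, the width of 64, and the 25% keep ratio are illustrative assumptions, not values taken from the paper, and the count ignores the cost of scoring pixels and of the linear projections.

```python
def attention_macs(num_tokens: int, dim: int) -> int:
    """Approximate multiply-accumulates for Q @ K^T plus attn @ V."""
    return 2 * num_tokens * num_tokens * dim


# Hypothetical sizes, chosen only for illustration.
height, width, dim = 56, 56, 64
n_full = height * width   # every pixel of the feature map is a token
k_kept = n_full // 4      # assume only the top 25% of pixels are kept

full_cost = attention_macs(n_full, dim)
filtered_cost = attention_macs(k_kept, dim)
reduction = full_cost / filtered_cost

print(f"full: {full_cost:,} MACs, filtered: {filtered_cost:,} MACs, "
      f"ratio: {reduction:.0f}x")
```

Because the cost is quadratic in the token count, keeping a quarter of the tokens cuts these two products by a factor of sixteen rather than four.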
Abstract
In this study, we introduce an enhanced version of ViT that conducts attention-based QKV operations during the initial stages of downsampling. Performing attention directly on high-resolution feature maps is computationally demanding due to the large size and numerous tokens. To mitigate this, we propose a filter attention mechanism that uses a Filter Block to create a salient mask (Filter Mask) for selecting the most informative pixels for attention. The Filter Block scores the pixels of the feature map, and we sort these scores to retain only the top K pixels (with K varying across layers). This approach effectively decreases the number of tokens involved in the attention computation, reducing computational complexity and boosting processing speed. Furthermore, the salient mask provides interpretability, as the model focuses on regions of the image most critical to the outcome.…
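The selection step described in the abstract can be sketched roughly as follows, using NumPy. This is a minimal single-head illustration under stated assumptions, not the authors' implementation: the Filter Block is stood in for by a single linear scoring vector (`w_score`), the projection matrices are random, and unselected pixels are simply passed through unchanged. The function name `filter_attention` and all parameter names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def filter_attention(feat, w_score, w_q, w_k, w_v, k):
    """Run self-attention only on the k highest-scoring pixels.

    feat: (H, W, C) feature map. Returns (output (H, W, C), mask (H, W)).
    """
    h, w, c = feat.shape
    tokens = feat.reshape(h * w, c)

    # Assumption: the Filter Block is a single linear layer that
    # assigns one saliency score to each pixel.
    scores = tokens @ w_score                 # (H*W,)
    top_idx = np.argsort(scores)[-k:]         # indices of the k largest scores

    # Standard scaled dot-product attention over the selected tokens only.
    selected = tokens[top_idx]                # (k, C)
    q, key, v = selected @ w_q, selected @ w_k, selected @ w_v
    attn = softmax(q @ key.T / np.sqrt(c))    # (k, k)

    # Assumption: attended tokens are written back in place and
    # unselected pixels pass through unchanged.
    out = tokens.copy()
    out[top_idx] = attn @ v

    # The boolean mask marks which pixels took part in attention.
    mask = np.zeros(h * w, dtype=bool)
    mask[top_idx] = True
    return out.reshape(h, w, c), mask.reshape(h, w)


# Usage with random data: an 8x8 map with 16 channels, keeping 16 pixels.
rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 8, 16))
w_score = rng.standard_normal(16)
w_q, w_k, w_v = (rng.standard_normal((16, 16)) for _ in range(3))
out, mask = filter_attention(feat, w_score, w_q, w_k, w_v, k=16)
print(out.shape, int(mask.sum()))
```

The returned mask is what would be visualised for interpretability: it shows which pixels the model chose to attend over at that layer. A hard top-K selection like this one is not differentiable with respect to the scores, so a trainable version would need some way to pass gradients to the scoring layer; this sketch does not address that.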
Taxonomy
Topics: Time Series Analysis and Forecasting
Methods: Softmax · Attention Is All You Need · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · MobileViT
