FilterViT and DropoutViT

Bohang Sun (School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu, China)

arXiv:2410.22709 · cs.CV · November 12, 2024

TL;DR

This paper introduces FilterViT and DropoutViT, which enhance Vision Transformers by using attention-based downsampling with salient masks to improve efficiency, interpretability, and accuracy.

Contribution

The paper proposes a novel attention-based downsampling method using a filter block to select important pixels, reducing complexity and increasing interpretability.

Findings

01. Reduces computational complexity and speeds up processing.

02. Improves parameter efficiency and accuracy.

03. Maintains high performance with fewer resources.

Abstract

In this study, we introduce an enhanced version of ViT that conducts attention-based QKV operations during the initial stages of downsampling. Performing attention directly on high-resolution feature maps is computationally demanding due to the large size and numerous tokens. To mitigate this, we propose a filter attention mechanism that uses a Filter Block to create a salient mask (Filter Mask) for selecting the most informative pixels for attention. The Filter Block scores the pixels of the feature map, and we sort these scores to retain only the top K pixels (with K varying across layers). This approach effectively decreases the number of tokens involved in the attention computation, reducing computational complexity and boosting processing speed. Furthermore, the salient mask provides interpretability, as the model focuses on regions of the image most critical to the outcome.…
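
The score, sort, and top-K selection described in the abstract can be illustrated with a small, dependency-free sketch. It is not the paper's implementation: the linear scoring vector `w`, the identity Q/K/V projections, the single attention head, and the choice to pass unselected tokens through unchanged are all assumptions made to keep the example short, and the function names are hypothetical.

```python
import math


def filter_scores(tokens, w):
    """Hypothetical Filter Block: one linear score per token (pixel)."""
    return [sum(x * wi for x, wi in zip(tok, w)) for tok in tokens]


def top_k_mask(scores, k):
    """Return a 0/1 salient mask that keeps the k highest-scoring positions."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    mask = [0] * len(scores)
    for i in order[:k]:
        mask[i] = 1
    return mask


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def filter_attention(tokens, w, k):
    """Self-attention restricted to the top-k tokens.

    Assumptions (not from the paper): identity Q/K/V projections, a single
    head, and unselected tokens copied to the output unchanged.
    """
    mask = top_k_mask(filter_scores(tokens, w), k)
    idx = [i for i, m in enumerate(mask) if m]
    selected = [tokens[i] for i in idx]
    d = len(tokens[0])
    out = [list(tok) for tok in tokens]
    for pos, q in zip(idx, selected):
        # Scaled dot-product logits against the selected tokens only.
        logits = [sum(a * b for a, b in zip(q, key)) / math.sqrt(d)
                  for key in selected]
        weights = softmax(logits)
        out[pos] = [sum(wt * v[j] for wt, v in zip(weights, selected))
                    for j in range(d)]
    return out, mask


# Four "pixels" with two channels each; scores are 1.0, 3.0, 4.0, 1.0.
tokens = [[1.0, 0.0], [0.0, 3.0], [2.0, 2.0], [0.5, 0.5]]
w = [1.0, 1.0]
out, mask = filter_attention(tokens, w, k=2)
```

With N tokens, full self-attention computes N × N pairwise logits, while attention over the K selected tokens computes K × K. In this toy case that is 4 logits instead of 16, and `mask` doubles as the salient map showing which positions the model attended to.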

Taxonomy

Topics: Time Series Analysis and Forecasting

Methods: Softmax · Attention Is All You Need · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · MobileViT