Speed-up of Vision Transformer Models by Attention-aware Token Filtering
Takahiro Naruko, Hiroaki Akutsu

TL;DR
This paper introduces Attention-aware Token Filtering (ATF), a method that accelerates Vision Transformer models by filtering tokens before they reach the encoder, achieving a 2.8x speed-up without sacrificing accuracy.
Contribution
The paper proposes a novel token filtering module and strategy that speed up ViT models without modifying the transformer encoder or losing performance.
Findings
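ATF achieves a 2.8x speed-up on retrieval tasks.
ATF maintains the retrieval recall rate.
The method keeps tokens in regions of specific object types, selected dynamically, and tokens in regions that statically receive high attention in the transformer encoder.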
Abstract
Vision Transformer (ViT) models have made breakthroughs in image embedding extraction, delivering state-of-the-art performance in tasks such as zero-shot image classification. However, the models suffer from a high computational burden. In this paper, we propose a novel speed-up method for ViT models called Attention-aware Token Filtering (ATF). ATF consists of two main ideas: a novel token filtering module and a filtering strategy. The token filtering module is inserted between the tokenizer and the transformer encoder of the ViT model, without modifying or fine-tuning the transformer encoder. The module filters the tokens fed to the encoder so that it dynamically keeps tokens in regions of specific object types and statically keeps tokens in regions that receive high attention in the transformer encoder. This filtering strategy maintains task accuracy while filtering out…
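The filtering step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names `build_keep_mask` and `filter_tokens`, the box format, the `static_keep_ratio` parameter, and the idea of precomputing a per-patch attention map offline are all assumptions made for illustration. The sketch also assumes that positional information is already attached to the tokens by the tokenizer, so that dropping tokens does not require any change to the encoder.

```python
import numpy as np


def build_keep_mask(grid_h, grid_w, object_boxes, static_attention, static_keep_ratio):
    """Return a flat boolean mask over the patch grid marking tokens to keep.

    object_boxes: hypothetical (top, left, bottom, right) boxes in patch-grid
        coordinates, end-exclusive, e.g. produced per image by some detector.
    static_attention: hypothetical (grid_h, grid_w) array of per-patch attention
        scores, assumed to be precomputed once and reused for every image.
    static_keep_ratio: fraction of patches kept by the static criterion.
    """
    # Dynamic part: keep patches that fall inside object regions of this image.
    dynamic = np.zeros((grid_h, grid_w), dtype=bool)
    for top, left, bottom, right in object_boxes:
        dynamic[top:bottom, left:right] = True

    # Static part: keep the k patches with the highest precomputed attention.
    k = int(round(static_keep_ratio * grid_h * grid_w))
    static = np.zeros(grid_h * grid_w, dtype=bool)
    if k > 0:
        top_idx = np.argsort(static_attention.ravel())[-k:]
        static[top_idx] = True

    # A token survives if either criterion selects it.
    return dynamic.ravel() | static


def filter_tokens(tokens, keep_mask):
    """Select the kept tokens; also return their original indices."""
    keep_idx = np.flatnonzero(keep_mask)
    return tokens[keep_idx], keep_idx


# Toy usage: a 4x4 patch grid with 8-dimensional tokens.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))
attention = np.arange(16, dtype=float).reshape(4, 4)  # made-up static scores
mask = build_keep_mask(4, 4, [(0, 0, 2, 2)], attention, static_keep_ratio=0.125)
kept, kept_idx = filter_tokens(tokens, mask)
print(kept.shape, kept_idx.tolist())
```

In this toy setup the encoder would receive 6 of the 16 tokens: four from the object box and two from the static attention criterion. Because the filtering happens before the encoder, the encoder weights are left untouched, which matches the abstract's claim that no modification or fine-tuning is needed.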
Taxonomy
Topics: CCD and CMOS Imaging Sensors · Neural Networks and Applications · Infrared Target Detection Methodologies
