Token Pruning using a Lightweight Background Aware Vision Transformer
Sudhakar Sah, Ravish Kumar, Honnesh Rohmetra, Ehsan Saboori

TL;DR
This paper introduces BAViT, a background-aware token pruning method for Vision Transformers that reduces runtime memory and increases throughput on edge devices by effectively identifying and pruning background tokens.
Contribution
The paper presents a novel pre-processing block, BAViT, that classifies and prunes background tokens in ViT-based object detectors, improving efficiency without significant accuracy loss.
Findings
BAViT separates background and foreground tokens with 75% accuracy using 2 layers and 88% accuracy using 10 layers.
Using BAViT as a pre-processor increases YOLOS throughput by 30-40%.
The approach maintains competitive mAP with minimal fine-tuning.
Abstract
High runtime memory and high latency put significant constraints on Vision Transformer training and inference, especially on edge devices. Token pruning reduces the number of input tokens to the ViT based on an importance criterion for each token. We present a Background Aware Vision Transformer (BAViT) model, a pre-processing block for object detection models such as DETR/YOLOS that aims to reduce runtime memory and increase throughput by using a novel approach to identify background tokens in the image. The background tokens can be pruned completely or partially before being fed to a ViT-based object detector. We use the semantic information provided by the segmentation map and/or bounding box annotations to train a few layers of ViT to classify tokens as either foreground or background. Using 2 layers and 10 layers of BAViT, background and foreground tokens can be separated with 75% and 88% accuracy on…
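The labelling and pruning idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `token_labels_from_boxes` and `kept_token_indices`, the "any overlap with a box counts as foreground" rule, and the choice to keep the first background tokens under partial pruning are all assumptions made for illustration. In BAViT the labels derived this way would serve as training targets for a few ViT layers; the sketch only shows how such targets and the resulting pruned token set could be computed.

```python
from typing import List, Tuple

# Bounding box in pixels: (x_min, y_min, x_max, y_max). Hypothetical convention.
Box = Tuple[float, float, float, float]


def token_labels_from_boxes(img_w: int, img_h: int, patch: int,
                            boxes: List[Box]) -> List[int]:
    """Return one label per patch token in row-major order.

    1 = foreground, 0 = background. A patch is marked foreground if it
    overlaps any box (an assumed rule, not taken from the paper).
    """
    cols, rows = img_w // patch, img_h // patch
    labels = []
    for r in range(rows):
        for c in range(cols):
            px0, py0 = c * patch, r * patch
            px1, py1 = px0 + patch, py0 + patch
            overlaps = any(
                px0 < x1 and px1 > x0 and py0 < y1 and py1 > y0
                for (x0, y0, x1, y1) in boxes
            )
            labels.append(1 if overlaps else 0)
    return labels


def kept_token_indices(labels: List[int], keep_fraction: float = 0.0) -> List[int]:
    """Return sorted indices of tokens that survive pruning.

    All foreground tokens are kept. keep_fraction = 0.0 prunes every
    background token; larger values keep that fraction of background
    tokens (here simply the first ones, an arbitrary illustrative choice).
    """
    foreground = [i for i, lab in enumerate(labels) if lab == 1]
    background = [i for i, lab in enumerate(labels) if lab == 0]
    n_keep = int(keep_fraction * len(background))
    return sorted(foreground + background[:n_keep])


if __name__ == "__main__":
    # 64x64 image, 16x16 patches: a 4x4 grid of 16 tokens.
    labels = token_labels_from_boxes(64, 64, 16, [(16, 16, 48, 48)])
    print(kept_token_indices(labels))        # complete background pruning
    print(kept_token_indices(labels, 0.5))   # partial background pruning
```

With complete pruning, only the tokens overlapping the box are passed on to the detector; with partial pruning, some background context is retained, which matches the abstract's statement that background tokens can be pruned completely or partially.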
Taxonomy
Topics: Image Processing Techniques and Applications
Methods: Dense Connections · Residual Connection · Dropout · Layer Normalization · Adam · Byte Pair Encoding · Absolute Position Encodings · Vision Transformer · Softmax · Attention Is All You Need
