Token Pruning using a Lightweight Background Aware Vision Transformer

Sudhakar Sah; Ravish Kumar; Honnesh Rohmetra; Ehsan Saboori

arXiv:2410.09324·cs.CV·October 15, 2024

TL;DR

This paper introduces BAViT, a background-aware token pruning method for Vision Transformers that reduces runtime memory and increases throughput on edge devices by effectively identifying and pruning background tokens.

Contribution

The paper presents a novel pre-processing block, BAViT, that classifies and prunes background tokens in ViT-based object detectors, improving efficiency without significant accuracy loss.
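The pruning step described above can be illustrated with a minimal sketch. It assumes a background/foreground classifier has already produced one foreground score per token; the function name `prune_background_tokens`, the 0.5 threshold, and the toy values are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple


def prune_background_tokens(
    tokens: Sequence[Sequence[float]],
    fg_scores: Sequence[float],
    threshold: float = 0.5,  # assumed cut-off, not a value from the paper
) -> Tuple[List[Sequence[float]], List[int]]:
    """Keep only tokens whose foreground score is at least `threshold`."""
    if len(tokens) != len(fg_scores):
        raise ValueError("need exactly one score per token")
    # Indices of tokens treated as foreground, in their original order.
    kept_indices = [i for i, score in enumerate(fg_scores) if score >= threshold]
    kept_tokens = [tokens[i] for i in kept_indices]
    return kept_tokens, kept_indices


# Toy example: 6 patch tokens with 2-dimensional embeddings (made-up values).
tokens = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0], [1.1, 1.2]]
fg_scores = [0.05, 0.92, 0.40, 0.81, 0.10, 0.66]

kept_tokens, kept_indices = prune_background_tokens(tokens, fg_scores)
print(kept_indices)  # [1, 3, 5]
```

Returning the kept indices alongside the tokens is a design choice of this sketch: a downstream detector would likely still need to know where each surviving token sat in the image, for example to apply the right position encoding.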

Findings

1. BAViT separates background and foreground tokens with 75% accuracy using 2 layers and 88% accuracy using 10 layers.

2. Using BAViT as a pre-processor increases YOLOS throughput by 30-40%.

3. The approach maintains competitive mAP with minimal fine-tuning.

Abstract

High runtime memory and high latency put significant constraints on Vision Transformer training and inference, especially on edge devices. Token pruning reduces the number of input tokens to the ViT based on an importance criterion for each token. We present the Background Aware Vision Transformer (BAViT), a pre-processing block for object detection models such as DETR and YOLOS that aims to reduce runtime memory and increase throughput by using a novel approach to identify background tokens in the image. The background tokens can be pruned completely or partially before being fed to a ViT-based object detector. We use the semantic information provided by segmentation maps and/or bounding box annotations to train a few layers of a ViT to classify tokens as either foreground or background. Using 2 layers and 10 layers of BAViT, background and foreground tokens can be separated with 75% and 88% accuracy on…
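One way to picture how bounding box annotations could supervise a token classifier is sketched below. It rests on an assumption not confirmed by the abstract: a patch token is labelled foreground if its pixel area overlaps any annotated box. The function name `token_labels_from_boxes`, the 64x64 image, and the 16-pixel patch size are all hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple

# Box format assumed here: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def token_labels_from_boxes(
    image_w: int, image_h: int, patch: int, boxes: Sequence[Box]
) -> List[int]:
    """Return one label per patch in row-major order: 1 = foreground, 0 = background."""
    labels = []
    for top in range(0, image_h, patch):
        for left in range(0, image_w, patch):
            right, bottom = left + patch, top + patch
            # Standard axis-aligned rectangle overlap test against every box.
            overlaps = any(
                left < x_max and x_min < right and top < y_max and y_min < bottom
                for (x_min, y_min, x_max, y_max) in boxes
            )
            labels.append(1 if overlaps else 0)
    return labels


# Toy example: a 64x64 image split into a 4x4 grid of 16-pixel patches,
# with a single object box covering the centre of the image.
labels = token_labels_from_boxes(64, 64, 16, [(20.0, 20.0, 40.0, 40.0)])
foreground = [i for i, v in enumerate(labels) if v == 1]
print(foreground)  # [5, 6, 9, 10]
```

Labels built this way are coarse, since any patch touching a box counts as foreground even if most of it is background; a segmentation map, where available, would give a tighter per-patch signal.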

Taxonomy

Topics: Image Processing Techniques and Applications

Methods: Dense Connections · Residual Connection · Dropout · Layer Normalization · Adam · Byte Pair Encoding · Absolute Position Encodings · Vision Transformer · Softmax · Attention Is All You Need