LookupViT: Compressing visual information to a limited number of tokens

Rajat Koner; Gagan Jain; Prateek Jain; Volker Tresp; Sujoy Paul

arXiv:2407.12753·cs.CV·July 18, 2024

TL;DR

LookupViT compresses visual information into a small, fixed number of tokens inside vision transformers, cutting inference FLOPs by about 2x while maintaining or improving accuracy across multiple vision tasks.

Contribution

The paper proposes a general-purpose vision transformer block that compresses high-resolution tokens into a fixed number of tokens, which makes processing cheaper and lets the block be applied to various ViT variants and tasks.
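To make the idea of compressing many tokens into a few more concrete, here is a minimal sketch, not the authors' implementation. It assumes, purely for illustration, that compression is a single-head cross-attention in which a small set of query vectors attends over the high-resolution tokens; this excerpt does not specify the actual mechanism. The names `compress_tokens`, `high_res`, and `queries`, the sizes `N`, `M`, `D`, and the absence of learned projections are all hypothetical choices.

```python
import math
import random


def softmax(row):
    # Numerically stable softmax over a list of floats.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def compress_tokens(queries, tokens):
    """Cross-attend M query vectors over N token vectors; returns M vectors.

    Hypothetical helper: no learned projections, single head.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in queries:
        # Scaled dot-product score of this query against every token.
        scores = [scale * sum(qi * ti for qi, ti in zip(q, t)) for t in tokens]
        weights = softmax(scores)
        # Each output vector is a convex combination of the input tokens.
        out.append(
            [sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(dim)]
        )
    return out


random.seed(0)
N, M, D = 196, 25, 8  # illustrative sizes, not taken from the paper
high_res = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
queries = [[random.gauss(0, 1) for _ in range(D)] for _ in range(M)]

compressed = compress_tokens(queries, high_res)
print(len(compressed), len(compressed[0]))  # 25 8
```

Whatever the number of input tokens, the output always has `M` rows, so any later layer that works on the compressed set has a cost independent of the input resolution.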

Findings

01. 2x reduction in FLOPs across multiple domains

02. Maintains or improves accuracy compared to standard ViT

03. Enhances robustness and generalization on corrupted datasets

Abstract

Vision Transformers (ViT) have emerged as the de facto choice for numerous industry-grade vision solutions. But their inference cost can be prohibitive for many settings, as they compute self-attention in each layer, which suffers from quadratic computational complexity in the number of tokens. On the other hand, spatial information in images and spatio-temporal information in videos is usually sparse and redundant. In this work, we introduce LookupViT, which aims to exploit this information sparsity to reduce ViT inference cost. LookupViT provides a novel general-purpose vision transformer block that operates by compressing information from higher-resolution tokens to a fixed number of tokens. These few compressed tokens undergo meticulous processing, while the higher-resolution tokens are passed through computationally cheaper layers. Information sharing between these two token sets is…
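The quadratic cost mentioned in the abstract can be illustrated with a rough count of multiply-accumulate operations. This is a back-of-the-envelope sketch, not the paper's cost model: the helper `attention_macs` is hypothetical, it counts only the score matrix and the weighted sum of values (ignoring projections, softmax, and the MLP), and the sizes 196 tokens and width 768 are illustrative values matching a ViT-B/16 at 224x224 input.

```python
def attention_macs(num_tokens, dim):
    # Hypothetical estimate for one self-attention layer:
    # scores Q @ K^T cost N*N*D MACs, weights @ V cost another N*N*D.
    return 2 * num_tokens * num_tokens * dim


full = attention_macs(196, 768)  # all patch tokens
half = attention_macs(98, 768)   # half as many tokens
print(full // half)  # 4
```

Halving the token count cuts this term by a factor of four, which is why running the expensive layers on a small, fixed set of tokens can reduce inference cost substantially.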

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Video Analysis and Summarization

Methods: Attention Is All You Need · Softmax · Residual Connection · Layer Normalization · Linear Layer · Multi-Head Attention · Dense Connections · Vision Transformer