Token Compression Meets Compact Vision Transformers: A Survey and Comparative Evaluation for Edge AI

Phat Nguyen; Ngai-Man Cheung

arXiv:2507.09702·cs.CV·July 15, 2025

Open Access

TL;DR

This paper surveys and evaluates token compression techniques for Vision Transformers, showing that they are effective on standard models but face challenges when applied to compact, edge-deployable architectures.

Contribution

It offers the first systematic taxonomy and comparative analysis of token compression methods across different ViT architectures and deployment scenarios.

Findings

01. Token compression improves inference speed on standard ViTs.

02. Token compression methods often underperform on compact, resource-constrained models.

03. The results point to a need to adapt these techniques for edge AI applications.
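
As a rough illustration of why fewer tokens means faster inference, the sketch below counts multiply-accumulates (MACs) for one self-attention block. The function name `attention_macs` and the sizes used (197 tokens, width 768, the usual ViT-B/16 configuration at 224x224 resolution) are assumptions chosen for illustration and are not taken from the paper; softmax, biases, and the MLP are ignored.

```python
def attention_macs(num_tokens, dim):
    """Approximate multiply-accumulates for one self-attention block.

    Counts the Q, K, V and output projections (4 * N * d^2) and the two
    N x N products QK^T and AV (2 * N^2 * d). Softmax, biases and the MLP
    are left out, so this is an estimate, not a full-model cost.
    """
    projections = 4 * num_tokens * dim * dim
    attention = 2 * num_tokens * num_tokens * dim
    return projections + attention


# Illustrative sizes only: 196 patches + 1 class token, embedding width 768.
full = attention_macs(197, 768)
# Roughly half of the tokens kept after compression.
half = attention_macs(99, 768)
print(f"attention cost ratio (full / compressed): {full / half:.2f}")
```

Because the projection term is linear in the token count and the attention term is quadratic, halving the tokens reduces this estimate by a factor between 2 and 4. Measured throughput gains depend on the hardware and on the overhead of the compression step itself.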

Abstract

Token compression techniques have recently emerged as powerful tools for accelerating Vision Transformer (ViT) inference in computer vision. Because self-attention has quadratic computational complexity in the token sequence length, these methods remove less informative tokens before the attention layers to improve inference throughput. While numerous studies have explored various accuracy-efficiency trade-offs on large-scale ViTs, two critical gaps remain. First, there is no unified survey that systematically categorizes and compares token compression approaches based on their core strategies (e.g., pruning, merging, or hybrid) and deployment settings (e.g., fine-tuning vs. plug-in). Second, most benchmarks are limited to standard ViT models (e.g., ViT-B, ViT-L), leaving open the question of whether such methods remain effective when applied to structurally compressed…
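
The two core strategies named in the abstract, pruning and merging, can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not any specific method from the survey: the helper names `prune_tokens` and `merge_most_similar` are hypothetical, the importance scores are assumed to be given (real methods derive them in method-specific ways), and merging is reduced to averaging the single most similar pair by cosine similarity.

```python
import math


def prune_tokens(tokens, scores, keep):
    """Keep the class token (index 0) plus the `keep` highest-scoring patches.

    tokens: list of feature vectors; tokens[0] is the class token.
    scores: one importance score per patch token (len(tokens) - 1 values).
            How scores are obtained is method-specific; assumed given here.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:keep])  # restore the original token order
    return [tokens[0]] + [tokens[i + 1] for i in kept]


def cosine(a, b):
    """Cosine similarity between two vectors (small epsilon avoids 0/0)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def merge_most_similar(tokens):
    """Average the most similar pair of patch tokens into one token."""
    best = None
    for i in range(1, len(tokens)):          # skip the class token
        for j in range(i + 1, len(tokens)):
            sim = cosine(tokens[i], tokens[j])
            if best is None or sim > best[0]:
                best = (sim, i, j)
    if best is None:                          # fewer than two patch tokens
        return list(tokens)
    _, i, j = best
    merged = [(x + y) / 2 for x, y in zip(tokens[i], tokens[j])]
    # Put the merged token at position i and drop position j.
    return [merged if k == i else t for k, t in enumerate(tokens) if k != j]


# Toy input: one class token followed by four 2-D patch tokens.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 2.0], [3.0, 0.1]]
scores = [0.1, 0.5, 0.2, 0.9]  # hypothetical per-patch importance scores

print(len(prune_tokens(tokens, scores, keep=2)))  # class token + 2 patches
print(len(merge_most_similar(tokens)))            # one token fewer than input
```

Pruning discards the information in the dropped tokens, while merging folds it into a surviving token; a hybrid approach would combine the two.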

Taxonomy

Topics: CCD and CMOS Imaging Sensors

Methods: Dropout · Vision Transformer · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · Layer Normalization · Dense Connections · Softmax · Transformer