Token Transforming: A Unified and Training-Free Token Compression Framework for Vision Transformer Acceleration
Fanhu Zeng, Deli Yu, Zhenglun Kong, Hao Tang

TL;DR
This paper introduces a unified, training-free token transformation framework for vision transformers that reduces computational costs by compressing tokens through explicit matrix transformations, improving efficiency with minimal accuracy loss.
Contribution
It unifies existing token compression methods into a general, training-free matrix transformation framework, enabling effective acceleration of vision transformers across various tasks.
Findings
Reduced FLOPs by 40% on DeiT-S with only a 0.1% accuracy drop
Achieved 1.5x acceleration on DeiT-S
Extended the method to multiple dense prediction tasks with consistent improvements
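To see why removing tokens translates into FLOPs savings, the following rough sketch estimates the multiply-accumulate count of a plain ViT encoder as a function of the token count in each block. The cost model is a standard approximation that ignores patch embedding, normalization, and the classifier head. The token schedule is hypothetical and is not the schedule used in the paper; the DeiT-S shape (width 384, 12 blocks, 197 tokens) is the standard configuration.

```python
def vit_block_macs(num_tokens, dim, mlp_ratio=4):
    """Approximate multiply-accumulates for one transformer block."""
    qkv_and_proj = 4 * num_tokens * dim * dim        # Q, K, V and output projections
    attention = 2 * num_tokens * num_tokens * dim    # QK^T and attention-weighted V
    mlp = 2 * mlp_ratio * num_tokens * dim * dim     # two linear layers of the MLP
    return qkv_and_proj + attention + mlp


def vit_macs(tokens_per_block, dim):
    """Sum block costs given the token count entering each block."""
    return sum(vit_block_macs(n, dim) for n in tokens_per_block)


DIM, DEPTH, TOKENS = 384, 12, 197  # DeiT-S: width, depth, 196 patches + class token

baseline = vit_macs([TOKENS] * DEPTH, DIM)

# Hypothetical schedule (an assumption for illustration): 13 fewer tokens per block.
schedule = [TOKENS - 13 * i for i in range(DEPTH)]
compressed = vit_macs(schedule, DIM)

reduction = 1 - compressed / baseline
print(f"baseline MACs:   {baseline:,}")
print(f"compressed MACs: {compressed:,}")
print(f"reduction:       {reduction:.1%}")
```

Because the projection and MLP terms scale linearly with the token count and the attention term scales quadratically, compressing tokens in early blocks saves more than compressing them late.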
Abstract
Vision transformers have been widely explored in various vision tasks. Because of their heavy computational cost, there is growing interest in dynamically compressing vision transformers at the token level. Current methods mainly rely on token pruning or merging to reduce the token count; both compress tokens exclusively, which causes substantial information loss, so post-training is inevitably required to recover performance. In this paper, we rethink token reduction and unify the process as an explicit form of token matrix transformation, in which all existing methods construct special forms of matrices within the framework. Furthermore, we propose a many-to-many Token Transforming framework that generalizes all existing methods and preserves the most information, even enabling training-free acceleration. We conduct extensive experiments to…
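The matrix view described in the abstract can be illustrated with a minimal sketch: compressing N tokens to M tokens is a multiplication by an M x N matrix, and pruning, merging, and many-to-many transforming differ only in how that matrix is filled. The helper name `transform`, the toy tokens, and all weights below are hypothetical choices for illustration; the paper's actual construction of the matrix is not shown here.

```python
def transform(W, X):
    """Apply an M x N matrix W to N tokens X (each of dimension D), giving M tokens."""
    dim = len(X[0])
    return [
        [sum(w * token[d] for w, token in zip(row, X)) for d in range(dim)]
        for row in W
    ]


# Four toy tokens with two channels each (N = 4, D = 2).
X = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]

# Pruning: every row is one-hot, so kept tokens are copied and the rest discarded.
W_prune = [[1, 0, 0, 0],
           [0, 0, 1, 0]]

# Merging: each input token feeds exactly one output token (many-to-one).
W_merge = [[0.5, 0.5, 0.0, 0.0],
           [0.0, 0.0, 0.5, 0.5]]

# Many-to-many: an input token may contribute to several output tokens.
# Rows sum to 1 here, which is an assumption made for this sketch.
W_many = [[0.6, 0.3, 0.1, 0.0],
          [0.0, 0.1, 0.3, 0.6]]

pruned = transform(W_prune, X)
merged = transform(W_merge, X)
mixed = transform(W_many, X)

print("pruned:", pruned)
print("merged:", merged)
print("mixed: ", mixed)
```

In the pruning and merging matrices every column has at most one nonzero entry, which is the "exclusive" compression the abstract refers to; the many-to-many matrix drops that restriction.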
Taxonomy
Topics: Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging
Methods: Dense Connections · Layer Normalization · Vision Transformer · Pruning
