ToFe: Lagged Token Freezing and Reusing for Efficient Vision Transformer Inference
Haoyue Zhang, Jie Zhang, Song Guo

TL;DR
This paper introduces ToFe, a framework for vision transformers that freezes less important tokens and reuses them in later blocks, reducing computation while largely preserving accuracy. This makes it a candidate for deployment on resource-limited devices.
Contribution
ToFe is the first method to freeze and reuse tokens dynamically in vision transformers, balancing efficiency and performance through end-to-end training.
Findings
Reduces LV-ViT computational cost by 50%.
Keeps the accuracy drop below 2% at that cost reduction.
Outperforms existing token reduction methods.
Abstract
Although vision transformers (ViTs) have shown remarkable success in various vision tasks, their computationally expensive self-attention hinders deployment on resource-constrained devices. Token reduction, which discards less important tokens during forward propagation, has been proposed to make transformer models more efficient. However, existing methods handle unimportant tokens irreversibly, which prevents their reuse in subsequent blocks. Because transformers focus on different information in different blocks, tokens reduced in early blocks might be useful later. Furthermore, to adapt transformer models to resource-constrained devices, it is crucial to strike a balance between model performance and computational overhead. To address these challenges, we introduce a novel Token Freezing and Reusing (ToFe) framework, where we identify important tokens at each…
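The freeze-and-reuse idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `select_active`, `freeze_and_reuse`, and `halve`, the magnitude-based scoring function, and the fixed keep ratio are all hypothetical, and each "block" is a plain callable standing in for a transformer block. The only point it tries to show is that a token skipped in one block stays in the sequence unchanged and can be selected again later, instead of being discarded for good.

```python
from typing import Callable, List, Sequence, Tuple

Token = List[float]
Block = Callable[[List[Token]], List[Token]]  # stand-in for a transformer block


def select_active(scores: Sequence[float], keep_ratio: float) -> List[int]:
    """Return indices of the highest-scoring tokens, in original order."""
    k = max(1, int(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def freeze_and_reuse(
    tokens: Sequence[Token],
    blocks: Sequence[Block],
    score_fn: Callable[[Token], float],  # hypothetical importance score
    keep_ratio: float,                   # hypothetical fixed budget per block
) -> Tuple[List[Token], List[List[int]]]:
    """Run each block on the active tokens only; frozen tokens are kept as-is."""
    state = [list(t) for t in tokens]
    active_per_block: List[List[int]] = []
    for block in blocks:
        scores = [score_fn(t) for t in state]
        active = select_active(scores, keep_ratio)
        updated = block([state[i] for i in active])
        # Write results back; tokens not in `active` are frozen (unchanged),
        # but remain in `state` so a later block can pick them up again.
        for i, new_token in zip(active, updated):
            state[i] = new_token
        active_per_block.append(active)
    return state, active_per_block


def halve(batch: List[Token]) -> List[Token]:
    """Toy block: scales every active token by 0.5."""
    return [[0.5 * x for x in t] for t in batch]


out, active_per_block = freeze_and_reuse(
    tokens=[[1.0], [2.0], [3.0], [4.0]],
    blocks=[halve, halve],
    score_fn=lambda t: sum(abs(x) for x in t),
    keep_ratio=0.5,
)
# active_per_block == [[2, 3], [1, 3]]: token 1 is frozen in the first
# block and reused in the second.
```

With a keep ratio of 0.5, each toy block processes half of the tokens, which mirrors the kind of compute budget the summary reports. In a real ViT the saving would depend on how attention treats frozen tokens, which this sketch does not model.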
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Memory and Neural Computing · CCD and CMOS Imaging Sensors
