Lossless Token Merging Even Without Fine-Tuning in Vision Transformers
Jaeyeon Lee, Dong-Wan Choi

TL;DR
This paper introduces Adaptive Token Merging (ATM), a lossless, training-free method for reducing tokens in Vision Transformers, significantly decreasing computational costs without sacrificing accuracy.
Contribution
ATM is a novel, adaptive token merging technique that prevents information loss and eliminates the need for fine-tuning in Vision Transformers.
Findings
ATM outperforms existing training-free methods.
ATM reduces FLOPs by over 30% without accuracy loss.
ATM surpasses many training-based approaches despite requiring no additional training.
Abstract
Although Vision Transformers (ViTs) have become the standard architecture in computer vision, their massive sizes lead to significant computational overhead. Token compression techniques have attracted considerable attention to address this issue, but they often suffer from severe information loss, requiring extensive additional training to achieve practical performance. In this paper, we propose Adaptive Token Merging (ATM), a novel method that ensures lossless token merging, eliminating the need for fine-tuning while maintaining competitive performance. ATM adaptively reduces tokens across layers and batches by carefully adjusting layer-specific similarity thresholds, thereby preventing the undesirable merging of less similar tokens with respect to each layer. Furthermore, ATM introduces a novel token matching technique that considers not only similarity but also merging sizes,…
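The threshold idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the alternating source/destination split is borrowed from common bipartite token-merging schemes, the size-weighted average is an assumption about how merged tokens are combined, and the names `cosine` and `merge_above_threshold` are hypothetical. It only shows why a similarity threshold makes the number of merged tokens depend on the input rather than on a fixed per-layer budget.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_above_threshold(tokens, sizes, threshold):
    """Merge a source token into its most similar destination token
    only when their similarity reaches `threshold` (illustrative sketch).

    tokens: list of feature vectors; sizes: how many original tokens
    each vector already represents. Returns (new_tokens, new_sizes).
    """
    # Assumption: even-indexed tokens are sources, odd-indexed are destinations.
    src = list(range(0, len(tokens), 2))
    dst = list(range(1, len(tokens), 2))

    # Per destination: size-weighted feature sum and accumulated size.
    acc = {j: ([x * sizes[j] for x in tokens[j]], sizes[j]) for j in dst}
    kept = []  # sources with no sufficiently similar destination

    for i in src:
        best_j, best_sim = None, -1.0
        for j in dst:
            sim = cosine(tokens[i], tokens[j])
            if sim > best_sim:
                best_j, best_sim = j, sim
        if best_j is not None and best_sim >= threshold:
            vec, size = acc[best_j]
            merged = [v + x * sizes[i] for v, x in zip(vec, tokens[i])]
            acc[best_j] = (merged, size + sizes[i])
        else:
            kept.append(i)  # below threshold: leave this token unmerged

    new_tokens = [list(tokens[i]) for i in kept]
    new_sizes = [sizes[i] for i in kept]
    for j in dst:
        vec, size = acc[j]
        new_tokens.append([v / size for v in vec])  # size-weighted mean
        new_sizes.append(size)
    return new_tokens, new_sizes


# Two near-duplicate pairs: (0, 1) and (2, 3).
tokens = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
sizes = [1, 1, 1, 1]

# A hypothetical per-layer schedule: a strict threshold, then a looser one.
for layer_threshold in (0.999, 0.9):
    merged, merged_sizes = merge_above_threshold(tokens, sizes, layer_threshold)
    print(layer_threshold, len(merged), merged_sizes)
```

With the strict threshold no pair qualifies and all four tokens survive; with the looser one both pairs merge and two tokens remain. The total size is conserved in both cases, which is what lets later layers weight merged tokens by how many originals they represent.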
