TL;DR
MaMe and MaRe are matrix-based, GPU-efficient token merging and restoration methods that significantly accelerate vision transformers and image synthesis with minimal accuracy loss.
Contribution
Introduction of MaMe and MaRe, novel matrix-based, training-free token merging and restoration techniques that improve efficiency and performance in vision models.
Findings
MaMe doubles ViT-B throughput with only a 2% accuracy drop.
Fine-tuning the last layer with MaMe boosts ViT-B accuracy by 1.0% at 1.1x speed.
MaMe accelerates VideoMAE-L by 48.5% on Kinetics-400.
Abstract
Token compression is crucial for mitigating the quadratic complexity of self-attention in Vision Transformers (ViTs), which often process many input tokens. Existing methods, such as ToMe, rely on GPU-inefficient operations (e.g., sorting, scattered writes), introducing overheads that limit their effectiveness. We introduce MaMe, a training-free, differentiable token merging method based entirely on matrix operations, which makes it GPU-friendly and well suited to accelerating ViTs. Additionally, we present MaRe, its inverse operation, for token restoration, forming a MaMe+MaRe pipeline for image synthesis. When applied to pre-trained models, MaMe doubles ViT-B throughput with a 2% accuracy drop. Notably, fine-tuning the last layer with MaMe boosts ViT-B accuracy by 1.0% at 1.1x speed. In SigLIP2-B@512 zero-shot classification, MaMe provides 1.3x acceleration with negligible performance…
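To make the idea of merging and restoring tokens purely through matrix products more concrete, here is a minimal NumPy sketch. It is not the authors' MaMe or MaRe algorithm, whose details are not given in this summary. The function name `build_merge_matrices`, the strided anchor selection, the softmax assignment, and the `temperature` parameter are all assumptions chosen for illustration. The sketch only shows that a soft assignment matrix can replace sorting and scattered writes, and that the same matrix can map merged tokens back to the original token count.

```python
import numpy as np


def build_merge_matrices(x, num_merged, temperature=0.1):
    """Illustrative only: build merge/restore matrices from cosine similarity.

    x          : (n, d) array of token embeddings
    num_merged : number of tokens to keep after merging (m <= n)
    Returns (merge, assign):
      merge  : (m, n), rows sum to 1, so merge @ x is a weighted average
      assign : (m, n), columns sum to 1, so assign.T @ y restores n tokens
    """
    n = x.shape[0]
    # Unit-normalise tokens so dot products are cosine similarities.
    xn = x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)

    # Hypothetical anchor choice: evenly spaced tokens act as merge targets.
    anchor_idx = np.linspace(0, n - 1, num_merged).round().astype(int)

    # Similarity of every anchor to every token: one dense matmul, no sorting.
    sim = xn[anchor_idx] @ xn.T  # (m, n)

    # Softmax over anchors: each token spreads its weight across merged slots.
    logits = sim / temperature
    logits = logits - logits.max(axis=0, keepdims=True)
    assign = np.exp(logits)
    assign = assign / assign.sum(axis=0, keepdims=True)

    # Row-normalise so each merged token is a convex combination of inputs.
    merge = assign / assign.sum(axis=1, keepdims=True)
    return merge, assign


def merge_tokens(x, merge):
    """Reduce n tokens to m tokens with a single matrix product."""
    return merge @ x


def restore_tokens(y, assign):
    """Expand m merged tokens back to n tokens with a single matrix product."""
    return assign.T @ y
```

A call such as `merge, assign = build_merge_matrices(x, 8)` followed by `restore_tokens(merge_tokens(x, merge), assign)` returns an array with the same shape as `x`. Because every step is a dense product or an elementwise operation, the whole pipeline stays differentiable and maps directly onto GPU matrix kernels, which is the property the abstract emphasises.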
