MaMe & MaRe: Matrix-Based Token Merging and Restoration for Efficient Visual Perception and Synthesis

Simin Huo; Ning Li

arXiv:2604.13432·cs.CV·April 16, 2026

TL;DR

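MaMe and MaRe are training-free token merging and restoration methods built entirely from matrix operations, which keeps them GPU-efficient. Applied to pre-trained models, MaMe doubles ViT-B throughput with a 2% accuracy drop, and the MaMe+MaRe pipeline extends the approach to image synthesis.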

Contribution

MaMe, a training-free, differentiable token merging method based entirely on matrix operations, and MaRe, its inverse operation for token restoration. Together they form a MaMe+MaRe pipeline for image synthesis.

Findings

1. MaMe doubles ViT-B throughput with only a 2% accuracy drop when applied to pre-trained models.

2. Fine-tuning the last layer with MaMe boosts ViT-B accuracy by 1.0% at 1.1x speed.

3. MaMe accelerates VideoMAE-L by 48.5% on Kinetics-400.

Abstract

Token compression is crucial for mitigating the quadratic complexity of self-attention in Vision Transformers (ViTs), which often process large numbers of input tokens. Existing methods, such as ToMe, rely on GPU-inefficient operations (e.g., sorting, scattered writes) whose overhead limits their effectiveness. We introduce MaMe, a training-free, differentiable token merging method based entirely on matrix operations, which makes it GPU-friendly and well suited to accelerating ViTs. Additionally, we present MaRe, its inverse operation, for token restoration, forming a MaMe+MaRe pipeline for image synthesis. When applied to pre-trained models, MaMe doubles ViT-B throughput with a 2% accuracy drop. Notably, fine-tuning the last layer with MaMe boosts ViT-B accuracy by 1.0% at 1.1x speed. In SigLIP2-B@512 zero-shot classification, MaMe provides 1.3x acceleration with negligible performance…
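To make the idea of merging and restoring tokens "entirely with matrix operations" concrete, here is a minimal NumPy sketch. It is not the authors' algorithm: the evenly spaced anchor choice, the cosine-similarity softmax, the `temperature` parameter, and the function name `soft_merge_matrices` are all assumptions made for illustration. It only shows that a merge step can be written as one matrix product (K x N times N x d) and a restore step as another (N x K times K x d), with no sorting or scattered writes.

```python
import numpy as np


def soft_merge_matrices(x, num_merged, temperature=0.1):
    """Build illustrative merge (K x N) and restore (N x K) matrices.

    Hypothetical sketch, not the MaMe/MaRe formulation from the paper.
    x: (N, d) token matrix; num_merged: K, the number of tokens to keep.
    """
    n = x.shape[0]
    # Assumption: use evenly spaced tokens as anchors (a fixed index pattern).
    anchor_idx = np.linspace(0, n - 1, num_merged).round().astype(int)

    # Cosine similarity between every token and every anchor: shape (N, K).
    x_norm = x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)
    sim = x_norm @ x_norm[anchor_idx].T

    # Row-wise softmax: each token distributes unit weight over the anchors.
    logits = sim / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    assign = np.exp(logits)
    assign = assign / assign.sum(axis=1, keepdims=True)  # (N, K), rows sum to 1

    # Merge matrix: transpose, then normalize so each merged token is a
    # weighted average of the original tokens (rows sum to 1).
    merge = assign.T / assign.sum(axis=0)[:, None]  # (K, N)
    return merge, assign


rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))           # N = 16 tokens, d = 8 channels
merge, restore = soft_merge_matrices(tokens, num_merged=4)

merged = merge @ tokens                     # (4, 8): compressed token set
restored = restore @ merged                 # (16, 8): back to original length
```

Because both steps are dense matrix products built from a softmax, the whole path is differentiable, which is the property the abstract highlights. The restored tokens are an approximation of the originals, not an exact inverse, in this sketch.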

Code & Models

Repository: cominder/mame (GitHub)