Faster Vision Mamba is Rebuilt in Minutes via Merged Token Re-training
Mingjia Shi, Yuhao Zhou, Ruiji Yu, Zekai Li, Zhiyuan Liang, Xuanlei Zhao, Xiaojiang Peng, Shanmukha Ramakrishna Vedantam, Wangbo Zhao, Kai Wang, Yang You

TL;DR
This paper introduces R-MeeTo, a fast re-training framework for Vision Mamba (Vim) that recovers the accuracy lost to token merging, enabling efficient token compression with minimal performance loss.
Contribution
It proposes a fast re-training step, applied after token merging, that restores model accuracy and significantly reduces the training time needed for compressed Vision Mamba models.
Findings
After recovery with R-MeeTo, pruned Vims lose at most 0.9% accuracy on ImageNet-1K.
Achieves a 35.9% accuracy improvement over 3 epochs of re-training on Vim-Ti.
Retrains Vim-Ti/S/B within 5/7/17 minutes with minimal accuracy drop.
Abstract
Vision Mamba has shown close to state-of-the-art performance on computer vision tasks, drawing much interest in increasing its efficiency. A promising approach is token reduction, which has been successfully implemented in ViTs. However, pruning informative tokens in Mamba leads to a high loss of key knowledge and degraded performance. The alternative of merging tokens preserves more information than pruning, but it also suffers at large compression ratios. Our key insight is that a quick round of retraining after token merging yields robust results across various compression ratios. Empirically, pruned Vims drop only up to 0.9% accuracy on ImageNet-1K after recovery by our proposed framework R-MeeTo in our main evaluation. We show how simply and effectively this fast recovery can be achieved at the minute level, in particular, a 35.9% accuracy spike over 3 epochs of training on Vim-Ti. Moreover, Vim-Ti/S/B are…
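To make the token-merging idea in the abstract concrete, the following is a minimal sketch in plain Python. It assumes a ToMe-style bipartite matching (even-position tokens are matched to their most similar odd-position token by cosine similarity, and the `r` best-matching pairs are averaged); this is an illustrative assumption, not necessarily the exact merging rule used by R-MeeTo. The names `cosine` and `merge_tokens` are hypothetical, and the re-training step that follows merging is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def merge_tokens(tokens, r):
    """Merge r even-position tokens into their most similar odd-position token.

    tokens: list of feature vectors (lists of floats), in sequence order.
    r: number of tokens to remove by merging.
    Returns a shorter list; surviving tokens keep their original order,
    which matters for a sequential model such as Mamba.
    """
    src_idx = list(range(0, len(tokens), 2))  # candidates to merge away
    dst_idx = list(range(1, len(tokens), 2))  # merge destinations
    if not dst_idx:
        return [list(t) for t in tokens]
    r = max(0, min(r, len(src_idx)))

    # For each source token, find its best-matching destination token.
    matches = []
    for i in src_idx:
        best = max(dst_idx, key=lambda j: cosine(tokens[i], tokens[j]))
        matches.append((cosine(tokens[i], tokens[best]), i, best))

    # Keep only the r most similar (source, destination) pairs.
    matches.sort(key=lambda m: m[0], reverse=True)
    merged_away = set()
    groups = {j: [tokens[j]] for j in dst_idx}
    for _, i, j in matches[:r]:
        groups[j].append(tokens[i])
        merged_away.add(i)

    # Rebuild the sequence: drop merged sources, average each destination group.
    out = []
    for k, tok in enumerate(tokens):
        if k in merged_away:
            continue
        members = groups.get(k, [tok])
        out.append([sum(m[d] for m in members) / len(members)
                    for d in range(len(tok))])
    return out
```

For example, `merge_tokens([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.0, 0.9]], 1)` shortens a four-token sequence to three tokens by averaging the two most similar tokens (the third and fourth). Pruning would instead discard a token outright, which is the information loss the abstract contrasts merging against.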
Peer Reviews
Decision · ICLR 2026 Conference Withdrawn Submission
The paper addresses a timely and practical problem: improving the computational efficiency of Vision Mamba, a promising new architecture. The initial diagnosis of the problem is correct. The field has indeed observed that Vision Mamba's sequential nature makes it highly sensitive to token reduction methods designed for ViTs. The proposed solution (merge + fast retrain) is simple and practical, and the reported re-training times (e.g., 5-17 minutes) are impressive.
The paper's entire argument and central claim hinge on the assertion that retraining is necessary because "training-free is not a good solution for... Mamba". This claim is unsubstantiated and is invalidated by the authors' failure to cite or compare against MTR (Mamba Token Reduction), a SOTA framework specifically designed for this problem and explicitly advertised as "training-free". Invalidated Conclusion: Because of this omission, the paper fails to prove its central thesis. It has not sh
1. The paper identifies and explains a key inefficiency in token pruning for Mamba models. 2. The paper provides a clear information-theoretic analysis. 3. The proposed model achieves better results than the pruning baseline.
1. I'm concerned that the paper compares against only weak baselines. For example, the Transformer baseline is still DeiT, whose base-sized model reaches only 81.8% accuracy on ImageNet, which is outdated. A reasonable ViT-Base performance on ImageNet should be over 83.0% (e.g., DeiT-III [1]). Does the conclusion "Mamba is more sensitive than Transformers to pruning" still hold true for DeiT-III? 2. The scalability is unknown. The paper only conducts experiments with base-level models with less than 10
1. Simple and easy to follow. The merging and re-training mechanisms are not complex and R-MeeTo yields smaller drops at matched or lower FLOPs. This indicates competitive effectiveness at comparable cost. The experiment tables also show that the throughput improves across GPUs with modest accuracy loss. Settings are detailed and this aids replication. 2. Experiments are designed across image and video models and results on Kinetics-400 show consistent behavior and FLOP reductions with small dro
1. Theoretical assumptions and derivations need tighter justification. Assumptions are strong and not validated empirically. The consequences of misspecification are not analyzed in Sec. 2.3. This affects technical soundness. The proof of Theorem 1 uses interaction information decompositions with limited rigor. Proposition 2 (no dependency before t) may be an oversimplification for selective SSMs, given the diverse scan orders and layer settings of common models such as the Vim model. This impacts correct
The discussion of token merging and token pruning for Mamba is interesting. This work brings some insights to efficient Vision Mamba design.
The token merging strategy has been widely explored by well-known training-free methods on Vision Transformers. Adaptation to Mamba seems like a natural extension. The contribution is limited.
Taxonomy
Topics: Medical Imaging and Analysis
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Mamba: Linear-Time Sequence Modeling with Selective State Spaces · Pruning
