Towards Efficient Vision State Space Models via Token Merging
Jinyoung Park, Minseok Son, Changick Kim

TL;DR
This paper introduces MaMe, a token-merging strategy for vision State Space Models that improves computational efficiency while maintaining performance, and generalizes well across vision, video, and audio tasks.
Contribution
MaMe is a novel token-merging method designed specifically for SSM-based vision models. It addresses two challenges: quantifying token importance and preserving sequential properties.
Findings
MaMe achieves superior efficiency-performance trade-offs for both fine-tuned and off-the-shelf models.
It maintains robustness under aggressive token reduction.
It demonstrates strong generalization across multiple domains.
Abstract
State Space Models (SSMs) have emerged as powerful architectures in computer vision, yet improving their computational efficiency remains crucial for practical and scalable deployment. While token reduction serves as an effective approach for model efficiency, applying it to SSMs requires careful consideration of their unique sequential modeling capabilities. In this work, we propose MaMe, a token-merging strategy tailored for SSM-based vision models. MaMe addresses two key challenges: quantifying token importance and preserving sequential properties. Our approach leverages the state transition parameter as an informativeness measure and introduces strategic token arrangements to preserve sequential information flow. Extensive experiments demonstrate that MaMe achieves superior efficiency-performance trade-offs for both fine-tuned and off-the-shelf models. Particularly,…
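To make the general idea of importance-guided, order-preserving token merging concrete, here is a minimal NumPy sketch. It is not the authors' algorithm. The abstract says the state transition parameter serves as the informativeness measure but does not say how the score is computed, so the sketch takes a precomputed `scores` array as input. The function name `merge_tokens`, the cosine-similarity matching rule, and the averaging step are all assumptions made for illustration.

```python
import numpy as np

def merge_tokens(tokens, scores, r):
    """Merge the r lowest-scoring tokens into their most similar kept token.

    tokens: (N, D) array of token features.
    scores: (N,) importance scores (hypothetical stand-in for a score
            derived from the SSM state transition parameter).
    r:      number of tokens to merge away.
    Returns the merged (N - r, D) tokens and the original indices kept.
    """
    n = tokens.shape[0]
    if not 0 <= r < n:
        raise ValueError("r must satisfy 0 <= r < number of tokens")

    order = np.argsort(scores, kind="stable")  # least important first
    drop = np.sort(order[:r])
    keep = np.sort(order[r:])  # sorted indices keep the sequence order

    merged = tokens[keep].astype(float)  # astype returns a copy
    counts = np.ones(len(keep))

    if r > 0:
        def unit(x):
            return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)

        # Cosine similarity between each dropped token and each kept token.
        sim = unit(tokens[drop]) @ unit(tokens[keep]).T  # (r, N - r)
        target = sim.argmax(axis=1)

        # Unbuffered accumulation handles several tokens sharing one target.
        np.add.at(merged, target, tokens[drop])
        np.add.at(counts, target, 1.0)

    merged /= counts[:, None]  # average each kept token with its merged tokens
    return merged, keep
```

Because `keep` is sorted, the surviving tokens stay in their original scan order, which is the property a sequential model such as an SSM depends on. How MaMe itself arranges tokens is not specified in the text above.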
