Towards Efficient Vision State Space Models via Token Merging

Jinyoung Park; Minseok Son; Changick Kim

arXiv:2508.13599·cs.CV·August 20, 2025

TL;DR

This paper introduces MaMe, a token-merging strategy for vision State Space Models that improves computational efficiency while maintaining performance, and generalizes well across vision, video, and audio tasks.

Contribution

MaMe is a token-merging method designed specifically for SSM-based vision models. It addresses two challenges: quantifying token importance and preserving the sequential properties of the token stream.

Findings

01

MaMe achieves better efficiency-performance trade-offs for both fine-tuned and off-the-shelf models.

02

It maintains robustness under aggressive token reduction.

03

It generalizes well across vision, video, and audio tasks.

Abstract

State Space Models (SSMs) have emerged as powerful architectures in computer vision, yet improving their computational efficiency remains crucial for practical and scalable deployment. While token reduction serves as an effective approach for model efficiency, applying it to SSMs requires careful consideration of their unique sequential modeling capabilities. In this work, we propose MaMe, a token-merging strategy tailored for SSM-based vision models. MaMe addresses two key challenges: quantifying token importance and preserving sequential properties. Our approach leverages the state transition parameter $\Delta$ as an informativeness measure and introduces strategic token arrangements to preserve sequential information flow. Extensive experiments demonstrate that MaMe achieves superior efficiency-performance trade-offs for both fine-tuned and off-the-shelf models. Particularly,…
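The abstract names $\Delta$ as the informativeness signal but does not spell out the merging procedure. The sketch below is only a rough illustration of the general idea and is not the authors' algorithm. It assumes one scalar $\Delta$ score per token, treats the lowest-scoring tokens as merge candidates, folds each into its most similar kept token by cosine similarity, and returns the kept tokens in their original sequence order. The function name `merge_low_delta_tokens`, the cosine-similarity matching, and the plain averaging are all illustrative assumptions, and the paper's "strategic token arrangements" are not modeled here.

```python
import numpy as np


def merge_low_delta_tokens(tokens, delta, r):
    """Hypothetical sketch: merge the r lowest-delta tokens into similar kept tokens.

    tokens: (N, D) array of token features.
    delta:  (N,) array, one assumed informativeness score per token.
    r:      number of tokens to merge away, with 0 <= r < N.
    Returns (merged, keep): merged is (N - r, D); keep holds the original
    indices of the surviving tokens in ascending (sequence) order.
    """
    tokens = np.asarray(tokens, dtype=float)
    delta = np.asarray(delta, dtype=float)
    n = tokens.shape[0]
    if delta.shape != (n,):
        raise ValueError("delta must have one score per token")
    if not 0 <= r < n:
        raise ValueError("r must satisfy 0 <= r < number of tokens")

    # Rank tokens by score; the r smallest are treated as least informative.
    order = np.argsort(delta, kind="stable")
    drop = order[:r]
    keep = np.sort(order[r:])  # sorting restores the original sequence order

    # Match each dropped token to its most similar kept token (cosine similarity).
    unit = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    target = (unit[drop] @ unit[keep].T).argmax(axis=1)

    # Average each kept token with the dropped tokens assigned to it.
    merged = tokens[keep].copy()
    counts = np.ones(len(keep))
    np.add.at(merged, target, tokens[drop])  # unbuffered add handles repeated targets
    np.add.at(counts, target, 1.0)
    return merged / counts[:, None], keep


if __name__ == "__main__":
    x = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.0, 2.0]])
    d = np.array([0.9, 0.1, 0.8, 0.2])
    merged, keep = merge_low_delta_tokens(x, d, r=2)
    print(keep, merged)
```

Sorting the surviving indices before gathering is the one design choice tied to the abstract: a sequential model scans tokens in order, so the sketch keeps survivors in their original positions instead of in score order.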
