Fast SAM2 with Text-Driven Token Pruning
Avilasha Mandal, Chaoning Zhang, Fachrina Dewi Puspitasari, Xudong Wang, Jiaquan Zhang, Caiyan Qin, Guoqing Wang, Yang Yang, Heng Tao Shen

TL;DR
This paper introduces a text-guided token pruning method for SAM2 that cuts the cost of video segmentation by discarding tokens less relevant to the target object before temporal propagation. It reports up to 42.50% faster inference and 37.41% lower GPU memory usage while keeping segmentation performance competitive.
Contribution
The proposed framework is the first to incorporate text-driven token pruning into SAM2, enhancing inference speed and memory efficiency without altering the core segmentation architecture.
Findings
Achieves up to 42.50% faster inference
Reduces GPU memory usage by 37.41%
Maintains competitive segmentation performance
Abstract
Segment Anything Model 2 (SAM2), a vision foundation model, has significantly advanced prompt-driven video object segmentation, yet its practical deployment remains limited by the high computational and memory cost of processing dense visual tokens across time. SAM2 pipelines typically propagate all visual tokens produced by the image encoder through downstream temporal reasoning modules, regardless of their relevance to the target object, which limits scalability because of the quadratic overhead of memory attention. In this work, we introduce a text-guided token pruning framework that improves inference efficiency by selectively reducing token density prior to temporal propagation, without modifying the underlying segmentation architecture. Operating after visual encoding and before memory-based propagation, our method ranks tokens using a lightweight routing mechanism that…
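The abstract is cut off before it describes the routing mechanism, so the actual scoring rule is not given here. As a rough sketch of the general idea only (score each visual token against a text embedding, keep a fixed fraction, and pass the survivors on to the memory stage), the pure-Python example below uses cosine similarity as a stand-in score. The names `prune_tokens` and `attention_pairs`, the `keep_ratio` parameter, and the cosine scoring are all assumptions made for illustration, not the paper's implementation or any SAM2 API.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def prune_tokens(tokens, text_embedding, keep_ratio):
    """Keep the highest-scoring fraction of tokens, preserving their original order.

    ASSUMPTION: scoring by cosine similarity to a text embedding is a
    placeholder for the paper's (unspecified here) routing mechanism.
    Returns (kept_indices, kept_tokens).
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    k = max(1, math.ceil(keep_ratio * len(tokens)))
    scores = [cosine(t, text_embedding) for t in tokens]
    # Rank token indices from most to least text-relevant.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    # Restore spatial order among the survivors.
    kept = sorted(ranked[:k])
    return kept, [tokens[i] for i in kept]


def attention_pairs(num_tokens):
    """Number of query-key pairs in a full attention over num_tokens tokens."""
    return num_tokens * num_tokens


if __name__ == "__main__":
    # Hypothetical 2-D token embeddings and text embedding, for illustration.
    tokens = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
    text = [1.0, 0.0]
    kept, pruned = prune_tokens(tokens, text, keep_ratio=0.5)
    print(kept, attention_pairs(len(tokens)), attention_pairs(len(kept)))
```

In this toy case, halving the token count reduces the number of query-key pairs in a full attention from 16 to 4, which illustrates why pruning ahead of a quadratic-cost stage can reduce both compute and memory. The savings reported by the paper depend on its actual mechanism and are not reproduced by this sketch.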
Taxonomy
Topics: Visual Attention and Saliency Detection · Advanced Neural Network Applications · Multimodal Machine Learning Applications
