Fast SAM2 with Text-Driven Token Pruning

Avilasha Mandal; Chaoning Zhang; Fachrina Dewi Puspitasari; Xudong Wang; Jiaquan Zhang; Caiyan Qin; Guoqing Wang; Yang Yang; Heng Tao Shen

arXiv:2512.21333·cs.CV·December 25, 2025

TL;DR

This paper introduces a text-guided token pruning method for SAM2 that cuts the compute and memory cost of video segmentation by removing tokens less relevant to the target object before temporal propagation. It reports up to 42.50% faster inference and 37.41% lower GPU memory usage while keeping segmentation performance competitive.

Contribution

The proposed framework is the first to incorporate text-driven token pruning into SAM2, enhancing inference speed and memory efficiency without altering the core segmentation architecture.

Findings

01. Achieves up to 42.50% faster inference
02. Reduces GPU memory usage by 37.41%
03. Maintains competitive segmentation performance

Abstract

Segment Anything Model 2 (SAM2), a vision foundation model, has significantly advanced prompt-driven video object segmentation, yet its practical deployment remains limited by the high computational and memory cost of processing dense visual tokens across time. SAM2 pipelines typically propagate all visual tokens produced by the image encoder through downstream temporal reasoning modules, regardless of their relevance to the target object, which limits scalability because of the quadratic overhead of memory attention. In this work, we introduce a text-guided token pruning framework that improves inference efficiency by selectively reducing token density prior to temporal propagation, without modifying the underlying segmentation architecture. Operating after visual encoding and before memory-based propagation, our method ranks tokens using a lightweight routing mechanism that…
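The general idea of ranking visual tokens against a text query and keeping only the top-scoring ones can be sketched as follows. This is a minimal illustration, not the paper's routing mechanism, whose details are cut off in the abstract above. The cosine-similarity score, the function names `cosine` and `prune_tokens`, and the `keep_ratio` parameter are all assumptions made for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def prune_tokens(tokens, text_embedding, keep_ratio=0.5):
    """Keep the fraction of tokens most similar to the text embedding.

    Hypothetical helper: cosine similarity stands in for whatever
    relevance score the paper's router actually computes.
    Returns (kept_indices, kept_tokens), with indices in original order.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if not tokens:
        return [], []
    # Number of tokens to keep, always at least one.
    k = max(1, math.ceil(keep_ratio * len(tokens)))
    scores = [cosine(t, text_embedding) for t in tokens]
    # Rank token indices by descending relevance score.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    # Restore spatial order among the survivors.
    kept = sorted(ranked[:k])
    return kept, [tokens[i] for i in kept]


# Toy example: four 2-D "visual tokens" and a text embedding along the first axis.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
kept, pruned = prune_tokens(tokens, [1.0, 0.0], keep_ratio=0.5)
print(kept)
```

In this toy case the two tokens most aligned with the text embedding survive. If the downstream attention cost grows quadratically with token count, as the abstract describes, keeping a fraction r of the tokens would cut the number of pairwise interactions to roughly r² of the original; that scaling is a back-of-the-envelope estimate, not a figure reported by the paper.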

Taxonomy

Topics: Visual Attention and Saliency Detection · Advanced Neural Network Applications · Multimodal Machine Learning Applications