StructSAM: Structure- and Spectrum-Preserving Token Merging for Segment Anything Models

Duy M. H. Nguyen; Tuan A. Tran; Duong Nguyen; Siwei Xie; Trung Q. Nguyen; Mai T. N. Truong; Daniel Palenicek; An T. Le; Michael Barz; TrungTin Nguyen; Tuan Dam; Ngan Le; Minh Vu; Khoa Doan; Vien Ngo; Pengtao Xie; James Zou; Daniel Sonntag; Jan Peters; Mathias Niepert

arXiv:2603.07307·cs.CV·March 10, 2026

Open Access

TL;DR

StructSAM is a token merging framework for Segment Anything Models that preserves object boundaries and prompt information, reducing encoder FLOPs by 25-30% on average while keeping segmentation accuracy competitive across benchmarks.

Contribution

The paper presents StructSAM, a resolution-preserving merge-unmerge framework tailored to SAM that scores tokens using first-order feature gradients, improving encoder efficiency without sacrificing boundary or prompt integrity.

Findings

01. Reduces encoder FLOPs by 25-30% on average

02. Maintains competitive segmentation accuracy across benchmarks

03. Outperforms existing token merging methods in efficiency and boundary preservation

Abstract

Recent token merging techniques for Vision Transformers (ViTs) provide substantial speedups by reducing the number of tokens processed by self-attention, often without retraining. However, their direct application to the Segment Anything Model (SAM) family is nontrivial: SAM's image encoder mixes windowed and global attention, and its mask decoder relies on dense, prompt-conditioned features for precise boundary prediction. We systematically evaluate representative token-merging methods on SAM and Medical SAM in a strict off-the-shelf setting, and find that existing destination-selection heuristics can erode boundaries and leak prompt information as merge rates increase. We propose \textbf{StructSAM}, a resolution-preserving merge-unmerge framework tailored to SAM. StructSAM computes a lightweight token-energy score from first-order feature gradients, uses grid-based flatness screening…
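The abstract is truncated here, so the exact energy formula, grid size, and thresholds used by StructSAM are not available from this page. As a rough illustration of the general idea only, the sketch below scores each token by the magnitude of finite-difference feature gradients on the token grid, marks grid cells whose energy stays below a threshold as "flat", averages the tokens in flat cells, and broadcasts the average back so the output keeps the input resolution. The function names (`token_energy`, `flat_cells`, `merge_unmerge`), the cell size `cell`, and the threshold `tau` are all hypothetical choices for this example and are not taken from the paper.

```python
import numpy as np

def token_energy(feats):
    """Per-token energy for a (H, W, C) grid of token features.

    Assumption: energy is the L2 norm of forward finite differences
    along both grid axes. The paper's actual score may differ.
    """
    # Repeat the last row/column so the difference there is zero.
    dy = np.diff(feats, axis=0, append=feats[-1:, :, :])
    dx = np.diff(feats, axis=1, append=feats[:, -1:, :])
    return np.sqrt((dx ** 2).sum(-1) + (dy ** 2).sum(-1))

def flat_cells(energy, cell=2, tau=0.1):
    """Mark each cell x cell block as flat if its max energy is below tau.

    Requires H and W to be divisible by `cell` (hypothetical constraint).
    """
    H, W = energy.shape
    blocks = energy.reshape(H // cell, cell, W // cell, cell)
    return blocks.max(axis=(1, 3)) < tau

def merge_unmerge(feats, flat, cell=2):
    """Average tokens inside flat cells, then broadcast back to (H, W, C).

    In a real pipeline attention would run on the reduced token set
    between the merge and the unmerge; that step is omitted here.
    """
    H, W, C = feats.shape
    blocks = feats.reshape(H // cell, cell, W // cell, cell, C)
    means = blocks.mean(axis=(1, 3), keepdims=True)
    out = np.where(flat[:, None, :, None, None], means, blocks)
    return out.reshape(H, W, C)

# Toy grid: a vertical step edge between columns 1 and 2.
feats = np.zeros((4, 4, 1))
feats[:, 2:, :] = 1.0
energy = token_energy(feats)
flat = flat_cells(energy, cell=2, tau=0.1)
merged = merge_unmerge(feats, flat, cell=2)
```

In this toy case the cells that touch the step edge are excluded from merging, while the cells on the uniform right-hand side are eligible; tokens near a boundary therefore keep their individual features, which is the behavior the abstract attributes to boundary-aware destination selection.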

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications · Adversarial Robustness in Machine Learning