Efficient-SAM2: Accelerating SAM2 with Object-Aware Visual Encoding and Memory Retrieval
Jing Zhang, Zhikai Li, Xuewen Liu, Qingyi Gu

TL;DR
Efficient-SAM2 accelerates SAM2's video object segmentation by concentrating computation on salient object regions through object-aware routing and memory retrieval, achieving a 1.68x speedup on SAM2.1-L with only a 1.0% accuracy drop on the SA-V test set.
Contribution
The paper introduces Efficient-SAM2, a post-training acceleration method that adaptively reduces redundant computation by leveraging object saliency and temporal consistency.
Findings
Achieves a 1.68x speedup on the SAM2.1-L model.
Incurs only a 1.0% accuracy drop on the SA-V test set.
Reduces unnecessary background computation effectively.
Abstract
Segment Anything Model 2 (SAM2) shows excellent performance in video object segmentation tasks; however, its heavy computational burden hinders its application in real-time video processing. Although there have been efforts to improve the efficiency of SAM2, most of them focus on retraining a lightweight backbone, with little exploration of post-training acceleration. In this paper, we observe that SAM2 exhibits a sparse perception pattern similar to biological vision, which provides opportunities for eliminating redundant computation and accelerating inference: i) In the mask decoder, attention focuses primarily on the foreground objects, whereas the image encoder in the earlier stage exhibits a broad attention span, which results in unnecessary computation on background regions. ii) In the memory bank, only a small subset of tokens in each frame contributes significantly to memory attention, and the salient…
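The memory-bank observation in the abstract (only a small subset of tokens contributes significantly to memory attention) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the `keep_ratio` parameter, the use of mean received attention as the saliency score, and the idea of scoring once and reusing the selected indices are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def select_salient_memory(queries, mem_keys, keep_ratio):
    """Return sorted indices of the memory tokens that receive the most attention.

    Hypothetical helper: saliency is the attention weight each memory token
    receives, averaged over all queries (an assumed scoring rule).
    """
    d = queries.shape[-1]
    attn = softmax(queries @ mem_keys.T / np.sqrt(d), axis=-1)  # (Q, M)
    saliency = attn.mean(axis=0)                                # (M,)
    k = max(1, int(np.ceil(keep_ratio * mem_keys.shape[0])))
    top = np.argpartition(saliency, -k)[-k:]                    # k largest, unordered
    return np.sort(top)

def sparse_memory_attention(queries, mem_keys, mem_values, keep_idx):
    """Attend only over the retained memory tokens."""
    d = queries.shape[-1]
    attn = softmax(queries @ mem_keys[keep_idx].T / np.sqrt(d), axis=-1)
    return attn @ mem_values[keep_idx]

# Toy usage with random data: 16 query tokens, 200 memory tokens, dim 32.
rng = np.random.default_rng(0)
q = rng.normal(size=(16, 32))
k = rng.normal(size=(200, 32))
v = rng.normal(size=(200, 32))
idx = select_salient_memory(q, k, keep_ratio=0.25)
out = sparse_memory_attention(q, k, v, idx)
```

In this sketch the scoring step still computes full attention once, so any saving would come only from reusing `idx` for subsequent queries; whether and how the paper reuses salient tokens across frames is not shown in the truncated abstract above.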
Peer Reviews
Decision: ICLR 2026 Poster
1. The paper makes a solid technical contribution by streamlining the SAM2 model. The object-aware pruning of the image encoder and the introduction of a background shortcut for non-foreground patches are both clever ideas that substantially reduce computation. 2. The ablation study is comprehensive. It not only analyzes the proposed components in isolation but also integrates other efficient methods (e.g., ToME) into their framework for comparison, which provides valuable insights.
1. The proposed routing mechanism heavily relies on the assumption of temporal consistency in video streams, meaning that no significant camera shaking or viewpoint shift occurs. This limits the method’s applicability in real-world scenarios with dynamic motion. It would be interesting to see comparisons with SAM2 on datasets such as MOSEv2[1] and SeCVOS[2], which feature frequent viewpoint transitions. 2. Performing grid search on only one benchmark is not sufficient to demonstrate robustne
1. The motivation for reducing SAM2's computational overhead is well-grounded and intuitive for video object segmentation. 2. The post-training approach is practical, enabling efficient adaptation by leveraging the generalized parameters of the pre-trained SAM2. 3. The method achieves a good speed-performance trade-off, delivering a speedup of nearly 2x while incurring only a minimal and acceptable performance degradation of approximately 1%.
1. The SWR component is heavily dependent on the previous frame's prediction and salient mask. I think this may cause challenges in some cases, such as rapid motion, abrupt scene cuts, or severe occlusions, where this temporal assumption would be violated. 2. For a video domain paper, the qualitative results with static images are insufficient. Supplemental videos would be significantly stronger to properly demonstrate temporal consistency, failure modes (especially in scenarios mentioned in poi
1. Efficient-SAM2 avoids expensive full-model retraining. It adds negligible parameters and low training overhead, making it flexible for low-compute deployment. 2. By aligning with SAM2’s natural sparse perception, it eliminates redundancy without compromising core functionality, unlike generic token-merging methods (e.g., ToMe) that cause severe accuracy drops. 3. SWR (image encoder) and SMR (memory attention) are independent modules, allowing separate optimization or integration with
1. Claimed contribution 1 should be merged with contribution 2 into a single contribution. 2. SWR relies on hyperparameters like the prediction confidence threshold (θₒᵦⱼ=0.5) and saliency threshold (τ=0.7), while SMR depends on the sparsity ratio (s=0.95). The paper does not explore how these parameters generalize to edge cases (e.g., highly cluttered scenes, fast-moving objects) or different datasets. 3. The variable symbols are confusing, especially the different uses of A. 4. The codes released as sup
Taxonomy
Topics: Visual Attention and Saliency Detection · Advanced Neural Network Applications · Image Enhancement Techniques
