Rethinking Memory Design in SAM-Based Visual Object Tracking
Mohamad Alansari, Muzammal Naseer, Hasan Al Marzouqi, Naoufel Werghi, and Sajid Javed

TL;DR
This paper systematically analyzes memory mechanisms in SAM-based visual object tracking and proposes a unified hybrid memory framework that improves robustness in challenging scenarios and transfers to next-generation models.
Contribution
It provides a comprehensive study of memory design principles in SAM-based tracking and introduces a modular hybrid memory framework that improves performance and robustness.
Findings
Memory mechanisms mainly differ in short-term frame selection.
Unified hybrid memory improves robustness under occlusion and distractors.
The framework transfers effectively to the SAM3 backbone.
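To make the first finding concrete, the sketch below treats short-term frame selection as a pluggable policy over the tracking history. It is an illustration only, not the authors' framework: the names MemoryEntry, select_recent, make_confident_selector, and build_memory are hypothetical, the per-frame score is an assumed confidence value, and always keeping the first frame as an anchor is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class MemoryEntry:
    frame_idx: int
    score: float  # hypothetical per-frame mask confidence in [0, 1]


# A selector maps (history, k) to at most k short-term memory frames.
Selector = Callable[[List[MemoryEntry], int], List[MemoryEntry]]


def select_recent(history: List[MemoryEntry], k: int) -> List[MemoryEntry]:
    """Keep the k most recent frames (a FIFO-style short-term memory)."""
    return history[-k:] if k > 0 else []


def make_confident_selector(threshold: float) -> Selector:
    """Build a selector that keeps the k most recent frames whose
    (assumed) confidence score is at least `threshold`."""

    def select(history: List[MemoryEntry], k: int) -> List[MemoryEntry]:
        if k <= 0:
            return []
        confident = [e for e in history if e.score >= threshold]
        return confident[-k:]

    return select


def build_memory(history: List[MemoryEntry], selector: Selector, k: int) -> List[MemoryEntry]:
    """Assumption for this sketch: the first frame is always kept as an
    anchor, and the selector fills the remaining short-term slots."""
    if not history:
        return []
    anchor, rest = history[0], history[1:]
    return [anchor] + selector(rest, k)


# Toy history: frames 2, 4, and 5 have low confidence (e.g. occlusion).
history = [MemoryEntry(i, s) for i, s in enumerate([1.0, 0.9, 0.2, 0.8, 0.1, 0.3])]

recent_ids = [e.frame_idx for e in build_memory(history, select_recent, 2)]
confident_ids = [e.frame_idx for e in build_memory(history, make_confident_selector(0.5), 2)]

print(recent_ids)     # [0, 4, 5]
print(confident_ids)  # [0, 1, 3]
```

With the toy history, the recency policy fills memory with the two low-confidence frames, while the thresholded policy falls back to earlier, more reliable ones. Swapping the selector is the only change between the two runs, which is the sense in which trackers could differ mainly in short-term frame selection.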
Abstract
Memory has become the central mechanism enabling robust visual object tracking in modern segmentation-based frameworks. Recent methods built upon Segment Anything Model 2 (SAM2) have demonstrated strong performance by refining how past observations are stored and reused. However, existing approaches address memory limitations in a method-specific manner, leaving the broader design principles of memory in SAM-based tracking poorly understood. Moreover, it remains unclear how these memory mechanisms transfer to stronger, next-generation foundation models such as Segment Anything Model 3 (SAM3). In this work, we present a systematic memory-centric study of SAM-based visual object tracking. We first analyze representative SAM2-based trackers and show that most methods primarily differ in how short-term memory frames are selected, while sharing a common object-centric…
Taxonomy
Topics: Video Surveillance and Tracking Methods · Gaze Tracking and Assistive Technology · Human Pose and Action Recognition
