TL;DR
SAMITE enhances visual object tracking by integrating a memory bank and positional prompts into SAM2, effectively handling occlusions and distractions, and improving accuracy and robustness across multiple benchmarks.
Contribution
The paper introduces SAMITE, a novel VOT model that incorporates a prototypical memory bank and positional prompt generator to improve tracking accuracy and error interception.
Findings
Outperforms existing methods on six benchmarks.
Effectively handles occlusions and distractors.
Reduces error propagation in tracking.
Abstract
Visual Object Tracking (VOT) is widely used in applications like autonomous driving to continuously track targets in videos. Existing methods can be roughly categorized into template matching and autoregressive methods, where the former usually neglects the temporal dependencies across frames and the latter tends to get biased towards the object categories during training, showing weak generalizability to unseen classes. To address these issues, some methods propose to adapt the video foundation model SAM2 for VOT, where the tracking results of each frame would be encoded as memory for conditioning the rest of frames in an autoregressive manner. Nevertheless, existing methods fail to overcome the challenges of object occlusions and distractions, and do not have any measures to intercept the propagation of tracking errors. To tackle them, we present a SAMITE model, built upon SAM2 with…
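The failure mode the abstract describes, where each frame's result is written to memory and then conditions later frames, can be illustrated with a toy sketch. This is not SAMITE's prototypical memory bank or SAM2's actual memory encoder: the scalar "features", the `run_memory` helper, and the `min_conf` confidence gate are all hypothetical, chosen only to show how one bad frame stays in a FIFO memory unless something intercepts it.

```python
from collections import deque


def run_memory(frames, capacity=3, min_conf=None):
    """Return the memory contents after processing (feature, confidence) frames.

    min_conf=None writes every frame to memory (plain autoregressive update).
    Otherwise, frames whose confidence is below min_conf are skipped.
    This gate is an illustrative assumption, not the paper's mechanism.
    """
    memory = deque(maxlen=capacity)  # FIFO: the oldest entry is evicted first
    for feature, conf in frames:
        if min_conf is None or conf >= min_conf:
            memory.append(feature)
    return list(memory)


# Hypothetical scalar "features": the target looks like ~1.0, a distractor like 5.0.
# The third frame is a low-confidence result caused by a distractor.
frames = [(1.0, 0.9), (1.1, 0.9), (5.0, 0.2), (0.9, 0.8)]

ungated = run_memory(frames)              # distractor frame is stored and reused
gated = run_memory(frames, min_conf=0.5)  # distractor frame never enters memory

print("ungated memory:", ungated)
print("gated memory:  ", gated)
```

In the ungated run the distractor feature remains in memory and would condition the following frames until it is evicted; in the gated run it never enters memory.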
Peer Reviews
Decision: Submitted to ICLR 2026
1. The paper pinpoints a real challenge in SAM2-style, memory-based VOT: once a bad frame is written to memory (due to occlusion or a look-alike distractor), the error keeps getting reused. 2. The paper proposes two modules, the Prototypical Memory Bank (PMB) and the Positional Prompt Generator (PPG). Both are conceptually simple and slot on top of SAM2 without retraining to tackle error propagation. 3. The authors report results on six standard VOT benchmarks (LaSOT, L
1. The authors themselves say inference is slower than SAM2.1 because PMB and PPG add work. For a tracker, that is serious. With a base backbone, SAMITE runs at 9.2 FPS and reaches 78.9 AO on GOT-10k, which is 0.7% below SAMURAI, and SAMURAI also runs faster at 14.3 FPS. 2. SAMURAI is not shown in Table 2, which probably indicates that its performance is better than SAMITE's. If so, SAMURAI outperforms SAMITE on 4 of the 6 datasets. As shown in the paper, all ablation studies are done on LaSOT a
1. Clear motivation, with visual examples, addressing the occlusion and distraction failure modes. 2. Achieves state-of-the-art performance across six standard VOT benchmarks on several metrics. 3. A good ablation study (Table 3) shows the additive contribution of the proposed modules (PMB, PPG, CCC).
1. Limited novelty: The proposed components, such as the prototypical memory bank and prior-guided mask generation, are largely extensions or combinations of existing ideas from prior works (e.g., SAM2-based memory selection, AENet prior masks), offering only incremental technical contributions. 2. Missing runtime analysis: The paper does not report runtime or computational overhead compared to SAM2 or related trackers, leaving efficiency and practicality unclear. 3. Missing qualitative result
- The paper is well written and easy to follow; Figure 2 in particular illustrates the framework very clearly. - The idea of decoupling foreground and background features makes sense to me. - Performance is good compared to existing SAM2 variants across multiple benchmarks.
- The method should be applicable to the VOS task, not just VOT. The work should include more results on VOS benchmarks to demonstrate its generalizability. - The zero-shot claim for SAMITE is not very convincing. Since SAM2 itself is trained on a large-scale VOS dataset, adapting it to the simpler VOT task should not be considered zero-shot. - The proposed method appears to maintain a better memory bank due to the designed memory selection strategy. However, I believe there is still a signifi
