TL;DR
SAMOSA enhances foundation model-based visual object tracking by explicitly modeling motion, geometry, and semantics, leading to improved robustness and generalization in complex nonlinear scenarios.
Contribution
It introduces a novel framework that adapts SAM 2 for tracking by incorporating motion prediction, semantic cues, and geometric constraints.
Findings
Outperforms state-of-the-art SAM 2-based methods on benchmarks.
Demonstrates stronger generalization than supervised methods.
Achieves significant gains on anti-UAV datasets.
Abstract
Traditional visual object tracking (VOT) methods typically rely on task-specific supervised training, limiting their generalization to unseen objects and challenging scenarios with distractors, occlusion, and nonlinear motion. Recent vision foundation models, exemplified by SAM 2, learn strong video understanding priors from large-scale pretraining and offer a promising foundation for building more robust and generalizable trackers. However, directly applying SAM 2 to VOT remains suboptimal, as it does not explicitly model target motion dynamics or enforce geometric and semantic consistency across frames, both of which are essential for reliable tracking. To address this issue, we propose SAMOSA, a new tracking framework that adapts SAM 2 to complex VOT scenarios by explicitly leveraging motion, geometry, and semantic cues. Specifically, we introduce a lightweight nonlinear motion…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
