Boosting Self-Supervised Tracking with Contextual Prompts and Noise Learning
Yaozong Zheng, Qihua Liang, Bineng Zhong, Shuimu Zeng, Yuanliang Xue, Ning Li, Shuxiang Song

TL;DR
This paper introduces a self-supervised tracking framework that uses dual-modal context association, combining fine-grained semantic prompts with contextual noise, to learn robust representations from unlabeled videos.
Contribution
It proposes a novel dual-modal context association mechanism with a two-stage training process, enhancing self-supervised tracking by leveraging semantic prompts and noise.
Findings
Outperforms existing self-supervised tracking methods in experiments.
Effectively learns robust tracking representations from unlabeled videos.
Maintains efficient inference by applying context association only during training.
Abstract
Learning robust contextual knowledge from unlabeled videos is essential for advancing self-supervised tracking. However, conventional self-supervised trackers lack effective context modeling, while existing context association methods based on non-semantic queries struggle to adapt to unlabeled tracking scenarios, making it difficult to learn reliable contextual cues. In this work, we propose a novel self-supervised tracking framework, named \tracker, which introduces a dual-modal context association mechanism that jointly leverages fine-grained semantic prompts and contextual noise to drive the model toward learning robust tracking representations. Adhering to the easy-to-hard learning principle, our contextual association mechanism operates in two stages. During early training, instance patch tokens (prompts) are assigned to both forward and backward tracking branches…
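The staged behaviour described above can be illustrated with a small sketch. It is not the authors' implementation: the helper name `context_for_step`, the step-based switch `warmup_steps`, and the Gaussian noise model are all hypothetical. The abstract is cut off before the second stage is described, so the late-stage branch below (perturbing the tokens with noise) is an assumption based only on the summary's mention of contextual noise. Tokens are plain floats here to keep the example self-contained.

```python
import random


def context_for_step(step, warmup_steps, prompt_tokens, training,
                     noise_std=0.1, rng=None):
    """Pick the context given to the forward and backward tracking branches.

    Hypothetical helper: the names, the step threshold, and the noise model
    are illustrative assumptions, not details confirmed by the paper.
    """
    if not training:
        # Context association is applied only during training, so inference
        # runs without any extra context.
        return {"forward": None, "backward": None}

    if step < warmup_steps:
        # Early (easy) stage: instance patch tokens serve as prompts for
        # both the forward and the backward branch.
        return {"forward": list(prompt_tokens),
                "backward": list(prompt_tokens)}

    # Later (hard) stage, ASSUMED: each branch sees noise-perturbed tokens.
    rng = rng or random.Random(0)

    def noisy():
        return [t + rng.gauss(0.0, noise_std) for t in prompt_tokens]

    return {"forward": noisy(), "backward": noisy()}
```

Calling `context_for_step(0, 10, tokens, training=True)` returns the unmodified tokens for both branches, while `training=False` returns no context at all, which mirrors the claim that inference cost is unaffected.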
