Decoupled Spatio-Temporal Consistency Learning for Self-Supervised Tracking
Yaozong Zheng, Bineng Zhong, Qihua Liang, Ning Li, Shuxiang Song

TL;DR
This paper introduces a self-supervised tracking framework that learns from unlabeled videos by decoupling spatial and temporal consistency, significantly reducing the need for manual annotations while achieving state-of-the-art results among self-supervised trackers.
Contribution
The paper proposes a novel decoupled spatio-temporal consistency training framework and an instance contrastive loss for self-supervised visual tracking, eliminating the need for box annotations.
Findings
Outperforms state-of-the-art self-supervised tracking methods.
Achieves over 25% improvement in AUC on GOT10K.
Demonstrates strong generalization across nine benchmark datasets.
Abstract
The success of visual tracking has been largely driven by datasets with manual box annotations. However, these box annotations require tremendous human effort, limiting the scale and diversity of existing tracking datasets. In this work, we present a novel self-supervised tracking framework designed to eliminate the need for box annotations. Specifically, a decoupled spatio-temporal consistency training framework is proposed to learn rich target information across timestamps through global spatial localization and local temporal association. This allows for the simulation of appearance and motion variations of instances in real-world scenarios. Furthermore, an instance contrastive loss is designed to learn instance-level correspondences from a multi-view perspective, offering robust instance supervision without additional labels. This new design paradigm…
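The abstract does not give the form of the instance contrastive loss. As a rough illustration of the general idea only, the sketch below uses a standard InfoNCE-style objective over two views of the same instances, which is an assumption and not the authors' formulation. The function name `info_nce_loss`, the `temperature` value, and the use of NumPy feature arrays are all hypothetical choices made for the example.

```python
import numpy as np


def info_nce_loss(view_a, view_b, temperature=0.1):
    """InfoNCE-style contrastive loss between two views of N instances.

    Assumption: row i of view_a and row i of view_b are embeddings of the
    same instance (the positive pair); every other row is a negative.
    Both arrays have shape (N, D).
    """
    # L2-normalise so the dot product is a cosine similarity.
    a = view_a / np.linalg.norm(view_a, axis=1, keepdims=True)
    b = view_b / np.linalg.norm(view_b, axis=1, keepdims=True)

    # Pairwise similarities between every instance in view A and view B.
    logits = (a @ b.T) / temperature  # shape (N, N)

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Positives sit on the diagonal; average their negative log-likelihood.
    return float(-np.mean(np.diag(log_prob)))
```

With this kind of objective, embeddings of the same instance seen from two views are pulled together while embeddings of different instances are pushed apart, so correctly paired views should score a lower loss than mismatched ones.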
