Decoupled Spatio-Temporal Consistency Learning for Self-Supervised Tracking

Yaozong Zheng; Bineng Zhong; Qihua Liang; Ning Li; Shuxiang Song

arXiv:2507.21606 · cs.CV · July 30, 2025

TL;DR

This paper introduces a self-supervised tracking framework that learns from unlabeled videos by decoupling spatial and temporal consistency, removing the need for manual box annotations while outperforming prior self-supervised trackers.

Contribution

The paper proposes a novel decoupled spatio-temporal consistency training framework and an instance contrastive loss for self-supervised visual tracking, eliminating the need for box annotations.

Findings

01 Outperforms state-of-the-art self-supervised tracking methods.

02 Achieves over 25% improvement in AUC on GOT10K.

03 Demonstrates strong generalization across nine benchmark datasets.

Abstract

The success of visual tracking has been largely driven by datasets with manual box annotations. However, these box annotations require tremendous human effort, limiting the scale and diversity of existing tracking datasets. In this work, we present a novel Self-Supervised Tracking framework named \textbf{{\tracker}}, designed to eliminate the need for box annotations. Specifically, a decoupled spatio-temporal consistency training framework is proposed to learn rich target information across timestamps through global spatial localization and local temporal association. This allows for the simulation of appearance and motion variations of instances in real-world scenarios. Furthermore, an instance contrastive loss is designed to learn instance-level correspondences from a multi-view perspective, offering robust instance supervision without additional labels. This new design paradigm…
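
The abstract does not give the form of the instance contrastive loss. As a rough illustration of the general idea only, the sketch below uses a generic symmetric InfoNCE objective as a stand-in: embeddings of the same instance from two views are pulled together, and embeddings of different instances are pushed apart. The function name, the `temperature` value, and the `(N, D)` embedding layout are assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def instance_contrastive_loss(view_a, view_b, temperature=0.1):
    """Generic symmetric InfoNCE loss (illustrative stand-in, not the paper's loss).

    view_a, view_b: arrays of shape (N, D). Row i of each array is assumed to be
    an embedding of the same instance observed in two different views.
    """
    # L2-normalise rows so that dot products are cosine similarities.
    a = view_a / np.linalg.norm(view_a, axis=1, keepdims=True)
    b = view_b / np.linalg.norm(view_b, axis=1, keepdims=True)

    # (N, N) similarity matrix; the diagonal holds the positive pairs.
    logits = a @ b.T / temperature

    def cross_entropy_on_diagonal(z):
        z = z - z.max(axis=1, keepdims=True)  # subtract row max for stability
        log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the a-to-b and b-to-a directions.
    return 0.5 * (
        cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(logits.T)
    )
```

Under these assumptions, the loss is small when row i of one view is most similar to row i of the other view, and large when the correspondence is scrambled. No box labels enter the computation; the only supervision is which rows belong to the same instance.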
