Spatial-then-Temporal Self-Supervised Learning for Video Correspondence
Rui Li, Dong Liu

TL;DR
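This paper introduces a two-step self-supervised learning approach that first learns spatial features from unlabeled images and then refines them with temporal cues from unlabeled videos, improving video correspondence representations and outperforming existing self-supervised methods.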
Contribution
The paper proposes a spatial-then-temporal self-supervised learning framework with correlation distillation losses, improving performance on video correspondence tasks.
Findings
Outperforms state-of-the-art self-supervised methods
Effective in various correspondence-based video analysis tasks
Ablation studies confirm the benefits of the two-step design
Abstract
In low-level video analyses, effective representations are important to derive the correspondences between video frames. In some recent studies, these representations have been learned in a self-supervised fashion from unlabeled images or videos, using carefully designed pretext tasks. However, previous work concentrates on either spatial-discriminative features or temporal-repetitive features, with little attention to the synergy between spatial and temporal cues. To address this issue, we propose a spatial-then-temporal self-supervised learning method. Specifically, we first extract spatial features from unlabeled images via contrastive learning, and then enhance the features by exploiting the temporal cues in unlabeled videos via reconstructive learning. In the second step, we design a global correlation distillation loss to ensure the learning not to forget the spatial…
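The second step described in the abstract (reconstructive learning constrained by a correlation distillation loss) can be illustrated with a rough sketch. This is not the authors' implementation: the abstract is truncated here and does not give the exact loss definitions, so every name below (`correlation`, `reconstruct`, `reconstruction_loss`, `distillation_loss`), the temperature of 0.07, the squared-error reconstruction term, and the KL form of the distillation term are assumptions made for illustration. Features are plain lists of per-location vectors, and the "teacher" stands in for the frozen step-one (spatial) model.

```python
import math

def l2_normalize(v):
    # Scale a feature vector to unit length (guard against zero vectors).
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def softmax(row):
    # Numerically stable softmax over one row of similarities.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def correlation(feat_tgt, feat_ref, temperature=0.07):
    # Row-stochastic correlation map: for each target location, a
    # distribution over reference locations (cosine similarity + softmax).
    # The temperature value is an assumption, not taken from the paper.
    tgt = [l2_normalize(f) for f in feat_tgt]
    ref = [l2_normalize(f) for f in feat_ref]
    return [
        softmax([sum(a * b for a, b in zip(t, r)) / temperature for r in ref])
        for t in tgt
    ]

def reconstruct(corr, values_ref):
    # Rebuild target values as correlation-weighted sums of reference values.
    dim = len(values_ref[0])
    return [
        [sum(w * v[d] for w, v in zip(row, values_ref)) for d in range(dim)]
        for row in corr
    ]

def reconstruction_loss(corr, values_tgt, values_ref):
    # Mean squared error between reconstructed and true target values
    # (assumed form of the reconstructive objective).
    pred = reconstruct(corr, values_ref)
    errs = [(p - t) ** 2 for pr, tr in zip(pred, values_tgt) for p, t in zip(pr, tr)]
    return sum(errs) / len(errs)

def distillation_loss(corr_student, corr_teacher, eps=1e-8):
    # KL(teacher || student), averaged over target locations.
    # Assumed form: it penalizes the student's correlation map for drifting
    # away from the map produced by the frozen spatial (step-one) features.
    total = 0.0
    for s_row, t_row in zip(corr_student, corr_teacher):
        total += sum(
            t * (math.log(t + eps) - math.log(s + eps))
            for s, t in zip(s_row, t_row)
        )
    return total / len(corr_teacher)
```

Under these assumptions, a training step would compute `correlation` twice for the same frame pair (once with the features being trained, once with the frozen step-one features) and minimize a weighted sum of `reconstruction_loss` and `distillation_loss`. The weighting between the two terms is not stated in the text above and would have to be chosen separately.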
Taxonomy
Topics: Advanced Data Compression Techniques · Video Surveillance and Tracking Methods · Image Retrieval and Classification Techniques
Methods: Contrastive Learning
