PreViTS: Contrastive Pretraining with Video Tracking Supervision
Brian Chen, Ramprasaath R. Selvaraju, Shih-Fu Chang, Juan Carlos Niebles, and Nikhil Naik

TL;DR
PreViTS is a self-supervised video representation learning method that uses an unsupervised tracking signal to make better use of temporal object transformations and to localize objects spatially, leading to state-of-the-art action classification results.
Contribution
The paper proposes an SSL framework that uses unsupervised tracking signals to select clips containing the same object and to guide the model's spatial focus, improving video representation learning.
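As a rough illustration of tracking-guided clip selection, the sketch below samples two clip start indices from a video only where a tracker reports the object as present. This is not the paper's procedure, which this summary does not specify. The per-frame boolean input `object_present`, the `min_coverage` threshold, and both function names are assumptions made for the example.

```python
import random


def valid_clip_starts(object_present, clip_len, min_coverage=0.5):
    """Return start indices of clips in which the tracked object is
    present in at least `min_coverage` of the frames.

    `object_present` is a hypothetical per-frame list of booleans that
    an unsupervised tracker would produce.
    """
    starts = []
    for s in range(len(object_present) - clip_len + 1):
        window = object_present[s:s + clip_len]
        if sum(window) / clip_len >= min_coverage:
            starts.append(s)
    return starts


def sample_positive_pair(object_present, clip_len, min_coverage=0.5, rng=None):
    """Sample two distinct clip starts that both contain the tracked
    object, or return None if fewer than two such clips exist."""
    rng = rng or random.Random()
    starts = valid_clip_starts(object_present, clip_len, min_coverage)
    if len(starts) < 2:
        return None
    return tuple(rng.sample(starts, 2))
```

Random sampling over the whole video corresponds to `min_coverage=0`, which is the baseline the abstract describes as giving an imperfect supervisory signal.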
Findings
Outperforms contrastive learning alone on downstream tasks.
Achieves state-of-the-art action classification accuracy.
Produces features that are more robust to background and context changes.
Abstract
Videos are a rich source for self-supervised learning (SSL) of visual representations due to the presence of natural temporal transformations of objects. However, current methods typically randomly sample video clips for learning, which results in an imperfect supervisory signal. In this work, we propose PreViTS, an SSL framework that utilizes an unsupervised tracking signal for selecting clips containing the same object, which helps better utilize temporal transformations of objects. PreViTS further uses the tracking signal to spatially constrain the frame regions to learn from and trains the model to locate meaningful objects by providing supervision on Grad-CAM attention maps. To evaluate our approach, we train a momentum contrastive (MoCo) encoder on VGG-Sound and Kinetics-400 datasets with PreViTS. Training with PreViTS outperforms representations learnt by contrastive strategy…
Videos
PreViTS: Contrastive Pretraining with Video Tracking Supervision · YouTube
Taxonomy
Topics: Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications
Methods: InfoNCE · Batch Normalization · Momentum Contrast
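
For readers unfamiliar with the InfoNCE objective listed under Methods, here is a minimal pure-Python sketch of the loss for a single query. It follows the standard formulation used with momentum-contrast encoders: one positive key, a set of negative keys, cosine similarity, and a temperature. The function name, the default temperature of 0.07, and the assumption that all vectors are non-zero are choices made for this example, not details taken from the paper.

```python
import math


def _normalize(v):
    # L2-normalize a vector (assumes v is non-zero).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce_loss(query, positive_key, negative_keys, temperature=0.07):
    """InfoNCE loss for one query: cross-entropy over similarity
    logits, where the positive key is the correct class (index 0)."""
    q = _normalize(query)
    logits = [_dot(q, _normalize(positive_key)) / temperature]
    logits += [_dot(q, _normalize(k)) / temperature for k in negative_keys]
    # Numerically stable log-sum-exp over all logits.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[0]
```

The loss is small when the query is closer to its positive key than to any negative, and grows as a negative becomes more similar to the query than the positive is.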
