PreViTS: Contrastive Pretraining with Video Tracking Supervision

Brian Chen, Ramprasaath R. Selvaraju, Shih-Fu Chang, Juan Carlos Niebles, and Nikhil Naik

arXiv:2112.00804·cs.CV·September 30, 2022

TL;DR

PreViTS uses an unsupervised tracking signal to guide self-supervised video representation learning. It makes better use of temporal object transformations and improves spatial object localization, leading to state-of-the-art action classification results.

Contribution

The paper proposes an SSL framework that uses unsupervised tracking signals to select clips containing the same object and to guide spatial focus, improving video representation learning.
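
The clip-selection idea can be illustrated with a minimal sketch. It assumes a hypothetical tracker has already produced one binary mask per frame. The function names, the nested-list mask format, and the rule that the object must be visible in every frame of a clip are illustrative assumptions, not the authors' implementation.

```python
import random


def frames_with_object(masks):
    """Return indices of frames whose tracking mask marks at least one pixel."""
    return [i for i, mask in enumerate(masks) if any(any(row) for row in mask)]


def valid_clip_starts(masks, clip_len):
    """Return start indices of clips in which the object is visible in every frame."""
    present = set(frames_with_object(masks))
    return [
        start
        for start in range(len(masks) - clip_len + 1)
        if all(start + k in present for k in range(clip_len))
    ]


def sample_clip_pair(masks, clip_len, rng=None):
    """Sample two distinct clip start indices that both contain the tracked object.

    masks: list of per-frame binary masks (nested lists), assumed to come
           from an unsupervised tracker (hypothetical input format).
    Returns None when fewer than two valid clips exist.
    """
    rng = rng or random.Random()
    starts = valid_clip_starts(masks, clip_len)
    if len(starts) < 2:
        return None
    return tuple(rng.sample(starts, 2))


# Toy video: six 2x2 frames; the object is missing from frame 2 only.
empty = [[0, 0], [0, 0]]
seen = [[0, 1], [0, 0]]
toy_masks = [seen, seen, empty, seen, seen, seen]
pair = sample_clip_pair(toy_masks, clip_len=2, rng=random.Random(0))
```

By contrast, purely random sampling could return a clip that covers frame 2, where the object is absent. That would pair clips that do not show the same object, which is the imperfect supervisory signal the abstract describes.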

Findings

01 Outperforms contrastive learning alone on downstream tasks.

02 Achieves state-of-the-art action classification accuracy.

03 Produces features that are more robust to background and context changes.

Abstract

Videos are a rich source for self-supervised learning (SSL) of visual representations due to the presence of natural temporal transformations of objects. However, current methods typically randomly sample video clips for learning, which results in an imperfect supervisory signal. In this work, we propose PreViTS, an SSL framework that utilizes an unsupervised tracking signal for selecting clips containing the same object, which helps better utilize temporal transformations of objects. PreViTS further uses the tracking signal to spatially constrain the frame regions to learn from and trains the model to locate meaningful objects by providing supervision on Grad-CAM attention maps. To evaluate our approach, we train a momentum contrastive (MoCo) encoder on VGG-Sound and Kinetics-400 datasets with PreViTS. Training with PreViTS outperforms representations learnt by contrastive strategy…
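
For readers unfamiliar with the contrastive objective behind a MoCo-style encoder, the following is a minimal pure-Python sketch of the InfoNCE loss for one query, one positive key, and a set of negative keys. The temperature value of 0.07 and the function names are assumptions chosen for illustration. The sketch omits the PreViTS-specific tracking and Grad-CAM supervision terms.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(query, positive, negatives, temperature=0.07):
    """InfoNCE loss: cross-entropy of picking the positive key among all keys.

    query, positive: embedding vectors (lists of floats).
    negatives: list of embedding vectors, e.g. drawn from a queue of past keys.
    temperature: softmax temperature (0.07 is an assumed default).
    """
    q = l2_normalize(query)
    keys = [l2_normalize(positive)] + [l2_normalize(n) for n in negatives]
    # Cosine similarity between the query and each key, scaled by temperature.
    logits = [sum(a * b for a, b in zip(q, k)) / temperature for k in keys]
    # Numerically stable log-sum-exp over all logits.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    # Negative log-probability assigned to the positive key (index 0).
    return log_denominator - logits[0]


# A query aligned with its positive key yields a much smaller loss
# than a query aligned with one of the negatives.
aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]])
```

In this toy setup the loss is small when the query and positive key point the same way and large when a negative key matches the query better, which is the behavior the contrastive encoder is trained to exploit.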

Taxonomy

Topics: Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications

Methods: InfoNCE · Batch Normalization · Momentum Contrast