TL;DR
This paper introduces a self-supervised learning framework that exploits the hierarchy of invariances in videos (frame level, shot/clip level, and video level) to learn transferable visual representations, achieving state-of-the-art self-supervised transfer results on the 19 VTAB tasks with only 1000 labels per task.
Contribution
The paper presents a holistic self-supervised loss that combines frame-level, shot/clip-level, and video-level invariances, improving transfer learning performance when little labeled data is available.
Findings
State-of-the-art self-supervised transfer learning results on the 19 tasks of VTAB with only 1000 labels per task.
When co-trained with labeled images, outperforms an ImageNet-pretrained ResNet-50 with 10x fewer labeled images.
When co-trained on the full ImageNet data set, surpasses the previous best supervised model.
Abstract
We propose a general framework for self-supervised learning of transferable visual representations based on Video-Induced Visual Invariances (VIVI). We consider the implicit hierarchy present in the videos and make use of (i) frame-level invariances (e.g. stability to color and contrast perturbations), (ii) shot/clip-level invariances (e.g. robustness to changes in object orientation and lighting conditions), and (iii) video-level invariances (semantic relationships of scenes across shots/clips), to define a holistic self-supervised loss. Training models using different variants of the proposed framework on videos from the YouTube-8M (YT8M) data set, we obtain state-of-the-art self-supervised transfer learning results on the 19 diverse downstream tasks of the Visual Task Adaptation Benchmark (VTAB), using only 1000 labels per task. We then show how to co-train our models jointly with…
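To make the three levels in the abstract concrete, here is a minimal NumPy sketch of how a hierarchical invariance loss could be assembled. It is an illustration only, not the loss used in the paper: the contrastive (InfoNCE-style) objective at every level, the even/odd pooling scheme, and the names `info_nce`, `hierarchical_loss`, and `weights` are all assumptions made for this example.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row vector to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE-style loss: row i of `anchors` should match row i of `positives`.

    All other rows of `positives` act as negatives for anchor i.
    """
    a = l2_normalize(anchors)
    p = l2_normalize(positives)
    logits = a @ p.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


def hierarchical_loss(view_a, view_b, weights=(1.0, 1.0, 1.0)):
    """Hypothetical three-level loss (an assumption, not the paper's formulation).

    view_a, view_b: embeddings of the same frames under two augmentations,
    shape (videos, shots, frames, dim), with at least 2 shots and 2 frames.
    """
    dim = view_a.shape[-1]

    # (i) Frame level: two augmentations of one frame should agree.
    frame = info_nce(view_a.reshape(-1, dim), view_b.reshape(-1, dim))

    # (ii) Shot level: pooled even frames vs. pooled odd frames of one shot.
    shot_even = view_a[:, :, 0::2].mean(axis=2).reshape(-1, dim)
    shot_odd = view_a[:, :, 1::2].mean(axis=2).reshape(-1, dim)
    shot = info_nce(shot_even, shot_odd)

    # (iii) Video level: pooled even shots vs. pooled odd shots of one video.
    video_even = view_a[:, 0::2].mean(axis=(1, 2))
    video_odd = view_a[:, 1::2].mean(axis=(1, 2))
    video = info_nce(video_even, video_odd)

    w_frame, w_shot, w_video = weights
    return w_frame * frame + w_shot * shot + w_video * video


rng = np.random.default_rng(0)
emb_a = rng.normal(size=(3, 2, 4, 8))  # 3 videos, 2 shots, 4 frames, dim 8
emb_b = emb_a + 0.05 * rng.normal(size=emb_a.shape)  # stand-in for a second augmentation
total = hierarchical_loss(emb_a, emb_b)
```

In this toy version every level reuses the same contrastive objective and differs only in what is pooled before comparison. One known simplification is that frames from the same shot act as negatives at the frame level, which a real implementation would likely handle differently.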
Videos
Self-Supervised Learning of Video-Induced Visual Invariances (YouTube)
