Self-Supervised Learning of Video-Induced Visual Invariances

Michael Tschannen; Josip Djolonga; Marvin Ritter; Aravindh Mahendran; Xiaohua Zhai; Neil Houlsby; Sylvain Gelly; Mario Lucic

arXiv:1912.02783 · cs.CV · April 3, 2020


TL;DR

This paper introduces a self-supervised learning framework that exploits the hierarchy of invariances in video (frame, shot/clip, and video level) to learn transferable visual representations, achieving state-of-the-art self-supervised transfer results on the 19 VTAB tasks with only 1000 labels per task.

Contribution

The paper presents a holistic self-supervised loss that combines frame-level, shot/clip-level, and video-level invariances, improving transfer learning performance when few labels are available.

Findings

01

State-of-the-art self-supervised transfer learning results on the 19 VTAB tasks, using only 1000 labels per task.

02

When co-trained with labeled images, outperforms an ImageNet-pretrained ResNet-50 while using 10x fewer labeled images.

03

When co-trained with the full ImageNet data set, surpasses the previous best supervised model.

Abstract

We propose a general framework for self-supervised learning of transferable visual representations based on Video-Induced Visual Invariances (VIVI). We consider the implicit hierarchy present in the videos and make use of (i) frame-level invariances (e.g. stability to color and contrast perturbations), (ii) shot/clip-level invariances (e.g. robustness to changes in object orientation and lighting conditions), and (iii) video-level invariances (semantic relationships of scenes across shots/clips), to define a holistic self-supervised loss. Training models using different variants of the proposed framework on videos from the YouTube-8M (YT8M) data set, we obtain state-of-the-art self-supervised transfer learning results on the 19 diverse downstream tasks of the Visual Task Adaptation Benchmark (VTAB), using only 1000 labels per task. We then show how to co-train our models jointly with…
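To make the idea of a holistic loss over the three levels concrete, the following is a minimal sketch, not the authors' implementation. It assumes the loss is a weighted sum of one term per level, and it swaps in simple stand-ins for each term: cosine distance between two augmented views of a frame, the spread of frame embeddings around their shot mean, and an InfoNCE-style contrastive term between shots of the same video. All function names, the weights `w_frame`, `w_shot`, `w_video`, and the `temperature` value are hypothetical.

```python
import math


def l2_normalize(v):
    # Assumes v is a non-zero vector (a zero vector would divide by zero).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_distance(a, b):
    # 0 when a and b point the same way, 1 when they are orthogonal.
    a, b = l2_normalize(a), l2_normalize(b)
    return 1.0 - sum(x * y for x, y in zip(a, b))


def mean_pool(vectors):
    # Average a list of frame embeddings into one shot embedding.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def info_nce(anchor, positive, negatives, temperature=0.1):
    # Negative log-softmax score of the positive among all candidates.
    a = l2_normalize(anchor)
    candidates = [positive] + list(negatives)
    logits = [
        sum(x * y for x, y in zip(a, l2_normalize(c))) / temperature
        for c in candidates
    ]
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]


def holistic_loss(view_a, view_b, shot_frames, shot_same_video,
                  shots_other_videos, w_frame=1.0, w_shot=1.0, w_video=1.0):
    # (i) Frame level (stand-in): two augmented views of one frame should agree.
    frame_term = cosine_distance(view_a, view_b)
    # (ii) Shot level (stand-in): frames of one shot should stay near the shot mean.
    shot_emb = mean_pool(shot_frames)
    shot_term = sum(cosine_distance(f, shot_emb) for f in shot_frames) / len(shot_frames)
    # (iii) Video level (stand-in): a shot should match another shot from the
    # same video better than shots taken from other videos.
    video_term = info_nce(shot_emb, shot_same_video, shots_other_videos)
    return w_frame * frame_term + w_shot * shot_term + w_video * video_term
```

In this sketch the embeddings are plain lists of floats. In practice they would come from an encoder network, and the relative weights would be tuned rather than fixed at 1.0.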
