Labelling unlabelled videos from scratch with multi-modal self-supervision

Yuki M. Asano; Mandela Patrick; Christian Rupprecht; Andrea Vedaldi

arXiv:2006.13662 · cs.CV · March 2, 2021 · 71 cites

TL;DR

This paper introduces a multi-modal self-supervised clustering method that pseudo-labels unlabelled videos by exploiting audio-visual correspondence. The method produces semantic clusters without human annotations, and the paper reports benchmark results on standard datasets.

Contribution

The paper presents the first method for unsupervised video labelling based on multi-modal self-supervision and reports benchmark results for this task.

Findings

01 Clusters have high semantic overlap with ground truth labels

02 Unsupervised labelling does not emerge naturally from strong feature encoders

03 Benchmark results on Kinetics, Kinetics-Sound, VGG-Sound, and AVE datasets
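
As an illustration of what "semantic overlap with ground truth labels" can mean in practice, the sketch below computes two common clustering scores, purity and normalised mutual information (NMI), from a list of cluster indices and a list of reference labels. The choice of these two metrics and the function names `purity` and `normalized_mutual_info` are assumptions made for this example; this listing does not state which metrics the paper reports.

```python
import math
from collections import Counter


def purity(cluster_ids, true_labels):
    """Fraction of samples that carry the majority label of their cluster."""
    by_cluster = {}
    for c, y in zip(cluster_ids, true_labels):
        by_cluster.setdefault(c, Counter())[y] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(cluster_ids)


def entropy(labels):
    """Shannon entropy (natural log) of the empirical label distribution."""
    n = len(labels)
    return -sum((k / n) * math.log(k / n) for k in Counter(labels).values())


def normalized_mutual_info(cluster_ids, true_labels):
    """NMI with arithmetic-mean normalisation: 2 * I(C;Y) / (H(C) + H(Y))."""
    n = len(cluster_ids)
    joint = Counter(zip(cluster_ids, true_labels))
    pc = Counter(cluster_ids)
    py = Counter(true_labels)
    mi = sum(
        (k / n) * math.log((k * n) / (pc[c] * py[y]))
        for (c, y), k in joint.items()
    )
    denom = entropy(cluster_ids) + entropy(true_labels)
    # Both partitions trivial (one cluster, one label): treat as perfect agreement.
    return 2 * mi / denom if denom > 0 else 1.0


if __name__ == "__main__":
    clusters = [0, 0, 1, 1, 2, 2]           # hypothetical pseudo-labels
    labels = ["a", "a", "b", "b", "c", "a"]  # hypothetical ground truth
    print(purity(clusters, labels), normalized_mutual_info(clusters, labels))
```

Both scores are invariant to how the cluster indices are numbered, which matters here because pseudo-labels produced without supervision have no fixed correspondence to the ground-truth class names.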

Abstract

A large part of the current success of deep learning lies in the effectiveness of data -- more precisely: labelled data. Yet, labelling a dataset with human annotation continues to carry high costs, especially for videos. While in the image domain, recent methods have made it possible to generate meaningful (pseudo-) labels for unlabelled datasets without supervision, this development is missing for the video domain where learning feature representations is the current focus. In this work, we a) show that unsupervised labelling of a video dataset does not come for free from strong feature encoders and b) propose a novel clustering method that allows pseudo-labelling of a video dataset without any human annotations, by leveraging the natural correspondence between the audio and visual modalities. An extensive analysis shows that the resulting clusters have high semantic overlap with ground truth…
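
To make the idea of a single pseudo-label shared by the audio and visual streams concrete, here is a minimal sketch. It is not the paper's algorithm: it assumes each modality already yields a score per cluster for every video, and it simply sums the per-modality log-probabilities and takes the best cluster. The names `joint_pseudo_labels`, `visual_scores`, and `audio_scores` are hypothetical; the authors' actual procedure is described in the full paper and in the facebookresearch/selavi repository.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over one list of cluster scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def joint_pseudo_labels(visual_scores, audio_scores):
    """Assign each video one cluster index shared by both modalities.

    Assumption for this sketch: the two modalities are treated as
    independent views, so their log-probabilities are added before
    taking the argmax.
    """
    labels = []
    for v, a in zip(visual_scores, audio_scores):
        combined = [x + y for x, y in zip(log_softmax(v), log_softmax(a))]
        labels.append(max(range(len(combined)), key=combined.__getitem__))
    return labels


if __name__ == "__main__":
    # Two hypothetical videos, three clusters.
    visual = [[2.0, 0.0, 0.0], [0.0, 0.1, 0.0]]
    audio = [[1.0, 0.0, 0.0], [0.0, 0.0, 3.0]]
    print(joint_pseudo_labels(visual, audio))
```

In the second hypothetical video the visual stream is nearly undecided while the audio stream is confident, so the shared label follows the audio evidence. This is the kind of cross-modal agreement the abstract alludes to, shown here in its simplest form.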

Code & Models

Repositories

facebookresearch/selavi
PyTorch · Official

Videos

Labelling unlabelled videos from scratch with multi-modal self-supervision · SlidesLive

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis