Labelling unlabelled videos from scratch with multi-modal self-supervision
Yuki M. Asano, Mandela Patrick, Christian Rupprecht, Andrea Vedaldi

TL;DR
This paper introduces a multi-modal self-supervised clustering method that pseudo-labels unlabelled videos by leveraging audio-visual correspondence. It enables semantic clustering without human annotations and reports benchmark results on standard datasets.
Contribution
It presents the first method for unsupervised video labelling using multi-modal self-supervision and introduces benchmark results for this task.
Findings
Clusters have high semantic overlap with ground truth labels
Unsupervised labelling does not emerge naturally from strong feature encoders
Benchmark results on Kinetics, Kinetics-Sound, VGG-Sound, and AVE datasets
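The first finding concerns how well cluster assignments overlap with ground-truth labels. As a minimal sketch of how such overlap could be quantified, the snippet below computes normalized mutual information (NMI) between two labelings. NMI is chosen here as an assumption for illustration; this summary does not state which metrics the paper reports. The function names (`entropy`, `mutual_information`, `nmi`) and the toy labels are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in nats) of a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information (in nats) between two labelings of the same items."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log(c * n / (count_a[x] * count_b[y]))
    return mi


def nmi(clusters, truth):
    """NMI with arithmetic-mean normalization; cluster IDs need not match label IDs."""
    if len(clusters) != len(truth):
        raise ValueError("both labelings must cover the same items")
    denom = (entropy(clusters) + entropy(truth)) / 2
    if denom == 0.0:
        # Both labelings put everything in one group: treat as identical.
        return 1.0
    return mutual_information(clusters, truth) / denom


# Toy example (made-up data): cluster IDs are arbitrary, only the grouping matters.
pseudo_labels = [0, 0, 1, 1]
ground_truth = ["dog", "dog", "cat", "cat"]
score = nmi(pseudo_labels, ground_truth)
```

Because NMI is invariant to how cluster indices are named, it suits unsupervised labelling, where a cluster's ID carries no meaning on its own. A score near 1 indicates that the clusters closely match the ground-truth classes, and a score near 0 indicates that the two labelings are close to independent.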
Abstract
A large part of the current success of deep learning lies in the effectiveness of data -- more precisely: labelled data. Yet, labelling a dataset with human annotation continues to carry high costs, especially for videos. While in the image domain, recent methods have made it possible to generate meaningful (pseudo-)labels for unlabelled datasets without supervision, this development is missing for the video domain, where learning feature representations is the current focus. In this work, we a) show that unsupervised labelling of a video dataset does not come for free from strong feature encoders and b) propose a novel clustering method that allows pseudo-labelling of a video dataset without any human annotations, by leveraging the natural correspondence between the audio and visual modalities. An extensive analysis shows that the resulting clusters have high semantic overlap to ground truth…
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis
