Self-Supervised Learning of Audio-Visual Objects from Video

Triantafyllos Afouras; Andrew Owens; Joon Son Chung; Andrew Zisserman

arXiv:2008.04237·cs.CV·August 11, 2020

TL;DR

This paper introduces a self-supervised model that learns to localize and group audio-visual objects in video. The learned representation supports several speech-oriented tasks without labeled data or object detectors, and outperforms other self-supervised approaches.

Contribution

The paper presents a self-supervised approach that uses attention to localize and group sound sources in video and optical flow to aggregate information over time. It also applies to non-human speakers such as cartoons and puppets.

Findings

1. Outperforms other self-supervised methods in audio-visual tasks.
2. Achieves performance comparable to supervised face detection methods.
3. Successfully applied to non-human speakers like cartoons and puppets.

Abstract

Our objective is to transform a video into a set of discrete audio-visual objects using self-supervised learning. To this end, we introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time. We demonstrate the effectiveness of the audio-visual object embeddings that our model learns by using them for four downstream speech-oriented tasks: (a) multi-speaker sound source separation, (b) localizing and tracking speakers, (c) correcting misaligned audio-visual data, and (d) active speaker detection. Using our representation, these tasks can be solved entirely by training on unlabeled video, without the aid of object detectors. We also demonstrate the generality of our method by applying it to non-human speakers, including cartoons and puppets. Our model significantly outperforms other self-supervised approaches, and obtains…
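
To make the idea of attention-based localization concrete, here is a minimal sketch, not the authors' implementation. It assumes that a video frame has already been encoded as a grid of per-location visual embeddings and the audio as a single embedding of the same dimension, and it scores each location by cosine similarity with the audio. The function names (`attention_map`, `localize_peak`), the shapes, and the random stand-in embeddings are all hypothetical. The grouping of multiple sources and the optical-flow aggregation mentioned in the abstract are not modeled.

```python
import numpy as np

def attention_map(visual_feats, audio_feat):
    """Cosine similarity between one audio embedding and each spatial location.

    visual_feats: array of shape (H, W, D), hypothetical per-location embeddings.
    audio_feat:   array of shape (D,), hypothetical audio embedding.
    Returns an (H, W) map; higher values mean stronger audio-visual agreement.
    """
    v = visual_feats / (np.linalg.norm(visual_feats, axis=-1, keepdims=True) + 1e-8)
    a = audio_feat / (np.linalg.norm(audio_feat) + 1e-8)
    return v @ a

def localize_peak(att):
    """Return the (row, col) index of the strongest response in the map."""
    return np.unravel_index(np.argmax(att), att.shape)

# Random stand-ins for learned embeddings (assumption: no trained model here).
rng = np.random.default_rng(0)
H, W, D = 6, 8, 16
audio = rng.normal(size=D)
visual = rng.normal(size=(H, W, D))

# Plant a "sound source": one location whose embedding is parallel to the audio.
visual[2, 5] = 3.0 * audio

att = attention_map(visual, audio)
peak = localize_peak(att)
print(att.shape, peak)
```

In this toy setup the peak of the map falls on the planted location because its embedding is parallel to the audio embedding. In a trained model the embeddings would come from learned audio and visual encoders rather than random draws.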

Code & Models

Repositories

afourast/avobjects (PyTorch)
