Learning Representations from Audio-Visual Spatial Alignment

Pedro Morgado; Yi Li; Nuno Vasconcelos

arXiv:2011.01819 · cs.CV · November 4, 2020 · 22 cites

TL;DR

This paper proposes a self-supervised learning method that leverages spatial cues in 360-degree video and spatial audio to improve audio-visual representation learning, demonstrating benefits across multiple downstream tasks.

Contribution

It introduces a novel contrastive spatial alignment task using transformer-based reasoning over full spatial content, capturing spatial cues ignored by prior methods.

Findings

01 Improved performance on audio-visual correspondence tasks

02 Enhanced spatial alignment accuracy in downstream evaluations

03 Better action recognition and video segmentation results

Abstract

We introduce a novel self-supervised pretext task for learning representations from audio-visual content. Prior work on audio-visual representation learning leverages correspondences at the video level. Approaches based on audio-visual correspondence (AVC) predict whether audio and video clips originate from the same or different video instances. Audio-visual temporal synchronization (AVTS) further discriminates negative pairs that originate from the same video instance but at different moments in time. While these approaches learn high-quality representations for downstream tasks such as action recognition, their training objectives disregard spatial cues naturally occurring in audio and visual signals. To learn from these spatial cues, we task a network to perform contrastive audio-visual spatial alignment of 360° video and spatial audio. The ability to perform spatial alignment is…
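The contrastive spatial alignment idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that K viewpoints have already been cropped from one 360° clip, that each viewpoint has a visual embedding and an embedding of the audio rendered for the same direction, and that a symmetric InfoNCE loss is used. The function names, the temperature value, and the use of NumPy are all assumptions made for illustration, and the transformer-based reasoning mentioned in the summary is omitted.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax_rows(z):
    """Numerically stable log-softmax over each row of a 2-D array."""
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))


def spatial_alignment_infonce(video_emb, audio_emb, temperature=0.1):
    """Hypothetical symmetric InfoNCE loss over K viewpoints of one clip.

    video_emb, audio_emb: arrays of shape (K, D). Row k of each array is
    assumed to come from the same viewing direction, so the positives sit on
    the diagonal of the similarity matrix. All other viewpoints of the same
    clip act as negatives, which is what makes the task spatial.
    """
    v = l2_normalize(video_emb)
    a = l2_normalize(audio_emb)
    logits = v @ a.T / temperature            # (K, K) scaled cosine similarities
    idx = np.arange(logits.shape[0])
    loss_v2a = -log_softmax_rows(logits)[idx, idx].mean()    # video -> audio
    loss_a2v = -log_softmax_rows(logits.T)[idx, idx].mean()  # audio -> video
    return 0.5 * (loss_v2a + loss_a2v)
```

Under these assumptions, embeddings that agree viewpoint by viewpoint give a low loss, while pairing each visual crop with audio from a different direction raises it. AVC and AVTS would instead draw negatives from other videos or other time steps, so a model could solve them without attending to where a sound comes from.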

Videos

Learning Representations from Audio-Visual Spatial Alignment · SlidesLive

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Video Analysis and Summarization