Learning Representations from Audio-Visual Spatial Alignment
Pedro Morgado, Yi Li, Nuno Vasconcelos

TL;DR
This paper proposes a self-supervised learning method that leverages spatial cues in 360-degree video and spatial audio to improve audio-visual representation learning, demonstrating benefits across multiple downstream tasks.
Contribution
It introduces a novel contrastive spatial alignment task using transformer-based reasoning over full spatial content, capturing spatial cues ignored by prior methods.
Findings
Improved performance on audio-visual correspondence tasks
Enhanced spatial alignment accuracy in downstream evaluations
Better action recognition and video segmentation results
Abstract
We introduce a novel self-supervised pretext task for learning representations from audio-visual content. Prior work on audio-visual representation learning leverages correspondences at the video level. Approaches based on audio-visual correspondence (AVC) predict whether audio and video clips originate from the same or different video instances. Audio-visual temporal synchronization (AVTS) further discriminates negative pairs originating from the same video instance but at different moments in time. While these approaches learn high-quality representations for downstream tasks such as action recognition, their training objectives disregard spatial cues naturally occurring in audio and visual signals. To learn from these spatial cues, we task a network to perform contrastive audio-visual spatial alignment of 360° video and spatial audio. The ability to perform spatial alignment is…
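To make the idea of contrastive spatial alignment concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each of K viewpoints cropped from the same 360° clip already has a video embedding and an audio embedding, and that the matching viewpoint is the positive while the other viewpoints of the same clip are negatives. The function names, the symmetric InfoNCE-style loss, and the `temperature` value are illustrative assumptions; the encoders and the transformer that combines viewpoints are omitted.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis=-1):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def spatial_alignment_loss(video_emb, audio_emb, temperature=0.1):
    """Symmetric InfoNCE-style loss over K viewpoints of one clip (hypothetical helper).

    video_emb, audio_emb: arrays of shape (K, D); row i of each array is
    assumed to come from the same viewing direction.
    """
    v = l2_normalize(video_emb)
    a = l2_normalize(audio_emb)
    sim = v @ a.T / temperature          # (K, K) similarity matrix
    idx = np.arange(sim.shape[0])        # positives lie on the diagonal
    loss_v2a = -log_softmax(sim, axis=1)[idx, idx].mean()  # video -> audio
    loss_a2v = -log_softmax(sim, axis=0)[idx, idx].mean()  # audio -> video
    return 0.5 * (loss_v2a + loss_a2v)


# Toy usage with random stand-ins for encoder outputs (assumed data).
rng = np.random.default_rng(0)
video = rng.normal(size=(4, 16))
audio_aligned = video + 0.01 * rng.normal(size=(4, 16))
audio_misaligned = np.roll(audio_aligned, shift=1, axis=0)  # wrong viewpoint pairing

print("aligned loss:   ", spatial_alignment_loss(video, audio_aligned))
print("misaligned loss:", spatial_alignment_loss(video, audio_misaligned))
```

In this toy setup the loss is lower when audio and video rows share a viewpoint than when the pairing is rotated, which is the signal a spatial alignment objective of this kind would train on.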
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Video Analysis and Summarization
