Self-supervised Learning of Audio Representations from Audio-Visual Data using Spatial Alignment
Shanshan Wang, Archontis Politis, Annamaria Mesaros, Tuomas Virtanen

TL;DR
This paper introduces audio-visual spatial alignment (AVSA), a self-supervised learning method that exploits the spatial alignment between audio and visual content in 360° videos to learn better audio representations and improve downstream task performance.
Contribution
The work proposes a novel AVSA task that incorporates spatial information and object detection to enhance audio-visual representation learning from 360° videos.
Findings
10% improvement on the AVSA task with first-order Ambisonics intensity vector (FOA-IV) features over log-mel spectrograms
Object-oriented crops boost human action recognition accuracy
Achieves performance comparable to state-of-the-art methods on acoustic scene classification
Abstract
Learning from audio-visual data offers many possibilities to express correspondence between the audio and visual content, similar to the human perception that relates aural and visual information. In this work, we present a method for self-supervised representation learning based on audio-visual spatial alignment (AVSA), a more sophisticated alignment task than the audio-visual correspondence (AVC). In addition to the correspondence, AVSA also learns from the spatial location of acoustic and visual content. Based on 360° video and Ambisonics audio, we propose selection of visual objects using object detection, and beamforming of the audio signal towards the detected objects, attempting to learn the spatial alignment between objects and the sound they produce. We investigate the use of spatial audio features to represent the audio input, and different audio formats: Ambisonics,…
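The beamforming step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes first-order Ambisonics in ACN channel order (W, Y, Z, X) with SN3D normalization, an equirectangular 360° frame whose centre corresponds to azimuth 0, and a simple first-order cardioid beam. All function names below are hypothetical and chosen for illustration only.

```python
import math

def pixel_to_direction(x, y, width, height):
    """Map an equirectangular pixel to (azimuth, elevation) in radians.

    Assumed convention: the image centre is azimuth 0, azimuth grows
    towards the left edge, and elevation grows towards the top edge.
    """
    azimuth = (0.5 - x / width) * 2.0 * math.pi
    elevation = (0.5 - y / height) * math.pi
    return azimuth, elevation

def unit_vector(azimuth, elevation):
    """Cartesian unit vector (x, y, z) for a direction on the sphere."""
    return (
        math.cos(azimuth) * math.cos(elevation),
        math.sin(azimuth) * math.cos(elevation),
        math.sin(elevation),
    )

def encode_plane_wave(signal, azimuth, elevation):
    """Encode a mono signal as a plane wave in FOA (ACN order: W, Y, Z, X)."""
    ux, uy, uz = unit_vector(azimuth, elevation)
    return [(s, s * uy, s * uz, s * ux) for s in signal]

def beamform_cardioid(foa_frames, azimuth, elevation):
    """Steer a first-order cardioid beam towards (azimuth, elevation).

    The beam has unit gain in the look direction and a null in the
    opposite direction.
    """
    ux, uy, uz = unit_vector(azimuth, elevation)
    return [0.5 * (w + ux * x + uy * y + uz * z) for (w, y, z, x) in foa_frames]

# Hypothetical example: a detected object's bounding-box centre in a 3840x1080 frame.
az, el = pixel_to_direction(960, 540, 3840, 1080)
source = [1.0, -0.5, 0.25]
foa = encode_plane_wave(source, az, el)
towards_object = beamform_cardioid(foa, az, el)
away_from_object = beamform_cardioid(foa, az + math.pi, -el)
```

Under these assumptions, steering the beam at the object's direction passes its sound unchanged, while steering at the opposite direction suppresses it, which is the kind of spatial cue an alignment task between visual crops and beamformed audio could exploit.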
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Hearing Loss and Rehabilitation
