Self-supervised Learning of Audio Representations from Audio-Visual Data using Spatial Alignment

Shanshan Wang; Archontis Politis; Annamaria Mesaros; Tuomas Virtanen

arXiv:2206.00970 · eess.AS · November 23, 2022

TL;DR

This paper introduces AVSA, a self-supervised learning method that leverages spatial alignment between audio and visual data in 360° videos to improve audio representation and downstream task performance.

Contribution

The work proposes a novel AVSA task that incorporates spatial information and object detection to enhance audio-visual representation learning from 360° videos.

Findings

01. 10% improvement with FOA-IV features over log-mel spectrograms

02. Object-oriented crops boost human action recognition accuracy

03. Achieves state-of-the-art results on acoustic scene classification

Abstract

Learning from audio-visual data offers many possibilities to express correspondence between the audio and visual content, similar to the human perception that relates aural and visual information. In this work, we present a method for self-supervised representation learning based on audio-visual spatial alignment (AVSA), a more sophisticated alignment task than the audio-visual correspondence (AVC). In addition to the correspondence, AVSA also learns from the spatial location of acoustic and visual content. Based on 360° video and Ambisonics audio, we propose selection of visual objects using object detection, and beamforming of the audio signal towards the detected objects, attempting to learn the spatial alignment between objects and the sound they produce. We investigate the use of spatial audio features to represent the audio input, and different audio formats: Ambisonics,…
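To make the idea of beamforming towards a detected object concrete, here is a minimal Python sketch. It is not the authors' implementation, and the abstract does not specify their beamformer. The sketch assumes first-order Ambisonics in ACN/SN3D channel order (W, Y, Z, X), an equirectangular 360° frame whose centre corresponds to azimuth 0 with azimuth increasing to the left, and a generic first-order virtual microphone (cardioid for `alpha=0.5`). The function names `pixel_to_direction`, `encode_plane_wave`, and `foa_beamform` are hypothetical and chosen for illustration only.

```python
import math


def pixel_to_direction(u, v, width, height):
    """Map an equirectangular pixel (u, v) to (azimuth, elevation) in radians.

    Assumed convention (not from the paper): the image centre is azimuth 0,
    azimuth grows towards the left edge, elevation grows towards the top.
    """
    azimuth = (0.5 - u / width) * 2.0 * math.pi
    elevation = (0.5 - v / height) * math.pi
    return azimuth, elevation


def encode_plane_wave(signal, azimuth, elevation):
    """Encode a mono signal as an ideal FOA plane wave (ACN/SN3D: W, Y, Z, X)."""
    gx = math.cos(azimuth) * math.cos(elevation)
    gy = math.sin(azimuth) * math.cos(elevation)
    gz = math.sin(elevation)
    w = [s for s in signal]
    y = [gy * s for s in signal]
    z = [gz * s for s in signal]
    x = [gx * s for s in signal]
    return w, y, z, x


def foa_beamform(w, y, z, x, azimuth, elevation, alpha=0.5):
    """Steer a first-order virtual microphone towards (azimuth, elevation).

    alpha = 1.0 gives an omnidirectional pattern, alpha = 0.5 a cardioid,
    alpha = 0.0 a figure-of-eight.
    """
    dx = math.cos(azimuth) * math.cos(elevation)
    dy = math.sin(azimuth) * math.cos(elevation)
    dz = math.sin(elevation)
    return [
        alpha * wi + (1.0 - alpha) * (dx * xi + dy * yi + dz * zi)
        for wi, yi, zi, xi in zip(w, y, z, x)
    ]


# Usage: treat the centre of a (hypothetical) detected bounding box as the
# look direction, then steer the beam at it and at the opposite direction.
az, el = pixel_to_direction(480, 480, 1920, 960)
w, y, z, x = encode_plane_wave([1.0, -0.5, 0.25], az, el)
towards_object = foa_beamform(w, y, z, x, az, el)
away_from_object = foa_beamform(w, y, z, x, az + math.pi, el)
```

With a cardioid pattern, a source located at the look direction passes with unit gain, while a source from the opposite direction is cancelled. This is the kind of spatial selectivity that lets an audio crop be paired with a visual crop of the same object.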

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Hearing Loss and Rehabilitation