Beyond Mono to Binaural: Generating Binaural Audio from Mono Audio with Depth and Cross Modal Attention

Kranti Kumar Parida; Siddharth Srivastava; Gaurav Sharma

arXiv:2111.08046·cs.CV·November 17, 2021

Open Access

TL;DR

This paper introduces a novel deep learning approach that uses depth maps and cross-modal attention to convert mono audio into binaural audio, enhancing immersive experiences in AR/VR without specialized recording setups.

Contribution

It proposes a new encoder-decoder architecture with hierarchical attention leveraging image, depth, and audio features, outperforming existing methods on public datasets.
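As a rough illustration of what cross-modal attention means here, the sketch below lets audio features attend over depth features using generic single-head scaled dot-product attention. It is not the paper's hierarchical attention, whose details this page does not give. The names `cross_attention`, `audio_feats`, and `depth_feats` and the toy feature values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Generic scaled dot-product attention (illustrative assumption).

    queries: features from one modality (here, audio positions)
    keys, values: features from another modality (here, depth locations)
    Returns the attended features and the attention weights.
    """
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        # Weighted sum of the value vectors.
        out = [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Hypothetical toy features: 2 audio positions, 3 depth-map locations.
audio_feats = [[1.0, 0.0], [0.0, 1.0]]
depth_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

attended, weights = cross_attention(audio_feats, depth_feats, depth_feats)
```

Each row of `weights` sums to 1 and shows how strongly one audio position draws on each depth location. A real model would use learned projections for the queries, keys, and values.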

Findings

01 Outperforms state-of-the-art methods on FAIR-Play and MUSIC-Stereo datasets.

02 Utilizes depth maps as a key cue for distance information in binauralization.

03 Qualitative results show the model focuses on relevant scene information.

Abstract

Binaural audio gives the listener an immersive experience and can enhance augmented and virtual reality. However, recording binaural audio requires a specialized setup with a dummy human head that has microphones in the left and right ears. Such a recording setup is difficult to build and set up, so mono audio has become the preferred choice in common devices. To obtain the same impact as binaural audio, recent efforts have been directed towards lifting mono audio to binaural audio conditioned on the visual input from the scene. Such approaches have not used an important cue for the task: the distance of different sound-producing objects from the microphones. In this work, we argue that the depth map of the scene can act as a proxy for inducing distance information of different objects in the scene, for the task of audio binauralization. We propose a novel encoder-decoder architecture with a…
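The abstract does not say how the binaural signal is parameterized. One common formulation in mono-to-binaural work, assumed here purely for illustration, treats the mono input as the sum of the two channels and has the model predict their difference. The left and right channels then follow by simple arithmetic. The function names and sample values below are hypothetical.

```python
def mix_to_mono(left, right):
    """Form a mono mix as the sample-wise sum of the two channels (assumed convention)."""
    return [l + r for l, r in zip(left, right)]


def channel_difference(left, right):
    """Sample-wise difference left - right; stands in for what a model would predict."""
    return [l - r for l, r in zip(left, right)]


def reconstruct_binaural(mono, diff):
    """Recover both channels from mono = L + R and diff = L - R."""
    left = [(m + d) / 2 for m, d in zip(mono, diff)]
    right = [(m - d) / 2 for m, d in zip(mono, diff)]
    return left, right


# Hypothetical toy waveforms with three samples per channel.
left = [0.5, 0.25, -0.125]
right = [0.25, -0.25, 0.375]

mono = mix_to_mono(left, right)
diff = channel_difference(left, right)  # a trained model would estimate this from mono plus visual cues
rec_left, rec_right = reconstruct_binaural(mono, diff)
```

With the true difference signal the reconstruction is exact, so under this assumed formulation any error in the binaural output comes from the predicted difference.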

Taxonomy

Topics: Speech and Audio Processing · Hearing Loss and Rehabilitation · Music and Audio Processing