Depth Infused Binaural Audio Generation using Hierarchical Cross-Modal Attention
Kranti Kumar Parida, Siddharth Srivastava, Neeraj Matiyali, Gaurav Sharma

TL;DR
This paper introduces a hierarchical cross-modal attention model that leverages image and depth features to enhance binaural audio generation from mono recordings, improving immersion in AR/VR applications.
Contribution
The work proposes a novel encoder-decoder architecture that incorporates depth information with image features using hierarchical attention for better binaural audio synthesis.
Findings
Adding depth features improves audio quality both qualitatively and quantitatively.
The hierarchical attention mechanism effectively fuses visual and depth cues with audio features.
The approach outperforms previous mono-to-binaural conversion methods.
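As an illustration of the general idea behind the findings above, here is a minimal NumPy sketch of cross-modal attention applied in two stages: audio features first attend over image features, and the result then attends over depth features. This is not the authors' architecture. The function names, feature shapes, fusion order, residual additions, and the absence of learned projection matrices are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context):
    # Scaled dot-product attention: each query row attends over all context rows.
    # No learned projections are used here (a simplifying assumption).
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)  # shape (n_queries, n_context)
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ context, weights


def hierarchical_fusion(audio, image, depth):
    # Hypothetical two-stage fusion; the order (image, then depth) is an assumption.
    img_attended, w_img = cross_attention(audio, image)
    stage1 = audio + img_attended              # residual combination with audio
    dep_attended, w_dep = cross_attention(stage1, depth)
    fused = stage1 + dep_attended
    return fused, w_img, w_dep


# Toy inputs: 4 audio tokens, 6 image regions, 6 depth regions, 8-dim features.
rng = np.random.default_rng(0)
audio = rng.standard_normal((4, 8))
image = rng.standard_normal((6, 8))
depth = rng.standard_normal((6, 8))

fused, w_img, w_dep = hierarchical_fusion(audio, image, depth)
print(fused.shape, w_img.shape, w_dep.shape)
```

In this sketch the fused features keep the shape of the audio features, so a decoder could consume them in place of the audio-only features. A real model would add learned query, key, and value projections and would be trained end to end.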
Abstract
Binaural audio gives the listener the feeling of being in the recording place and enhances the immersive experience when coupled with AR/VR. However, recording binaural audio requires a specialized setup that cannot be built into handheld devices, unlike traditional mono audio, which can be recorded with a single microphone. To overcome this drawback, prior works have tried to lift mono recorded audio to binaural audio as a post-processing step, conditioned on the visual input. But all prior approaches missed another highly important piece of information required for the task: the distance of the different sound-producing objects from the recording setup. In this work, we argue that the depth map of the scene can act as a proxy for encoding distance information of objects in the scene and show that adding depth features along with image…
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Digital Media Forensic Detection
