Points2Sound: From mono to binaural audio using 3D point cloud scenes
Francesc Lluís, Vasileios Chatziioannou, Alex Hofmann

TL;DR
Points2Sound is a deep learning model that converts mono audio into binaural audio by leveraging 3D point cloud visual data, enhancing immersive virtual experiences.
Contribution
This work introduces a novel multi-modal deep learning approach that uses 3D point cloud scenes to guide binaural audio synthesis from mono signals, extending previous 2D visual guidance methods.
Findings
3D visual information effectively guides binaural synthesis.
Model performance varies with scene attributes and reverberation.
Multiple mono signals and source counts impact synthesis quality.
Abstract
For immersive applications, the generation of binaural sound that matches its visual counterpart is crucial to bring meaningful experiences to people in a virtual environment. Recent studies have shown the possibility of using neural networks for synthesizing binaural audio from mono audio by using 2D visual information as guidance. Extending this approach by guiding the audio with 3D visual information and operating in the waveform domain may allow for a more accurate auralization of a virtual audio scene. We propose Points2Sound, a multi-modal deep learning model which generates a binaural version from mono audio using 3D point cloud scenes. Specifically, Points2Sound consists of a vision network and an audio network. The vision network uses 3D sparse convolutions to extract a visual feature from the point cloud scene. Then, the visual feature conditions the audio network, which…
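To make the idea of sparse 3D convolution over a point cloud concrete, the following is a minimal pure-Python sketch. It is an illustration only, not the Points2Sound vision network. The function names (`voxelize`, `sparse_conv3d`, `global_mean`), the scalar per-point feature, the all-ones kernel, and the toy scene are all assumptions made for this example. The sketch shows the two properties that make sparse convolutions practical for point clouds: only occupied voxels are stored, and the kernel is evaluated only at those voxels.

```python
from collections import defaultdict
from itertools import product


def voxelize(points, voxel_size):
    """Map (x, y, z, feature) points to a sparse {voxel_coord: mean_feature} dict."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x, y, z, feat in points:
        key = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        sums[key] += feat
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


def sparse_conv3d(voxels, kernel):
    """Apply a 3x3x3 kernel only at occupied voxels.

    `kernel` maps an offset (dx, dy, dz) in {-1, 0, 1}^3 to a scalar weight.
    Empty neighbours contribute nothing, and no new voxels are created,
    so the output has the same sparsity pattern as the input.
    """
    out = {}
    for (i, j, k) in voxels:
        acc = 0.0
        for (dx, dy, dz), weight in kernel.items():
            neighbour = (i + dx, j + dy, k + dz)
            if neighbour in voxels:
                acc += weight * voxels[neighbour]
        out[(i, j, k)] = acc
    return out


def global_mean(voxels):
    """Pool all occupied voxels into a single scene-level value."""
    return sum(voxels.values()) / len(voxels)


# Hypothetical toy scene: (x, y, z, scalar feature). Real scenes would carry
# richer per-point features; a scalar keeps the example short.
points = [
    (0.1, 0.2, 0.0, 1.0),
    (0.3, 0.1, 0.2, 3.0),   # falls in the same voxel as the first point
    (1.2, 0.1, 0.3, 4.0),   # adjacent voxel along x
    (5.0, 5.0, 5.0, 10.0),  # isolated voxel, no occupied neighbours
]

voxels = voxelize(points, voxel_size=1.0)
kernel = {offset: 1.0 for offset in product((-1, 0, 1), repeat=3)}  # assumed weights
features = sparse_conv3d(voxels, kernel)
scene_feature = global_mean(features)
```

In a trained network the kernel weights would be learned and the features would be vectors rather than scalars. The pooled `scene_feature` stands in for the kind of visual feature that could condition an audio network; how that conditioning is done is not covered by this sketch.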
Taxonomy
Topics: Hearing Loss and Rehabilitation · Speech and Audio Processing · Acoustic Wave Phenomena Research
Methods: Sparse Convolutions
