Multimodal Deep Learning Method for Real-Time Spatial Room Impulse Response Computing
Zhiyu Li, Xinwen Yue, Shenghui Zhao, Jing Wang

TL;DR
This paper introduces a multimodal deep learning approach for real-time spatial room impulse response generation in VR, combining scene data and waveforms to improve auditory scene rendering.
Contribution
The work presents a novel multimodal deep learning model that predicts spatial room impulse responses (SRIRs) from scene information and low-order reflection waveforms, reducing the computational load of VR audio rendering.
Findings
The model outperforms existing methods in both accuracy and speed
Constructed a diverse, new dataset for training and evaluation
Achieved real-time SRIR generation suitable for VR applications
Abstract
We propose a multimodal deep learning model for VR auralization that generates spatial room impulse responses (SRIRs) in real time to reconstruct scene-specific auditory perception. Employing SRIRs as the output reduces computational complexity and facilitates integration with personalized head-related transfer functions. The model takes two modalities as input: scene information and waveforms, where the waveform corresponds to the low-order reflections (LoR). LoR can be efficiently computed using geometrical acoustics (GA) but remains difficult for deep learning models to predict accurately. Scene geometry, acoustic properties, source coordinates, and listener coordinates are first used to compute LoR in real time via GA, and both LoR and these features are subsequently provided as inputs to the model. A new dataset was constructed, consisting of multiple scenes and their corresponding…
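To make the low-order reflection (LoR) input concrete, the following is a minimal sketch of how a geometrical-acoustics step might produce such a waveform. It is not the authors' implementation: the abstract does not say which GA method is used, so this sketch assumes a first-order image-source model in a rectangular (shoebox) room with a single frequency-independent absorption coefficient. The function name `low_order_reflections`, its parameters, and the sampling rate are all hypothetical, and the output is a single-channel response rather than a spatial one.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def low_order_reflections(room_dims, source, listener,
                          absorption=0.3, fs=16000, length=2048):
    """Direct path plus first-order image-source reflections (shoebox room).

    Hypothetical helper: returns a mono impulse response of `length` samples.
    """
    room_dims = np.asarray(room_dims, dtype=float)
    source = np.asarray(source, dtype=float)
    listener = np.asarray(listener, dtype=float)

    # The real source (order 0) plus one mirror image per wall (order 1).
    images = [(source, 0)]
    for axis in range(3):
        for wall in (0.0, room_dims[axis]):
            image = source.copy()
            image[axis] = 2.0 * wall - source[axis]  # mirror across the wall
            images.append((image, 1))

    # Pressure reflection coefficient derived from energy absorption.
    reflection_gain = np.sqrt(1.0 - absorption)

    rir = np.zeros(length)
    for position, order in images:
        distance = np.linalg.norm(position - listener)
        delay = int(round(distance / SPEED_OF_SOUND * fs))
        if delay < length:
            # 1/r spherical spreading, attenuated once per wall bounce.
            rir[delay] += reflection_gain ** order / max(distance, 1e-6)
    return rir


# Example: a 5 m x 4 m x 3 m room with hypothetical source/listener positions.
lor = low_order_reflections([5.0, 4.0, 3.0], [1.0, 1.0, 1.0], [4.0, 3.0, 2.0])
```

In a pipeline like the one the abstract describes, a waveform of this kind would be passed to the network together with the scene features (geometry, acoustic properties, source and listener coordinates), leaving the model to predict the parts of the SRIR that GA does not cover cheaply.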
