Multimodal Deep Learning Method for Real-Time Spatial Room Impulse Response Computing

Zhiyu Li; Xinwen Yue; Shenghui Zhao; Jing Wang

arXiv:2604.05545·eess.AS·April 8, 2026

TL;DR

This paper introduces a multimodal deep learning model that generates spatial room impulse responses (SRIRs) in real time for VR auralization, taking scene information and low-order reflection waveforms as input.

Contribution

The work presents a novel multimodal deep learning model that efficiently predicts SRIRs using scene geometry and waveforms, reducing computational load for VR audio.

Findings

01 Model outperforms existing methods in accuracy and speed

02 Constructed a diverse, new dataset for training and evaluation

03 Achieved real-time SRIR generation suitable for VR applications

Abstract

We propose a multimodal deep learning model for VR auralization that generates spatial room impulse responses (SRIRs) in real time to reconstruct scene-specific auditory perception. Employing SRIRs as the output reduces computational complexity and facilitates integration with personalized head-related transfer functions. The model takes two modalities as input: scene information and waveforms, where the waveform corresponds to the low-order reflections (LoR). LoR can be efficiently computed using geometrical acoustics (GA) but remains difficult for deep learning models to predict accurately. Scene geometry, acoustic properties, source coordinates, and listener coordinates are first used to compute LoR in real time via GA, and both LoR and these features are subsequently provided as inputs to the model. A new dataset was constructed, consisting of multiple scenes and their corresponding…
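The abstract says the LoR waveform is computed with geometrical acoustics before it reaches the model, but the visible text does not specify which GA method is used. As a rough illustration only, the sketch below uses a first-order image-source calculation in a rectangular (shoebox) room. The function name `low_order_reflections`, the uniform reflection coefficient `beta`, the sampling rate, and the shoebox geometry are all assumptions made for this example, not details from the paper.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def low_order_reflections(room, src, lis, beta=0.8, fs=16000, length=2048):
    """Mono impulse response holding the direct path and six first-order
    reflections of a shoebox room (hypothetical helper, not the paper's code).

    room: (Lx, Ly, Lz) in metres, with walls at 0 and L on each axis
    src, lis: source and listener coordinates in metres
    beta: assumed frequency-independent wall reflection coefficient
    """
    room = np.asarray(room, dtype=float)
    src = np.asarray(src, dtype=float)
    lis = np.asarray(lis, dtype=float)

    # Each path is (emitter position, amplitude gain); start with direct sound.
    paths = [(src, 1.0)]
    for axis in range(3):
        for wall in (0.0, room[axis]):
            image = src.copy()
            image[axis] = 2.0 * wall - src[axis]  # mirror source across wall
            paths.append((image, beta))

    ir = np.zeros(length)
    for pos, gain in paths:
        dist = np.linalg.norm(pos - lis)
        idx = int(round(dist / SPEED_OF_SOUND * fs))  # delay in samples
        if idx < length:
            ir[idx] += gain / max(dist, 1e-3)  # 1/r spherical spreading
    return ir


if __name__ == "__main__":
    ir = low_order_reflections((5.0, 4.0, 3.0), (1.0, 1.0, 1.5), (4.0, 3.0, 1.5))
    print("non-zero taps:", np.flatnonzero(ir))
```

This sketch is monaural and ignores direction of arrival, frequency-dependent absorption, and higher reflection orders. An SRIR pipeline would also need the spatial encoding of each path, which is left out here.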
