AV-RIR: Audio-Visual Room Impulse Response Estimation
Anton Ratnarajah, Sreyan Ghosh, Sonal Kumar, Purva Chiniya, Dinesh Manocha

TL;DR
AV-RIR introduces a multi-modal neural approach that combines audio and visual cues to accurately estimate room impulse responses, significantly improving speech dereverberation and environmental modeling for AR/VR applications.
Contribution
It presents a novel neural codec-based architecture with multi-task learning and new visual feature augmentation for superior RIR estimation from audio-visual data.
Findings
Outperforms previous methods by 36-63% in RIR estimation metrics.
Improves late reverberation component accuracy via image-to-RIR retrieval by 86%.
Achieves higher human preference scores and competitive speech dereverberation performance.
Abstract
Accurate estimation of Room Impulse Response (RIR), which captures an environment's acoustic properties, is important for speech processing and AR/VR applications. We propose AV-RIR, a novel multi-modal multi-task learning approach to accurately estimate the RIR from a given reverberant speech signal and the visual cues of its corresponding environment. AV-RIR builds on a novel neural codec-based architecture that effectively captures environment geometry and material properties and solves speech dereverberation as an auxiliary task by using multi-task learning. We also propose Geo-Mat features, which augment visual cues with material information, and CRIP, which improves late reverberation components in the estimated RIR via image-to-RIR retrieval by 86%. Empirical results show that AV-RIR quantitatively outperforms previous audio-only and visual-only approaches by achieving 36% - 63%…
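The abstract does not state the signal model, but a common assumption in this area is that reverberant speech is the clean (dry) speech convolved with the room's impulse response. The sketch below illustrates only that forward model and is not the AV-RIR method. The function names, the exponentially decaying noise RIR, and the parameter values are hypothetical choices made for illustration.

```python
import numpy as np


def synthetic_rir(fs=16000, rt60=0.5, length_s=0.5, seed=0):
    """Toy RIR (assumption, not from the paper): a direct-path impulse
    followed by Gaussian noise whose amplitude decays by 60 dB at t = rt60."""
    rng = np.random.default_rng(seed)
    t = np.arange(int(fs * length_s)) / fs
    # A 60 dB amplitude drop is a factor of 10**-3 = exp(-ln(1000)).
    decay = np.exp(-np.log(1000.0) * t / rt60)
    rir = rng.standard_normal(t.size) * decay
    rir[0] = 1.0  # direct sound arrives first at full amplitude
    return rir


def reverberate(clean, rir):
    """Forward model: reverberant = clean convolved with the RIR."""
    return np.convolve(clean, rir, mode="full")


if __name__ == "__main__":
    fs = 16000
    rir = synthetic_rir(fs=fs)
    # One second of white noise stands in for a dry speech signal.
    clean = np.random.default_rng(1).standard_normal(fs)
    wet = reverberate(clean, rir)
    print(wet.shape)
```

Estimating the RIR is the inverse of this operation: given only the reverberant signal (and, in AV-RIR, visual cues of the room), recover `rir`; dereverberation recovers `clean`.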
Taxonomy
Topics: Speech and Audio Processing · Hearing Loss and Rehabilitation · Indoor and Outdoor Localization Technologies
