TL;DR
Inspired by animal echolocation, this paper introduces a deep learning approach that combines RGB images, binaural echoes, and estimated material properties to significantly improve depth prediction accuracy, and demonstrates its effectiveness on the Replica and Matterport3D datasets.
Contribution
The paper presents a new multimodal fusion technique that explicitly incorporates material properties, leading to improved depth estimation from audio-visual data compared to existing methods.
Findings
28% reduction in RMSE relative to the prior state of the art
Effective on Replica and Matterport3D datasets
Comprehensive ablation and qualitative analysis
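For readers unfamiliar with the metric in the first finding, here is a minimal sketch of how RMSE and a relative improvement are usually computed. The function names and the sample numbers are illustrative assumptions, not taken from the paper; the paper's exact evaluation protocol (masking, depth range, units) is not stated in this summary.

```python
import math


def rmse(pred, target):
    """Root mean squared error between two equal-length depth sequences."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equal in length")
    squared_errors = ((p - t) ** 2 for p, t in zip(pred, target))
    return math.sqrt(sum(squared_errors) / len(pred))


def relative_improvement(baseline_rmse, new_rmse):
    """Fractional reduction in RMSE relative to a baseline (0.28 means 28%)."""
    return (baseline_rmse - new_rmse) / baseline_rmse


# Illustrative values only (not results from the paper).
baseline = rmse([1.0, 2.0, 3.0], [1.5, 2.5, 3.5])  # every error is 0.5
improved = rmse([1.0, 2.0, 3.0], [1.4, 2.4, 3.4])  # every error is 0.4
print(relative_improvement(baseline, improved))
```

Under this definition, lower RMSE is better, so an "improvement" is a reduction relative to the baseline's error.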
Abstract
We address the problem of estimating depth with multimodal audio-visual data. Inspired by the ability of animals, such as bats and dolphins, to infer the distance of objects with echolocation, some recent methods have utilized echoes for depth estimation. We propose an end-to-end deep-learning-based pipeline utilizing RGB images, binaural echoes, and estimated material properties of various objects within a scene. We argue that the relation between image, echoes, and depth, for different scene elements, is greatly influenced by the properties of those elements, and that a method designed to leverage this information can lead to significantly improved depth estimation from audio-visual inputs. We propose a novel multimodal fusion technique, which incorporates the material properties explicitly while combining audio (echoes) and visual modalities to predict the scene depth. We show empirically,…
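To make the idea of material-aware fusion concrete, here is a toy sketch of one way a per-pixel material score could control how much an echo-based depth estimate is trusted relative to an image-based one. This is not the paper's architecture, which the abstract does not detail. The function `fuse_depth`, the sigmoid gate, and the `material_logit` input are all hypothetical, chosen only to illustrate the general principle.

```python
import math


def fuse_depth(visual_depth, echo_depth, material_logit):
    """Blend two per-pixel depth estimates with a material-dependent gate.

    material_logit is a hypothetical per-pixel score: larger values mean the
    echo-based estimate is trusted more at that pixel.
    """
    fused = []
    for d_visual, d_echo, m in zip(visual_depth, echo_depth, material_logit):
        w_echo = 1.0 / (1.0 + math.exp(-m))  # sigmoid gate in (0, 1)
        fused.append(w_echo * d_echo + (1.0 - w_echo) * d_visual)
    return fused


# Three pixels: neutral gate, gate favoring echo, gate favoring the image.
fused = fuse_depth([2.0, 2.0, 2.0], [4.0, 4.0, 4.0], [0.0, 10.0, -10.0])
print(fused)
```

Because the gate yields a convex combination, each fused value lies between the two input estimates. In a learned model, a network would presumably produce such weights from image, echo, and material features instead of a hand-set score.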