Visual-Assisted Sound Source Depth Estimation in the Wild
Wei Sun, Lili Qiu

TL;DR
This paper introduces FBDepth, a novel audio-visual framework that leverages the difference in light and sound travel times to accurately estimate the depth of sound sources in diverse, real-world scenarios.
Contribution
Inspired by the flash-to-bang phenomenon, FBDepth is the first method to combine video and audio cues, using both semantic and spatial features, for large-range depth estimation.
Findings
Reduces absolute relative error (AbsRel) by 55% compared to RGB-only methods.
Successfully estimates depth up to 50 meters in real-world videos.
Uses a mobile-phone video dataset of over 3,000 clips spanning 20 objects.
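For readers unfamiliar with the error metric in the first finding, absolute relative error is conventionally defined as the mean of |predicted - true| / true over all samples. The sketch below follows that standard definition; the function name `abs_rel` and the sample depths are hypothetical illustrations, not values or code from the paper.

```python
def abs_rel(predicted, ground_truth):
    """Mean absolute relative error between predicted and true depths (meters).

    Standard definition: mean(|pred - gt| / gt). Assumes every ground-truth
    depth is strictly positive.
    """
    if len(predicted) != len(ground_truth) or not ground_truth:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(g <= 0 for g in ground_truth):
        raise ValueError("ground-truth depths must be positive")
    total = sum(abs(p - g) / g for p, g in zip(predicted, ground_truth))
    return total / len(ground_truth)


# Hypothetical depths: each prediction is off by 10% of the true value.
example = abs_rel([9.0, 22.0], [10.0, 20.0])
```

Because the error is normalized by the true depth, a 1 m mistake counts for far less at 50 m than at 5 m, which is why the metric is common for large-range depth evaluation.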
Abstract
Depth estimation enables a wide variety of 3D applications, such as robotics, autonomous driving, and virtual reality. Despite significant work in this area, it remains open how to enable accurate, low-cost, high-resolution, and large-range depth estimation. Inspired by the flash-to-bang phenomenon (i.e. hearing the thunder after seeing the lightning), this paper develops FBDepth, the first audio-visual depth estimation framework. It takes the difference between the time-of-flight (ToF) of the light and the sound to infer the sound source depth. FBDepth is the first to incorporate video and audio with both semantic features and spatial hints for range estimation. It first aligns correspondence between the video track and audio track to locate the target object and target sound in a coarse granularity. Based on the observation of moving objects' trajectories, FBDepth proposes to estimate…
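The flash-to-bang principle described in the abstract can be illustrated with a minimal sketch. This is not the FBDepth pipeline itself, only the underlying time-of-flight arithmetic; the function names, the 343 m/s speed of sound (dry air at roughly 20 °C), and the example frame and sample indices are all assumptions made for illustration.

```python
SPEED_OF_SOUND_M_S = 343.0          # assumed: dry air at about 20 degrees C
SPEED_OF_LIGHT_M_S = 299_792_458.0  # speed of light in vacuum


def delay_from_indices(video_frame, audio_sample, fps, sample_rate):
    """Seconds between seeing an event (frame index) and hearing it (sample index).

    Assumes the video and audio tracks share the same time origin.
    """
    return audio_sample / sample_rate - video_frame / fps


def depth_from_delay(delay_s, v_sound=SPEED_OF_SOUND_M_S, v_light=SPEED_OF_LIGHT_M_S):
    """Source distance in meters from the light-versus-sound arrival delay.

    Solves d / v_sound - d / v_light = delay_s for d. Since light is so much
    faster than sound, the result is very close to delay_s * v_sound.
    """
    if delay_s < 0:
        raise ValueError("sound cannot arrive before light")
    return delay_s * v_sound * v_light / (v_light - v_sound)


# Hypothetical event: visible at frame 30 of a 30 fps video, audible at
# sample 52,800 of a 48 kHz audio track, i.e. about 0.1 s later.
delay = delay_from_indices(30, 52_800, fps=30.0, sample_rate=48_000)
depth = depth_from_delay(delay)  # roughly 34.3 m under the assumptions above
```

Under these assumptions, one frame at 30 fps spans about 33 ms, during which sound travels roughly 11 m. Frame-level alignment alone is therefore too coarse for accurate depth, which is consistent with the abstract describing the initial audio-video alignment as coarse.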
Taxonomy
Topics: Advanced Vision and Imaging · Image Enhancement Techniques · Video Analysis and Summarization
