TL;DR
This paper shows that passive scene sounds can improve relative camera pose estimation in in-the-wild videos, especially under visual degradation, by integrating audio cues into an existing vision-only model.
Contribution
It introduces a simple audio-visual framework that integrates direction-of-arrival (DOA) spectra and binauralized embeddings into a state-of-the-art vision-only model for relative camera pose estimation in real-world videos.
Findings
On two large datasets, adding audio cues yields consistent gains over strong visual-only baselines.
The approach remains robust when the visual input is corrupted, for example by motion blur or occlusions.
To the authors' knowledge, this is the first work to successfully use audio for relative camera pose estimation in in-the-wild videos.
Abstract
Understanding camera motion is a fundamental problem in embodied perception and 3D scene understanding. While visual methods have advanced rapidly, they often struggle under visually degraded conditions such as motion blur or occlusions. In this work, we show that passive scene sounds provide cues complementary to vision for relative camera pose estimation for in-the-wild videos. We introduce a simple but effective audio-visual framework that integrates direction-of-arrival (DOA) spectra and binauralized embeddings into a state-of-the-art vision-only pose estimation model. Our results on two large datasets show consistent gains over strong visual baselines, plus robustness when the visual information is corrupted. To our knowledge, this represents the first work to successfully leverage audio for relative camera pose estimation in real-world videos, and it establishes incidental,…
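The abstract does not say how the DOA spectra are computed, so the following is only an illustrative sketch of what a DOA spectrum can look like for a two-microphone recording. It uses GCC-PHAT, a standard time-delay technique chosen here as an assumption and not taken from the paper. The function names (`gcc_phat`, `doa_spectrum`), the microphone spacing, and the sample rate are all hypothetical.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, approximate value in air at room temperature


def gcc_phat(sig_a, sig_b, max_lag):
    """Return integer lags and the PHAT-weighted cross-correlation.

    A positive peak lag means sig_a is delayed relative to sig_b.
    Illustrative helper, not the paper's implementation.
    """
    n = len(sig_a) + len(sig_b)  # zero-pad to avoid circular wrap-around
    spec_a = np.fft.rfft(sig_a, n=n)
    spec_b = np.fft.rfft(sig_b, n=n)
    cross = spec_a * np.conj(spec_b)
    cross /= np.abs(cross) + 1e-12  # PHAT weighting: keep phase, drop magnitude
    cc = np.fft.irfft(cross, n=n)
    # Reorder so the lags run from -max_lag to +max_lag.
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
    lags = np.arange(-max_lag, max_lag + 1)
    return lags, cc


def doa_spectrum(sig_a, sig_b, fs, mic_distance, n_angles=181):
    """Map the correlation onto candidate arrival angles in [-90, 90] degrees.

    Assumes a far-field source and two microphones mic_distance metres apart.
    """
    max_lag = max(1, int(np.ceil(mic_distance / SPEED_OF_SOUND * fs)))
    lags, cc = gcc_phat(sig_a, sig_b, max_lag)
    angles_deg = np.linspace(-90.0, 90.0, n_angles)
    # Far-field model: delay in samples = d * sin(theta) / c * fs.
    lag_per_angle = (
        mic_distance * np.sin(np.deg2rad(angles_deg)) / SPEED_OF_SOUND * fs
    )
    spectrum = np.interp(lag_per_angle, lags, cc)
    return angles_deg, spectrum
```

The peak of `spectrum` indicates the most likely arrival angle under these assumptions. In a framework like the one described, a per-frame vector of this kind could be supplied to a pose model as an additional input alongside visual features, though the actual feature design is not specified in the abstract.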
