Audio-Visual Camera Pose Estimation with Passive Scene Sounds and In-the-Wild Video

Daniel Adebi; Sagnik Majumder; Kristen Grauman

arXiv:2512.12165·cs.CV·May 13, 2026

TL;DR

This paper shows that passive scene sounds can improve relative camera pose estimation for in-the-wild videos, especially when the visuals are degraded, by integrating audio cues into an existing vision-only model.

Contribution

The paper introduces a simple audio-visual framework that integrates direction-of-arrival (DOA) spectra and binauralized embeddings into a state-of-the-art vision-only pose estimation model, targeting real-world videos.
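
As a rough illustration of what a DOA spectrum captures, the sketch below estimates the inter-channel time delay of a two-channel recording with GCC-PHAT and converts it to an arrival angle. This is a standard textbook technique used here for illustration only; the paper's summary does not say how its DOA spectra are computed. The function names `gcc_phat_spectrum` and `delay_to_angle`, the 0.2 m microphone spacing, and the 343 m/s speed of sound are all assumptions made for this example.

```python
import numpy as np

def gcc_phat_spectrum(left, right, fs, max_tau):
    """Return candidate delays (seconds) and the GCC-PHAT score at each delay.

    A positive delay means the left channel lags the right channel.
    """
    n = len(left) + len(right)            # zero-pad to avoid circular wrap-around
    L = np.fft.rfft(left, n=n)
    R = np.fft.rfft(right, n=n)
    cross = L * np.conj(R)
    cross /= np.abs(cross) + 1e-12        # PHAT weighting: keep phase, drop magnitude
    cc = np.fft.irfft(cross, n=n)
    max_shift = max(1, int(fs * max_tau)) # largest physically plausible lag, in samples
    # Reorder so that lags run from -max_shift to +max_shift.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    lags = np.arange(-max_shift, max_shift + 1) / fs
    return lags, cc

def delay_to_angle(tau, mic_distance, c=343.0):
    """Far-field arrival angle in degrees for delay tau (assumed geometry)."""
    return np.degrees(np.arcsin(np.clip(c * tau / mic_distance, -1.0, 1.0)))

if __name__ == "__main__":
    fs, mic_distance = 16000, 0.2         # hypothetical sample rate and mic spacing
    rng = np.random.default_rng(0)
    source = rng.standard_normal(4096)
    delay = 5                             # left channel hears the source 5 samples late
    left = np.concatenate((np.zeros(delay), source[:-delay]))
    right = source
    lags, cc = gcc_phat_spectrum(left, right, fs, max_tau=mic_distance / 343.0)
    tau = lags[np.argmax(cc)]
    print("estimated delay (samples):", round(tau * fs))
    print("estimated angle (degrees):", delay_to_angle(tau, mic_distance))
```

The vector `cc` plays the role of a coarse one-dimensional spectrum over candidate directions: its peak marks the most likely source direction relative to the microphone pair. As the camera rotates, that peak shifts, which is the kind of cue a pose model could in principle exploit.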

Findings

01

Audio cues yield consistent gains over strong visual-only baselines on two large datasets.

02

The approach remains robust when the visual input is corrupted, for example by motion blur or occlusion.

03

To the authors' knowledge, this is the first work to successfully use audio for relative camera pose estimation in in-the-wild videos.

Abstract

Understanding camera motion is a fundamental problem in embodied perception and 3D scene understanding. While visual methods have advanced rapidly, they often struggle under visually degraded conditions such as motion blur or occlusions. In this work, we show that passive scene sounds provide cues complementary to vision for relative camera pose estimation for in-the-wild videos. We introduce a simple but effective audio-visual framework that integrates direction-of-arrival (DOA) spectra and binauralized embeddings into a state-of-the-art vision-only pose estimation model. Our results on two large datasets show consistent gains over strong visual baselines, plus robustness when the visual information is corrupted. To our knowledge, this represents the first work to successfully leverage audio for relative camera pose estimation in real-world videos, and it establishes incidental,…

Code & Models

Repositories

http://vision.cs.utexas.edu/projects/av_camera_pose
