Interpretable Perception and Reasoning for Audiovisual Geolocation
Yiyang Su, Xiaoming Liu

TL;DR
This paper introduces a novel audiovisual geolocation framework that combines interpretable perception and reasoning to improve global localization accuracy using multimodal cues, supported by a new large-scale video benchmark.
Contribution
It presents AVG, a comprehensive audiovisual geolocation benchmark, and a three-stage framework integrating perception, reasoning, and prediction for enhanced localization.
Findings
The framework significantly outperforms unimodal baselines.
Interpretable soundscape perception provides critical localization signals.
Multimodal reasoning with audio and visual cues achieves high-precision geolocation.
Abstract
While recent advances in Multimodal Large Language Models (MLLMs) have improved image-based localization, precise global geolocation remains a formidable challenge due to the inherent ambiguity of visual landscapes and the largely untapped potential of auditory cues. In this paper, we introduce Audiovisual Geolocation, a framework designed to resolve geographic ambiguity through interpretable perception and reasoning. We present AVG, a high-quality global-scale video benchmark for geolocation, comprising 20,000 curated clips across 1,000 distinct locations. To address the complexity of audiovisual geolocation, we propose a three-stage framework: (1) a Perception stage that utilizes a mixture-autoregressive sparse autoencoder to decompose noisy audio into semantically grounded "acoustic atoms"; (2) a Multimodal Reasoning stage that employs an MLLM finetuned via Group Relative Policy…
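To make the Perception stage's idea of decomposing audio into a few "acoustic atoms" more concrete, here is a minimal sketch of a generic top-k sparse autoencoder pass. It is not the paper's mixture-autoregressive variant, whose details are not given in this summary. The function names, the tied encoder/decoder weights, the toy 3-dimensional feature vector, and the choice of k are all assumptions made for illustration.

```python
from typing import List, Tuple

Vector = List[float]


def encode_top_k(x: Vector, enc_weights: List[Vector], enc_bias: Vector,
                 k: int) -> List[Tuple[int, float]]:
    """Hypothetical encoder: linear map + ReLU, then keep the k largest activations."""
    active = []
    for idx, (w, b) in enumerate(zip(enc_weights, enc_bias)):
        pre = sum(wi * xi for wi, xi in zip(w, x)) + b
        if pre > 0.0:  # ReLU: only positive activations count as "firing" atoms
            active.append((idx, pre))
    active.sort(key=lambda pair: pair[1], reverse=True)
    return active[:k]


def decode(code: List[Tuple[int, float]], atoms: List[Vector], dim: int) -> Vector:
    """Reconstruct the input as a weighted sum of the selected atoms."""
    recon = [0.0] * dim
    for idx, activation in code:
        for d in range(dim):
            recon[d] += activation * atoms[idx][d]
    return recon


# Toy dictionary of three atoms (assumed; real atoms would be learned from audio features).
atoms = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
enc_bias = [0.0, 0.0, 0.0]

x = [0.9, 0.1, 0.5]                               # toy audio feature vector
code = encode_top_k(x, atoms, enc_bias, k=2)      # tied weights: encoder reuses the atoms
recon = decode(code, atoms, dim=3)

print("active atoms:", [idx for idx, _ in code])
print("reconstruction:", recon)
```

The sparse code (which atoms fire, and how strongly) is the interpretable part: each index could in principle be given a human-readable label, and only those few labeled atoms would be passed on to a downstream reasoning stage.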
Taxonomy
Topics: Speech and Audio Processing · Multimodal Machine Learning Applications · Music and Audio Processing
