TL;DR
Geo2Sound introduces a scalable framework that generates realistic soundscapes from satellite imagery by combining geospatial attributes, semantic hypotheses, and geo-acoustic alignment, validated on a new large-scale benchmark.
Contribution
The paper introduces the task of satellite-to-soundscape generation, a framework for it (Geo2Sound), and what the authors describe as the first large-scale benchmark dataset for this purpose.
Findings
Geo2Sound achieves a state-of-the-art Fréchet Audio Distance (FAD, lower is better) of 1.765, outperforming baselines by 50%.
Human evaluations show a 26.5% improvement in realism and semantic alignment.
The framework effectively models geographic and acoustic correlations for soundscape synthesis.
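For readers unfamiliar with the FAD metric cited above, the following is a minimal sketch of the Fréchet distance between two sets of audio embeddings, each summarized by a Gaussian (mean and covariance). It is not the paper's evaluation code: the function name `frechet_distance` is hypothetical, and the embeddings are assumed to come from some pretrained audio encoder that this summary does not specify.

```python
import numpy as np


def frechet_distance(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Frechet distance between Gaussians fit to two embedding sets.

    emb_a, emb_b: arrays of shape (num_clips, embedding_dim). Which audio
    encoder produces them is an assumption left to the caller.
    """
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(emb_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(emb_b, rowvar=False))

    # Tr((cov_a cov_b)^{1/2}) equals the sum of square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # positive semi-definite covariances (up to numerical noise).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)
```

A lower value means the generated audio's embedding distribution sits closer to the reference distribution. Reported scores depend on the embedding model and reference set, so values are only comparable under the same evaluation setup.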
Abstract
Recent image-to-audio models have shown impressive performance on object-centric visual scenes. However, their application to satellite imagery remains limited by the complex, wide-area semantic ambiguity of top-down views. While satellite imagery provides a uniquely scalable source for global soundscape generation, matching these views to real acoustic environments with unique spatial structures is inherently difficult. To address this challenge, we introduce Geo2Sound, a novel task and framework for generating geographically realistic soundscapes from satellite imagery. Specifically, Geo2Sound combines structural geospatial attributes modeling, semantic hypothesis expansion, and geo-acoustic alignment in a unified framework. A lightweight classifier summarizes overhead scenes into compact geographic attributes, multiple sound-oriented semantic hypotheses are used to generate diverse…
