SounDiT: Geo-Contextual Soundscape-to-Landscape Generation
Junbo Wang, Haofeng Tan, Bowen Liao, Albert Jiang, Teng Fei, Qixing Huang, Bing Zhou, Zhengzhong Tu, Shan Ye, Yuhao Kang

TL;DR
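This paper introduces Geo-contextual Soundscape-to-Landscape (GeoS2L) generation, a new task that synthesizes geographically realistic landscape images from environmental soundscapes. It supports the task with two large-scale datasets (SoundingSVI and SonicUrban), a diffusion transformer model (SounDiT), and an evaluation framework for the consistency of generated images.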
Contribution
The paper defines GeoS2L as a new task, constructs two large-scale geo-contextual multi-modal datasets (SoundingSVI and SonicUrban), proposes SounDiT, a diffusion transformer (DiT)-based model, and introduces the Place Similarity Score for evaluation.
Findings
SounDiT outperforms baseline models on the GeoS2L task.
The Place Similarity Score effectively measures generation consistency.
Extensive experiments validate the model's ability to produce geographically coherent landscapes.
Abstract
Recent audio-to-image models have shown impressive performance in generating images of specific objects conditioned on their corresponding sounds. However, these models fail to reconstruct real-world landscapes conditioned on environmental soundscapes. To address this gap, we present Geo-contextual Soundscape-to-Landscape (GeoS2L) generation, a novel and practically significant task that aims to synthesize geographically realistic landscape images from environmental soundscapes. To support this task, we construct two large-scale geo-contextual multi-modal datasets, SoundingSVI and SonicUrban, which pair diverse environmental soundscapes with real-world landscape images. We propose SounDiT, a diffusion transformer (DiT)-based model that incorporates environmental soundscapes and geo-contextual scene conditioning to synthesize geographically coherent landscape images. Furthermore, we…
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Music and Audio Processing · Noise Effects and Management
Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Dense Connections · Layer Normalization · Diffusion · Byte Pair Encoding · Label Smoothing · Adam · Softmax
