SounDiT: Geo-Contextual Soundscape-to-Landscape Generation

Junbo Wang; Haofeng Tan; Bowen Liao; Albert Jiang; Teng Fei; Qixing Huang; Bing Zhou; Zhengzhong Tu; Shan Ye; Yuhao Kang

arXiv:2505.12734 · cs.SD · March 3, 2026

Open Access · 1 model

TL;DR

This paper introduces GeoS2L, the task of generating geographically realistic landscape images from environmental soundscapes. The task is supported by two large-scale datasets (SoundingSVI and SonicUrban), a diffusion transformer model (SounDiT), and the Place Similarity Score for evaluating generation consistency.

Contribution

The paper defines the GeoS2L task, constructs two large-scale geo-contextual datasets (SoundingSVI and SonicUrban), proposes the SounDiT diffusion transformer model, and introduces the Place Similarity Score for evaluation.

Findings

01 SounDiT outperforms baselines on the GeoS2L task.

02 The Place Similarity Score effectively measures the consistency of generated images.

03 Extensive experiments validate the model's ability to produce geographically coherent landscapes.

Abstract

Recent audio-to-image models have shown impressive performance in generating images of specific objects conditioned on their corresponding sounds. However, these models fail to reconstruct real-world landscapes conditioned on environmental soundscapes. To address this gap, we present Geo-contextual Soundscape-to-Landscape (GeoS2L) generation, a novel and practically significant task that aims to synthesize geographically realistic landscape images from environmental soundscapes. To support this task, we construct two large-scale geo-contextual multi-modal datasets, SoundingSVI and SonicUrban, which pair diverse environmental soundscapes with real-world landscape images. We propose SounDiT, a diffusion transformer (DiT)-based model that incorporates environmental soundscapes and geo-contextual scene conditioning to synthesize geographically coherent landscape images. Furthermore, we…

Code & Models

Models

Hugging Face: BBO66/SounDiT (model)

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Music and Audio Processing · Noise Effects and Management

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Dense Connections · Layer Normalization · Diffusion · Byte Pair Encoding · Label Smoothing · Adam · Softmax