Diff-SAGe: End-to-End Spatial Audio Generation Using Diffusion Models

Saksham Singh Kushwaha; Jianbo Ma; Mark R. P. Thomas; Yapeng Tian; Avery Bruni

arXiv:2410.11299 · cs.SD · July 16, 2025

TL;DR

This paper introduces Diff-SAGe, a flow-based diffusion-transformer model that generates first-order Ambisonics (FOA) spatial audio directly from a sound category and a source location, outperforming traditional simulation methods in accuracy and realism.

Contribution

We propose an end-to-end diffusion-transformer framework for spatial audio generation that addresses the limitations of traditional simulation-based approaches and represents FOA as complex spectrograms, preserving the phase information needed for accurate spatial cues.

Findings

01 Outperforms traditional simulation-based baselines on objective metrics

02 Achieves higher subjective quality in spatial audio generation

03 Demonstrates robustness across two datasets

Abstract

Spatial audio is a crucial component in creating immersive experiences. Traditional simulation-based approaches to generate spatial audio rely on expertise, have limited scalability, and assume independence between semantic and spatial information. To address these issues, we explore end-to-end spatial audio generation. We introduce and formulate a new task of generating first-order Ambisonics (FOA) given a sound category and sound source spatial location. We propose Diff-SAGe, an end-to-end, flow-based diffusion-transformer model for this task. Diff-SAGe utilizes a complex spectrogram representation for FOA, preserving the phase information crucial for accurate spatial cues. Additionally, a multi-conditional encoder integrates the input conditions into a unified representation, guiding the generation of FOA waveforms from noise. Through extensive evaluations on two datasets, we…
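
To make the two representations named in the abstract concrete, the sketch below shows one way a mono source at a given direction could be written as four FOA channels, and how a complex spectrogram keeps phase by storing real and imaginary parts. It is an illustration only, not the paper's pipeline: the ACN channel order (W, Y, Z, X), the SN3D gains, the STFT settings, and the function names encode_foa and complex_spectrogram are all assumptions chosen for this example.

```python
import numpy as np


def encode_foa(mono, azimuth_deg, elevation_deg):
    """Pan a mono signal to 4-channel FOA.

    Assumed convention (not taken from the paper): ACN channel order
    W, Y, Z, X with SN3D gains, azimuth counter-clockwise from front.
    """
    az = np.deg2rad(azimuth_deg)
    el = np.deg2rad(elevation_deg)
    gains = np.array([
        1.0,                      # W: omnidirectional
        np.sin(az) * np.cos(el),  # Y: left-right
        np.sin(el),               # Z: up-down
        np.cos(az) * np.cos(el),  # X: front-back
    ])
    return gains[:, None] * mono[None, :]


def complex_spectrogram(foa, n_fft=512, hop=256):
    """STFT each channel and keep real and imaginary parts.

    Returns shape (channels, 2, frames, bins). Keeping both parts
    retains the inter-channel phase that a magnitude-only
    spectrogram would discard. n_fft and hop are illustrative.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (foa.shape[1] - n_fft) // hop
    frames = np.stack(
        [foa[:, i * hop:i * hop + n_fft] for i in range(n_frames)],
        axis=1,
    )  # (channels, frames, n_fft)
    spec = np.fft.rfft(frames * window, axis=-1)
    return np.stack([spec.real, spec.imag], axis=1)


if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr) / sr
    mono = np.sin(2 * np.pi * 440.0 * t)
    foa = encode_foa(mono, azimuth_deg=90.0, elevation_deg=0.0)
    spec = complex_spectrogram(foa)
    print(foa.shape, spec.shape)
```

In this toy setting a source at 90 degrees azimuth lands entirely in the W and Y channels, so the direction is recoverable from the relative sign and level of the channels. Those relations live in the phase of the spectrogram, which is why a phase-preserving representation matters for spatial cues.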

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech and Audio Processing