SemanticAudio: Audio Generation and Editing in Semantic Space

Zheqi Dai; Guangyan Zhang; Haolin He; Xiquan Li; Jingyu Li; Chunyat Wu; Yiwen Guo; Qiuqiang Kong

arXiv:2601.21402 · eess.AS · January 30, 2026

Open Access

TL;DR

SemanticAudio introduces a high-level semantic space for audio generation and editing, improving alignment with textual descriptions and enabling precise, training-free attribute modifications through a novel two-stage architecture.

Contribution

The paper proposes a new semantic space and a two-stage Flow Matching architecture to improve text-to-audio generation and editing, together with a training-free editing mechanism.

Findings

01. Outperforms existing methods in semantic alignment

02. Enables precise attribute-level audio editing without retraining

03. Demonstrates high-fidelity audio generation from semantic sketches

Abstract

In recent years, Text-to-Audio Generation has achieved remarkable progress, offering sound creators powerful tools to transform textual inspirations into vivid audio. However, existing models predominantly operate directly in the acoustic latent space of a Variational Autoencoder (VAE), often leading to suboptimal alignment between generated audio and textual descriptions. In this paper, we introduce SemanticAudio, a novel framework that conducts both audio generation and editing directly in a high-level semantic space. We define this semantic space as a compact representation capturing the global identity and temporal sequence of sound events, distinct from fine-grained acoustic details. SemanticAudio employs a two-stage Flow Matching architecture: the Semantic Planner first generates these compact semantic features to sketch the global semantic layout, and the Acoustic Synthesizer…
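The two-stage pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the tensor shapes, the `W_plan` and `W_synth` projections, and the closed-form velocity fields are hypothetical stand-ins for trained networks, and the second stage is assumed to condition on the semantic features because the abstract is cut off before describing it. The sketch only shows how Flow Matching sampling (Euler integration of a velocity field from noise at t = 0 to a sample at t = 1) could be chained across a planner and a synthesizer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a short, compact semantic sequence and a longer acoustic one.
TEXT_DIM, SEM_FRAMES, SEM_DIM, AC_FRAMES, AC_DIM = 8, 4, 6, 16, 12

# Random projections standing in for trained model weights (assumption).
W_plan = rng.standard_normal((TEXT_DIM, SEM_FRAMES * SEM_DIM))
W_synth = rng.standard_normal((SEM_DIM, AC_DIM))


def euler_sample(velocity, x0, cond, num_steps=32):
    """Integrate dx/dt = velocity(x, t, cond) from t=0 to t=1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt  # t never reaches 1, so (1 - t) below stays positive
        x = x + dt * velocity(x, t, cond)
    return x


def toy_velocity(target_fn):
    """Stand-in velocity field whose flow ends at target_fn(cond).

    On the straight-line path x_t = (1 - t) * x0 + t * x1, the velocity
    (x1 - x_t) / (1 - t) carries x_t to x1. A trained network would
    replace this closed form.
    """
    def velocity(x, t, cond):
        return (target_fn(cond) - x) / (1.0 - t)
    return velocity


def plan_target(text_emb):
    # Stage 1 target: semantic features derived from the text embedding.
    return (text_emb @ W_plan).reshape(SEM_FRAMES, SEM_DIM)


def synth_target(semantic):
    # Stage 2 target: upsample the semantic sketch in time, then project.
    upsampled = np.repeat(semantic, AC_FRAMES // SEM_FRAMES, axis=0)
    return upsampled @ W_synth


def generate(text_emb):
    # Stage 1: the planner maps noise to a semantic sketch, conditioned on text.
    sem_noise = rng.standard_normal((SEM_FRAMES, SEM_DIM))
    semantic = euler_sample(toy_velocity(plan_target), sem_noise, text_emb)
    # Stage 2: the synthesizer maps noise to acoustic latents, conditioned on the sketch.
    ac_noise = rng.standard_normal((AC_FRAMES, AC_DIM))
    acoustic = euler_sample(toy_velocity(synth_target), ac_noise, semantic)
    return semantic, acoustic


text_emb = rng.standard_normal(TEXT_DIM)
semantic, acoustic = generate(text_emb)
print(semantic.shape, acoustic.shape)
```

In this toy setup, any change made to `semantic` between the two stages would propagate to the acoustic output without retraining either stage, which is one plausible reading of why editing in the semantic space can be training-free; the paper's actual editing mechanism is not described in the text above.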

Taxonomy

Topics: Music Technology and Sound Studies · Generative Adversarial Networks and Image Synthesis · Music and Audio Processing