SemanticVocoder: Bridging Audio Generation and Audio Understanding via Semantic Latents

Zeyu Xie; Chenxing Li; Qiao Jin; Xuenan Xu; Guanrou Yang; Wenfu Wang; Mengyue Wu; Dong Yu; Yuexian Zou

arXiv:2602.23333 · cs.SD · February 27, 2026

TL;DR

SemanticVocoder is a generative vocoder that synthesizes waveforms directly from semantic encoder latents rather than acoustic VAE latents. The semantic latents are more discriminative for generation and give audio understanding and synthesis a shared semantic space.

Contribution

The paper discards acoustic VAE latents in favor of semantic encoder latents and proposes SemanticVocoder, a generative vocoder that synthesizes waveforms directly from those latents.

Findings

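01. The text-to-audio model equipped with SemanticVocoder achieves a Frechet Distance of 12.823 on the AudioCaps test set.

02. The same model achieves a Frechet Audio Distance of 1.709 on the AudioCaps test set.

03. Semantic latents show superior discriminability compared to acoustic VAE latents.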
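
For readers unfamiliar with the two metrics above: both are Frechet distances between Gaussians fitted to embeddings of reference and generated audio, where lower is better. The sketch below shows only that general formula on random stand-in vectors. The function name, the toy data, and the eigenvalue shortcut for the matrix square root are illustrative assumptions; this summary does not state which embedding models or evaluation toolkit the authors used, so this is not their evaluation code.

```python
import numpy as np


def frechet_distance(x: np.ndarray, y: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two embedding sets.

    x, y: arrays of shape (num_samples, dim). Hypothetical helper for illustration.
    """
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)

    # tr((cov_x cov_y)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_x @ cov_y, which are real and non-negative for
    # positive semi-definite covariances (clip guards against round-off).
    eigvals = np.linalg.eigvals(cov_x @ cov_y)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_x - mu_y
    return float(diff @ diff + np.trace(cov_x) + np.trace(cov_y) - 2.0 * tr_sqrt)


# Toy stand-ins for embeddings of reference and generated audio (assumed data).
rng = np.random.default_rng(0)
reference = rng.normal(size=(500, 8))
generated = rng.normal(loc=0.5, size=(500, 8))

fd_same = frechet_distance(reference, reference)
fd_shifted = frechet_distance(reference, generated)
print(f"identical sets: {fd_same:.6f}, shifted set: {fd_shifted:.3f}")
```

Identical embedding sets give a distance of essentially zero, and the distance grows as the generated distribution drifts from the reference one. Reported scores depend heavily on the embedding network, so numbers are comparable only under the same evaluation setup.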

Abstract

Recent audio generation models typically rely on Variational Autoencoders (VAEs) and perform generation within the VAE latent space. Although VAEs excel at compression and reconstruction, their latents inherently encode low-level acoustic details rather than semantically discriminative information, leading to entangled event semantics and complicating the training of generative models. To address these issues, we discard VAE acoustic latents and introduce semantic encoder latents, thereby proposing SemanticVocoder, a generative vocoder that directly synthesizes waveforms from semantic latents. Equipped with SemanticVocoder, our text-to-audio generation model achieves a Frechet Distance of 12.823 and a Frechet Audio Distance of 1.709 on the AudioCaps test set, as the introduced semantic latents exhibit superior discriminability compared to acoustic VAE latents. Beyond improved generation…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Music Technology and Sound Studies · Music and Audio Processing