DiffSoundStream: Efficient Speech Tokenization via Diffusion Decoding

Yang Yang; Yunpeng Li; George Sung; Shao-Fu Shih; Craig Dooley; Alessio Centazzo; Ramanan Rajeswaran

arXiv:2506.22362·eess.AS·June 30, 2025

Open Access

TL;DR

DiffSoundStream makes speech tokenization more efficient in non-streaming scenarios: a latent diffusion decoder synthesizes high-quality speech at 50 tokens per second, and sampling can be reduced to four diffusion steps with only minor quality loss.

Contribution

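The paper proposes DiffSoundStream, which combines two techniques: conditioning the neural codec on semantic tokens to minimize redundancy between semantic and acoustic tokens, and using latent diffusion models to synthesize waveforms from semantic and coarse-level acoustic tokens.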

Findings

01

Achieves speech quality comparable to higher-token-rate models.

02

Reduces diffusion sampling steps to four with minor quality loss.

03

Maintains high-quality speech synthesis at 50 tokens/sec.

Abstract

Token-based language modeling is a prominent approach for speech generation, where tokens are obtained by quantizing features from self-supervised learning (SSL) models and extracting codes from neural speech codecs, generally referred to as semantic tokens and acoustic tokens. These tokens are often modeled autoregressively, with the inference speed being constrained by the token rate. In this work, we propose DiffSoundStream, a solution that improves the efficiency of speech tokenization in non-streaming scenarios through two techniques: (1) conditioning the neural codec on semantic tokens to minimize redundancy between semantic and acoustic tokens, and (2) leveraging latent diffusion models to synthesize high-quality waveforms from semantic and coarse-level acoustic tokens. Experiments show that at 50 tokens per second, DiffSoundStream achieves speech quality on par with a standard…
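
To illustrate why the token rate constrains autoregressive inference speed, the following minimal sketch counts decoding steps for an utterance, assuming one token is generated per autoregressive step. The 50 tokens per second figure comes from the abstract; the 10-second utterance length, the 100 tokens per second comparison rate, and the function name `autoregressive_steps` are hypothetical values chosen for illustration, not figures from the paper.

```python
def autoregressive_steps(duration_s: float, tokens_per_second: float) -> int:
    """Return the number of tokens for an utterance of the given length.

    Assumes one token is produced per autoregressive decoding step, so this
    is also the number of sequential model calls needed.
    """
    return round(duration_s * tokens_per_second)


duration_s = 10.0   # hypothetical utterance length in seconds
low_rate = 50       # tokens per second, the rate reported in the abstract
high_rate = 100     # hypothetical higher token rate, for comparison only

steps_low = autoregressive_steps(duration_s, low_rate)
steps_high = autoregressive_steps(duration_s, high_rate)

print(f"{low_rate} tokens/s  -> {steps_low} sequential steps")
print(f"{high_rate} tokens/s -> {steps_high} sequential steps")
print(f"step ratio: {steps_high / steps_low:.1f}x")
```

Under these assumptions, the number of sequential steps scales linearly with the token rate, which is why lowering the rate while preserving quality speeds up generation.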

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Language Development and Disorders