DisContSE: Single-Step Diffusion Speech Enhancement Based on Joint Discrete and Continuous Embeddings

Yihui Fu; Tim Fingscheidt

arXiv:2601.21940 · eess.AS · January 30, 2026

Open Access

TL;DR

DisContSE introduces a novel single-step diffusion speech enhancement model that combines discrete and continuous embeddings, significantly improving speech quality and phonetic accuracy while reducing inference complexity.

Contribution

It is the first to achieve single-step diffusion speech enhancement using joint discrete and continuous audio codec features with a novel quantization error mask initialization.

Findings

01. Outperforms existing diffusion baselines in PESQ, POLQA, UTMOS

02. Achieves top subjective listening test scores

03. Reduces inference complexity with single-step process

Abstract

Diffusion speech enhancement methods operating on discrete audio codec features have gained immense attention due to their improved speech component reconstruction capability. However, they usually suffer from high inference computational complexity due to multiple reverse process iterations. Furthermore, they generally achieve promising results on non-intrusive metrics but perform poorly on intrusive metrics, as they may struggle to reconstruct the correct phones. In this paper, we propose DisContSE, an efficient diffusion-based speech enhancement model operating on joint discrete codec tokens and continuous embeddings. Our contributions are three-fold. First, we formulate both a discrete and a continuous enhancement module operating on discrete audio codec tokens and continuous embeddings, respectively, to achieve improved fidelity and intelligibility simultaneously. Second, a semantic enhancement module is…
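
As background on the distinction the abstract draws between discrete codec tokens and continuous embeddings, the following is a minimal sketch of nearest-neighbor vector quantization, a common way for an audio codec to map a continuous embedding to a token. It is not the DisContSE architecture and does not reproduce the paper's quantization error mask; the function name `quantize`, the codebook size, and the dimensions are assumptions chosen purely for illustration.

```python
import numpy as np

def quantize(embeddings, codebook):
    """Map each continuous embedding to its nearest codebook entry.

    Hypothetical helper for illustration; not taken from the paper.
    """
    # Squared Euclidean distance between every frame and every code vector
    dists = ((embeddings[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    tokens = dists.argmin(axis=1)      # one discrete token per frame
    quantized = codebook[tokens]       # the vector each token stands for
    error = embeddings - quantized     # detail lost when keeping only the token
    return tokens, quantized, error

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))     # assumed: 8 codes of dimension 4
embeddings = rng.normal(size=(5, 4))   # assumed: 5 frames of continuous features
tokens, quantized, error = quantize(embeddings, codebook)
```

The `error` array is the part of each continuous embedding that the token alone cannot represent, which gives some intuition for why a model might use tokens and continuous embeddings jointly.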

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Hearing Loss and Rehabilitation