DisContSE: Single-Step Diffusion Speech Enhancement Based on Joint Discrete and Continuous Embeddings
Yihui Fu, Tim Fingscheidt

TL;DR
DisContSE is a single-step diffusion speech enhancement model that combines discrete audio codec tokens with continuous embeddings, improving speech quality and phonetic accuracy while reducing inference complexity.
Contribution
DisContSE is presented as the first model to achieve single-step diffusion speech enhancement using joint discrete and continuous audio codec features, together with a novel quantization error mask initialization.
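To make the terms "discrete tokens", "continuous embeddings", and "quantization error" concrete, here is a minimal sketch, assuming a single nearest-neighbor codebook as commonly used in audio codecs. The function name `quantize`, the toy sizes, and the random codebook are hypothetical and not taken from the paper, and the sketch does not implement DisContSE's quantization error mask initialization. It only shows how one continuous embedding splits into a discrete token plus a residual error.

```python
import random


def quantize(embeddings, codebook):
    """Map each continuous embedding to its nearest codebook entry.

    Returns (tokens, errors): tokens are codebook indices (the discrete
    view), errors are embedding - codebook[token] (what quantization loses).
    """
    tokens, errors = [], []
    for e in embeddings:
        # Squared Euclidean distance to every codebook vector.
        dists = [sum((a - b) ** 2 for a, b in zip(e, c)) for c in codebook]
        k = dists.index(min(dists))
        tokens.append(k)
        errors.append([a - b for a, b in zip(e, codebook[k])])
    return tokens, errors


# Hypothetical toy sizes, chosen only for illustration.
rng = random.Random(0)
dim, codebook_size, num_frames = 4, 8, 5
codebook = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(codebook_size)]
embeddings = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_frames)]

tokens, errors = quantize(embeddings, codebook)

# Codebook vector plus quantization error recovers the continuous embedding.
reconstructed = [
    [c + r for c, r in zip(codebook[k], err)]
    for k, err in zip(tokens, errors)
]
```

The tokens alone discard the residual, which is one way to see why a model might want access to both the discrete and the continuous view of the same frame.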
Findings
Outperforms existing diffusion baselines in PESQ, POLQA, and UTMOS
Achieves the top scores in a subjective listening test
Reduces inference complexity through a single-step reverse process
Abstract
Diffusion speech enhancement methods operating on discrete audio codec features have gained immense attention due to their improved speech component reconstruction capability. However, they usually suffer from high inference computational complexity due to multiple reverse process iterations. Furthermore, they generally achieve promising results on non-intrusive metrics but perform poorly on intrusive metrics, as they may struggle to reconstruct the correct phones. In this paper, we propose DisContSE, an efficient diffusion-based speech enhancement model operating on joint discrete codec tokens and continuous embeddings. Our contributions are three-fold. First, we formulate both a discrete and a continuous enhancement module operating on discrete audio codec tokens and continuous embeddings, respectively, to achieve improved fidelity and intelligibility simultaneously. Second, a semantic enhancement module is…
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Hearing Loss and Rehabilitation
