Absorbing Discrete Diffusion for Speech Enhancement
Philippe Gonzalez

TL;DR
This paper introduces ADDSE, a novel diffusion-based method for speech enhancement that models clean speech codes conditioned on noisy inputs, leveraging neural codecs and hierarchical diffusion modeling.
Contribution
The paper proposes an absorbing discrete diffusion approach to speech enhancement and introduces RQDiT, a non-autoregressive model for the hierarchical structure of residual vector quantization codes.
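For readers unfamiliar with residual vector quantization, the hierarchy mentioned above can be illustrated with a minimal sketch: each stage quantizes whatever the previous stages left unexplained. The function names and toy codebooks below are assumptions made for illustration; they are not taken from the paper or from any particular neural codec.

```python
def nearest(codebook, v):
    """Index of the codebook entry closest to v (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(codebook[k], v)),
    )


def rvq_encode(codebooks, v):
    """Quantize v stage by stage; each stage encodes the previous residual."""
    residual = list(v)
    codes = []
    for cb in codebooks:
        k = nearest(cb, residual)
        codes.append(k)
        # Subtract the chosen entry so the next stage sees what is left.
        residual = [r - c for r, c in zip(residual, cb[k])]
    return codes


def rvq_decode(codebooks, codes):
    """Reconstruct the vector by summing the selected entry of every stage."""
    out = [0.0] * len(codebooks[0][0])
    for cb, k in zip(codebooks, codes):
        out = [o + c for o, c in zip(out, cb[k])]
    return out


# Toy two-stage codebooks (hypothetical values): coarse first, fine second.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.25, -0.25]],
]
codes = rvq_encode(codebooks, [1.2, 0.8])
reconstruction = rvq_decode(codebooks, codes)
```

Each frame therefore yields one code per stage, and the first codes carry the coarse content while later ones refine it. This depth-wise ordering is the hierarchical structure that a model over such codes has to handle.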
Findings
Competitive performance on non-intrusive objective metrics across two datasets
Particularly effective at low signal-to-noise ratios and with few sampling steps
Combines neural codecs with diffusion models
Abstract
Inspired by recent developments in neural speech coding and diffusion-based language modeling, we tackle speech enhancement by modeling the conditional distribution of clean speech codes given noisy speech codes using absorbing discrete diffusion. The proposed approach, which we call ADDSE, leverages both the expressive latent space of neural audio codecs and the non-autoregressive sampling procedure of diffusion models. To efficiently model the hierarchical structure of residual vector quantization codes, we propose RQDiT, which combines techniques from RQ-Transformer and diffusion Transformers for non-autoregressive modeling. Results show competitive performance in terms of non-intrusive objective metrics on two datasets, especially at low signal-to-noise ratios and with few sampling steps. Code and audio examples are available online.
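The general idea of absorbing discrete diffusion can be sketched as follows. This is a toy illustration under stated assumptions, not the ADDSE implementation: it assumes a linear schedule in which a token is masked with probability t at time t, uses a hypothetical `MASK` id for the absorbing state, and replaces the learned network with a placeholder `denoise` function that simply copies the conditioning codes.

```python
import random

MASK = -1  # hypothetical token id for the absorbing (mask) state


def absorb(codes, t, rng):
    """Forward process: mask each token independently with probability t."""
    return [MASK if rng.random() < t else c for c in codes]


def sample(denoise, noisy_codes, length, steps, rng):
    """Reverse process: start fully masked and reveal tokens over `steps` steps."""
    x = [MASK] * length
    for i in range(steps, 0, -1):
        t, s = i / steps, (i - 1) / steps
        pred = denoise(x, noisy_codes, t)
        # Under the assumed linear schedule, a token that is still masked at
        # time t is revealed with probability (t - s) / t when moving to s.
        x = [
            pred[j] if x[j] == MASK and rng.random() < (t - s) / t else x[j]
            for j in range(length)
        ]
    return x


def oracle_denoise(x, noisy_codes, t):
    """Placeholder for a trained model: predicts the conditioning codes."""
    return list(noisy_codes)


rng = random.Random(0)
clean = [3, 1, 4, 1, 5, 9, 2, 6]
half_masked = absorb(clean, 0.5, rng)
restored = sample(oracle_denoise, clean, len(clean), steps=4, rng=rng)
```

Tokens that have been revealed are never masked again, and the last step reveals every remaining token with probability 1, so the number of steps trades sampling cost against how many tokens are predicted at once. In a real system `denoise` would be a trained network conditioned on the noisy speech codes.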
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Advanced Data Compression Techniques
