SRTNet: Time Domain Speech Enhancement Via Stochastic Refinement
Zhibin Qiu, Mengfan Fu, Yinfeng Yu, LiLi Yin, Fuchun Sun, Hao Huang

TL;DR
SRTNet introduces a novel time-domain speech enhancement method using stochastic refinement with a joint deterministic and stochastic network, demonstrating faster training and sampling with improved quality over existing approaches.
Contribution
The paper presents SRTNet, a new stochastic refinement framework for speech enhancement in the time domain, combining deterministic and stochastic modules for improved performance.
Findings
Faster training and sampling compared to traditional diffusion models
Higher quality speech enhancement results
Feasibility demonstrated both theoretically and experimentally
Abstract
Diffusion models, a class of generative models now popular in image generation and audio synthesis, are rarely used in speech enhancement. In this paper, we use a diffusion model as a module for stochastic refinement. We propose SRTNet, a novel method for speech enhancement via Stochastic Refinement entirely in the Time domain. Specifically, we design a joint network consisting of a deterministic module and a stochastic module, which together form the "enhance-and-refine" paradigm. We theoretically demonstrate the feasibility of our method and show experimentally that it achieves faster training, faster sampling, and higher quality. Our code and enhanced samples are available at https://github.com/zhibinQiu/SRTNet.git.
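The "enhance-and-refine" idea can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the architecture, so the sketch assumes that the deterministic module produces a coarse estimate and that the stochastic module samples an additive correction with a generic DDPM-style reverse process conditioned on that estimate. Every name here (deterministic_enhance, toy_noise_predictor, stochastic_refine, enhance_and_refine) is hypothetical, the moving average merely stands in for a trained enhancement network, and the noise predictor is an untrained placeholder that returns zeros.

```python
import math
import random


def deterministic_enhance(noisy, kernel=5):
    """Placeholder for the deterministic module: a centered moving average."""
    half = kernel // 2
    out = []
    for i in range(len(noisy)):
        window = noisy[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out


def toy_noise_predictor(x_t, coarse, t):
    """Hypothetical stand-in for a trained noise-prediction network.

    A real model would use x_t, the coarse estimate, and the step t;
    this untrained placeholder simply predicts zero noise.
    """
    return [0.0] * len(x_t)


def stochastic_refine(coarse, betas, rng, predictor=toy_noise_predictor):
    """Generic DDPM-style ancestral sampling of a correction signal."""
    # Cumulative products of alpha_t = 1 - beta_t.
    alpha_bars = []
    running = 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)

    # Start from standard Gaussian noise in the time domain.
    x = [rng.gauss(0.0, 1.0) for _ in coarse]
    for t in range(len(betas) - 1, -1, -1):
        eps = predictor(x, coarse, t)
        scale = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        inv_sqrt_alpha = 1.0 / math.sqrt(1.0 - betas[t])
        # Posterior mean of the reverse step.
        x = [inv_sqrt_alpha * (xi - scale * ei) for xi, ei in zip(x, eps)]
        if t > 0:
            # Add noise on every step except the last one.
            sigma = math.sqrt(betas[t])
            x = [xi + sigma * rng.gauss(0.0, 1.0) for xi in x]
    return x


def enhance_and_refine(noisy, betas, seed=0):
    """Assumed composition: coarse estimate plus a sampled correction."""
    rng = random.Random(seed)
    coarse = deterministic_enhance(noisy)
    correction = stochastic_refine(coarse, betas, rng)
    return [c + r for c, r in zip(coarse, correction)]


if __name__ == "__main__":
    data_rng = random.Random(1)
    noisy = [math.sin(0.05 * n) + 0.1 * data_rng.gauss(0.0, 1.0) for n in range(200)]
    betas = [1e-4 + i * (0.05 - 1e-4) / 5 for i in range(6)]
    enhanced = enhance_and_refine(noisy, betas)
    print(len(enhanced))
```

Because the placeholder predictor is untrained, the output is not a better signal than the coarse estimate; the sketch only shows how a deterministic pass and a short reverse-diffusion loop could be chained. Using a short schedule (six steps here, an arbitrary choice) hints at why a refinement stage that starts from an already enhanced signal might need fewer sampling steps than generating speech from noise alone.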
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
