Diffusion Buffer: Online Diffusion-based Speech Enhancement with Sub-Second Latency
Bunlong Lay, Rostislav Makarov, Timo Gerkmann

TL;DR
This paper introduces Diffusion Buffer, a real-time speech enhancement method using diffusion models with sub-second latency, balancing performance and delay for streaming applications.
Contribution
It adapts a sliding window diffusion framework for online speech enhancement, enabling practical, low-latency processing with improved results over standard diffusion models.
Findings
Outperforms standard diffusion models in speech enhancement
Achieves an input-output latency of roughly 0.3 to 1 second on a GPU
First practical diffusion-based online speech enhancement solution
Abstract
Diffusion models are a class of generative models that have recently been used for speech enhancement with remarkable success but are computationally expensive at inference time. Therefore, these models are impractical for processing streaming data in real time. In this work, we adapt a sliding window diffusion framework to the speech enhancement task. Our approach progressively corrupts speech signals through time, assigning more noise to frames close to the present in a buffer. This approach outputs denoised frames with a delay proportional to the chosen buffer size, enabling a trade-off between performance and latency. Empirical results demonstrate that our method outperforms standard diffusion models and runs efficiently on a GPU, achieving an input-output latency on the order of 0.3 to 1 second. This marks the first practical diffusion-based solution for online speech enhancement.
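The buffer mechanism described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class name DiffusionBufferSketch, the callback denoise_buffer (a stand-in for one network call over the whole buffer), the integer noise levels, and the frame hop used in the latency helper are all assumptions made for illustration. The sketch only shows the bookkeeping: the newest frame enters at the highest noise level, every frame in the buffer moves one level down per incoming frame, and the oldest frame is emitted once it reaches level 0.

```python
def latency_seconds(buffer_frames, hop_seconds):
    """Approximate output delay: proportional to the buffer size.

    In this sketch a frame is emitted on the B-th push counting its own
    arrival, so it waits B - 1 further hops plus compute time.
    """
    return buffer_frames * hop_seconds


class DiffusionBufferSketch:
    """Toy sliding-window buffer; all names here are hypothetical."""

    def __init__(self, buffer_frames, denoise_buffer):
        self.size = buffer_frames
        # Hypothetical callback: (frames, levels) -> frames, representing
        # one denoising step applied jointly to every frame in the buffer.
        self.denoise_buffer = denoise_buffer
        self.frames = []  # oldest frame first
        self.levels = []  # noise level per frame; higher means noisier

    def push(self, noisy_frame):
        # The newest frame is closest to the present: highest noise level.
        self.frames.append(noisy_frame)
        self.levels.append(self.size)
        # One denoising step per incoming frame, over the whole buffer.
        self.frames = list(self.denoise_buffer(self.frames, self.levels))
        self.levels = [level - 1 for level in self.levels]
        # The oldest frame is emitted once it has reached noise level 0.
        if self.levels[0] == 0:
            self.levels.pop(0)
            return self.frames.pop(0)
        return None


# Usage with an identity "denoiser", just to show the output delay.
buf = DiffusionBufferSketch(3, lambda frames, levels: frames)
outputs = [buf.push(name) for name in ["f0", "f1", "f2", "f3"]]
# outputs == [None, None, "f0", "f1"]
```

Under these assumptions, a larger buffer gives each frame more denoising steps before it is emitted, at the cost of a longer delay, which is the performance-latency trade-off the abstract describes. For instance, with an assumed hop of 16 ms, a 20-frame buffer corresponds to about 0.32 s by latency_seconds(20, 0.016).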
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Advanced Adaptive Filtering Techniques
Methods: Diffusion
