Diffusion Buffer: Online Diffusion-based Speech Enhancement with Sub-Second Latency

Bunlong Lay; Rostislav Makarov; Timo Gerkmann

arXiv:2506.02908 · eess.AS · September 15, 2025

Open Access

TL;DR

This paper introduces Diffusion Buffer, an online speech enhancement method that runs a diffusion model over a sliding buffer of frames. It reaches sub-second latency, and the buffer size sets the trade-off between enhancement performance and delay in streaming applications.

Contribution

The paper adapts a sliding-window diffusion framework to online speech enhancement, enabling practical low-latency processing with better results than standard diffusion models.

Findings

01. Outperforms standard diffusion models in speech enhancement

02. Achieves an input-output latency on the order of 0.3 to 1 second on a GPU

03. Presented as the first practical diffusion-based solution for online speech enhancement

Abstract

Diffusion models are a class of generative models that have recently been used for speech enhancement with remarkable success but are computationally expensive at inference time. Therefore, these models are impractical for processing streaming data in real time. In this work, we adapt a sliding window diffusion framework to the speech enhancement task. Our approach progressively corrupts speech signals through time, assigning more noise to frames close to the present in a buffer. This approach outputs denoised frames with a delay proportional to the chosen buffer size, enabling a trade-off between performance and latency. Empirical results demonstrate that our method outperforms standard diffusion models and runs efficiently on a GPU, achieving an input-output latency on the order of 0.3 to 1 second. This marks the first practical diffusion-based solution for online speech enhancement.
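
The buffer mechanism described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `DiffusionBufferSketch`, the `step_fn` callback, the helper `algorithmic_delay_seconds`, and the 16 ms hop size used in the demo are all hypothetical, and the identity `step_fn` stands in for a trained reverse-diffusion step. The sketch only shows the scheduling idea: the newest frame enters at the highest noise level, every buffered frame takes one reverse step per incoming frame, and the oldest frame is emitted once it has no steps left, so the delay grows linearly with the buffer size.

```python
from collections import deque

import numpy as np


class DiffusionBufferSketch:
    """Toy scheduler: one reverse step per buffered frame per incoming frame."""

    def __init__(self, buffer_size, step_fn):
        self.buffer_size = buffer_size
        # step_fn(state, remaining_steps) -> new state. Hypothetical stand-in
        # for a learned reverse-diffusion step; not part of the source paper.
        self.step_fn = step_fn
        # Each entry is [state, remaining_steps]; the left end is the oldest frame.
        self.frames = deque()

    def push(self, noisy_frame):
        """Feed one noisy frame; return a finished frame or None while filling."""
        # The newest frame enters at the highest noise level.
        state = np.asarray(noisy_frame, dtype=float)
        self.frames.append([state, self.buffer_size])
        # All buffered frames advance by exactly one reverse step.
        for entry in self.frames:
            entry[0] = self.step_fn(entry[0], entry[1])
            entry[1] -= 1
        # The oldest frame leaves the buffer once it has no steps left.
        if self.frames[0][1] == 0:
            return self.frames.popleft()[0]
        return None


def algorithmic_delay_seconds(buffer_size, hop_seconds):
    """Delay of this sketch: a frame pushed at call t is emitted at call t + B - 1."""
    return (buffer_size - 1) * hop_seconds


if __name__ == "__main__":
    # Identity step: output frames are the input frames, delayed by B - 1 pushes.
    buf = DiffusionBufferSketch(buffer_size=4, step_fn=lambda x, k: x)
    for t in range(6):
        out = buf.push(np.full(2, float(t)))
        print(t, None if out is None else out.tolist())
    # Assumed hop size of 16 ms, chosen only for illustration.
    print(algorithmic_delay_seconds(4, 0.016))
```

In this sketch a larger buffer gives each frame more reverse steps before it is emitted, at the cost of a longer wait, which mirrors the performance versus latency trade-off the abstract describes. The delay computed here counts only buffering and ignores the compute time of each step.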

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Advanced Adaptive Filtering Techniques

Methods: Diffusion