Diffusion Buffer for Online Generative Speech Enhancement

Bunlong Lay; Rostislav Makarov; Simon Welker; Maris Hillemann; Timo Gerkmann

arXiv:2510.18744 · eess.AS · October 22, 2025

TL;DR

This paper introduces the Diffusion Buffer, an online generative speech enhancement model that needs only one neural network call per incoming signal frame. It reduces algorithmic latency from 320-960 ms to 32-176 ms, outperforms predictive models on unseen noisy speech data, and runs online on a consumer-grade GPU.

Contribution

The work presents a diffusion-based online speech enhancement method with a new neural network architecture (a 2D UNet aligned with the diffusion look-ahead) and a new loss function, which together lower algorithmic latency and improve enhancement performance.

Findings

01. Reduces algorithmic latency from 320-960 ms to 32-176 ms.

02. Outperforms predictive models on unseen noisy speech data.

03. Uses a 2D UNet architecture aligned with diffusion look-ahead.

Abstract

Online speech enhancement has mainly been reserved for predictive models. A key advantage of these models is that, for each incoming signal frame from a stream of data, the model is called only once for enhancement. In contrast, generative speech enhancement models often require multiple calls, resulting in a computational complexity that is too high for many online speech enhancement applications. This work presents the Diffusion Buffer, a generative diffusion-based speech enhancement model that requires only one neural network call per incoming signal frame from a stream of data and performs enhancement in an online fashion on a consumer-grade GPU. The key idea of the Diffusion Buffer is to align physical time with diffusion time-steps. The approach progressively denoises frames through physical time, so that frames further in the past have more noise removed. Consequently, an enhanced frame is output to the…
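The buffering idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class `ToyDiffusionBuffer`, the parameter `num_slots`, and the step function `halve` are hypothetical names introduced here, and `halve` is a placeholder that only scales values so the number of applied steps is visible in the output. A real system would call a trained network at that point, and details such as the noise schedule, sampler, and loss are omitted.

```python
from collections import deque


class ToyDiffusionBuffer:
    """Toy buffer: slot position stands in for the remaining diffusion steps."""

    def __init__(self, num_slots, step_fn):
        self.num_slots = num_slots  # assumed buffer length = reverse steps per frame
        self.step_fn = step_fn      # hypothetical stand-in for the neural network
        self.frames = deque()       # oldest frame on the left, newest on the right
        self.calls = 0              # counts calls to the stand-in "network"

    def push(self, noisy_frame):
        """Accept one incoming frame; return a finished frame, or None in warm-up."""
        self.frames.append(list(noisy_frame))
        n = len(self.frames)
        # The newest frame has num_slots steps left; each older frame has one fewer.
        steps_left = list(range(self.num_slots - n + 1, self.num_slots + 1))
        # Exactly one call advances every buffered frame by one reverse step.
        updated = self.step_fn(list(self.frames), steps_left)
        self.calls += 1
        self.frames = deque(updated)
        if n == self.num_slots:
            # The oldest frame has just taken its final step and leaves the buffer.
            return self.frames.popleft()
        return None


def halve(frames, steps_left):
    """Placeholder step (not a denoiser): scales by 0.5 so steps are countable."""
    return [[0.5 * x for x in frame] for frame in frames]


buf = ToyDiffusionBuffer(num_slots=4, step_fn=halve)
stream = [[float(k + 1)] * 3 for k in range(10)]
outputs = [buf.push(frame) for frame in stream]
```

In this toy run with `num_slots=4`, the first three pushes return `None` while the buffer fills, and every later push returns the frame that arrived three pushes earlier, after it has received four steps. Ten incoming frames cost ten calls, which mirrors the one-call-per-frame property the abstract describes; the delay before a frame is output grows with the assumed buffer length.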

Taxonomy

Topics: Speech and Audio Processing · Advanced Adaptive Filtering Techniques · Speech Recognition and Synthesis