Real-Time Streamable Generative Speech Restoration with Flow Matching
Simon Welker, Bunlong Lay, Maris Hillemann, Tal Peer, Timo Gerkmann

TL;DR
This paper introduces Stream.FM, a flow-based generative speech restoration model that runs in real time on consumer GPUs with a total latency of 48 ms.
Contribution
The paper presents a low-latency, streaming-compatible flow-based model for speech restoration, together with an optimized DNN architecture, a buffered streaming inference scheme, learned few-step numerical solvers, and model weight compression.
Findings
Stream.FM achieves an algorithmic latency of 32 ms and a total latency of 48 ms for real-time speech processing.
It outperforms previous diffusion-based models in streaming speech enhancement.
High-quality speech restoration is feasible on consumer GPUs with the proposed methods.
Abstract
Diffusion-based generative models have greatly impacted the speech processing field in recent years, exhibiting high speech naturalness and spawning a new research direction. Their application in real-time communication is, however, still lagging behind due to their computation-heavy nature involving multiple calls of large DNNs. Here, we present Stream.FM, a frame-causal flow-based generative model with an algorithmic latency of 32 milliseconds (ms) and a total latency of 48 ms, paving the way for generative speech processing in real-time communication. We propose a buffered streaming inference scheme and an optimized DNN architecture, show how learned few-step numerical solvers can boost output quality at a fixed compute budget, explore model weight compression to find favorable points along a compute/quality tradeoff, and contribute a model variant with 24 ms total latency for…
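The abstract refers to few-step numerical solvers running at a fixed compute budget. As a rough illustration of what "few-step" means for a flow-based model, the sketch below integrates an ODE from t = 0 to t = 1 with a small, fixed number of Euler steps, where each step would cost one DNN call in a real system. This is a plain fixed-step Euler scheme, not the learned solver the paper proposes, and the names euler_few_step and make_toy_velocity, the 8-dimensional toy "frame", and the closed-form velocity field are all assumptions made for illustration.

```python
import numpy as np

def euler_few_step(velocity_fn, x0, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = np.asarray(x0, dtype=np.float64).copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        # In a real model, this is the single network call spent on this step.
        x = x + dt * velocity_fn(x, t)
    return x

def make_toy_velocity(target):
    # Hypothetical stand-in for a trained DNN: the exact velocity of a
    # straight-line path that reaches `target` at t=1.
    def velocity(x, t):
        return (target - x) / (1.0 - t)
    return velocity

rng = np.random.default_rng(0)
target = rng.standard_normal(8)  # stands in for one clean feature frame
x0 = rng.standard_normal(8)      # starting point, e.g. a degraded frame
restored = euler_few_step(make_toy_velocity(target), x0, num_steps=4)
```

With this idealized velocity field, the final Euler step lands on the target regardless of the step count, so the example only shows the control flow. With a learned velocity field the step count trades output quality against compute, which is the tradeoff the abstract describes.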
