Predictive-Generative Drift Decomposition for Speech Enhancement and Separation
Julius Richter, Yoshiki Masuyama, Christoph Boeddeker, Takahiro Edo, Gordon Wichern, Jonathan Le Roux

TL;DR
This paper introduces SIPS (Stochastic Interpolant Prior for Speech), a plug-and-play framework that combines pretrained predictive speech models with a generative speech prior, improving speech enhancement and separation across multiple predictors and degradation types.
Contribution
The paper presents a unified, mathematically grounded approach that integrates pretrained predictors with generative models for speech tasks, generalizing across predictors and degradation types.
Findings
Improves perceptual quality in speech separation by 1.0 NISQA points.
Combines predictors such as SEMamba and FlexIO with a generative prior.
Demonstrates robustness across different speech enhancement and separation tasks.
Abstract
We propose a plug-and-play framework for speech enhancement and separation that augments predictive methods with a generative speech prior. Our approach, termed Stochastic Interpolant Prior for Speech (SIPS), builds on stochastic interpolants and leverages their flexibility to bridge predictive and generative modeling. Specifically, we decompose the interpolation dynamics into a task-specific drift and a stochastic denoising component, allowing a predictive estimate to be integrated directly into the generative sampling process. This results in a mathematically grounded framework for combining strong pretrained predictors with the expressive power of generative models. To this end, we train a score model using only clean speech, yielding a degradation-agnostic prior that can be reused across tasks. During inference, the predictor provides a deterministic drift that steers the sampling…
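The split described in the abstract, a task-specific drift plus a stochastic denoising component, can be illustrated with a small toy sampler. This is only a sketch under stated assumptions, not the authors' implementation: the abstract does not give the exact equations, so the code borrows the generic stochastic-interpolant SDE form dx = [b + eps * s] dt + sqrt(2 * eps) dW, uses a constant velocity from the degraded input `y` to the predictor estimate `x_hat` as the task drift, and replaces the learned score model with the analytic score of a Gaussian. The names `prior_score`, `sample`, `eps`, and `gamma` are hypothetical.

```python
import numpy as np

def prior_score(x, t, mu=0.0, sigma=1.0, gamma=0.5):
    """Score of a toy Gaussian prior N(mu, sigma^2), smoothed by noise that
    shrinks as t goes from 0 to 1. Stand-in (assumption) for a score model
    trained on clean speech only."""
    var = sigma**2 + (gamma * (1.0 - t)) ** 2
    return -(x - mu) / var

def sample(y, x_hat, n_steps=100, eps=0.1, seed=0):
    """Euler-Maruyama integration from t=0 to t=1.

    y      : degraded observation (starting point of the trajectory)
    x_hat  : deterministic estimate from a pretrained predictor
    eps    : strength of the stochastic denoising component
    """
    rng = np.random.default_rng(seed)
    x = np.array(y, dtype=float)
    dt = 1.0 / n_steps
    task_drift = np.asarray(x_hat, dtype=float) - x  # constant velocity y -> x_hat
    for k in range(n_steps):
        t = k * dt
        denoise_drift = eps * prior_score(x, t)  # pull toward the prior
        noise = np.sqrt(2.0 * eps * dt) * rng.standard_normal(x.shape)
        x = x + (task_drift + denoise_drift) * dt + noise
    return x

if __name__ == "__main__":
    y = np.array([0.0, 1.0, -2.0])       # toy "degraded" signal
    x_hat = np.array([0.5, 0.8, -1.0])   # toy predictor output
    print(sample(y, x_hat, eps=0.0))     # purely predictive trajectory
    print(sample(y, x_hat, eps=0.1))     # predictor drift plus prior-driven denoising
```

With `eps=0.0` the score and noise terms vanish and the trajectory ends at the predictor estimate, so the predictive behaviour is recovered as a special case; larger `eps` lets the prior reshape the result. Because the toy score never sees `y`, the same prior could be reused with any predictor, mirroring the degradation-agnostic property claimed in the abstract.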
