StuPASE: Towards Low-Hallucination Studio-Quality Generative Speech Enhancement
Xiaobin Rong, Jun Gao, Zheng Wang, Mansur Yesilbursa, Kamil Wojcicki, Jing Lu

TL;DR
StuPASE is a generative speech enhancement model that finetunes PASE on dry targets and replaces its GAN-based generative module with a flow-matching module, aiming for studio-quality, low-hallucination speech even in noisy and reverberant environments.
Contribution
The paper introduces StuPASE, a new approach that improves perceptual quality and reduces hallucination in speech enhancement by finetuning PASE with dry targets and replacing the generative module with flow-matching.
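To make the "dry targets" idea concrete, here is a minimal sketch of how the two kinds of training target could be built from a dry utterance and a room impulse response (RIR). It is an illustration under stated assumptions, not the authors' data pipeline: the helper names `convolve` and `make_training_pair`, the `early_len` parameter, and the convention that the direct path sits at tap 0 with unit gain are all hypothetical.

```python
def convolve(signal, kernel):
    """Full linear convolution of two lists of floats."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def make_training_pair(dry, rir, early_len=None):
    """Return (reverberant_input, target) for one utterance.

    Assumption: rir[0] is the direct path with unit gain, so the dry
    signal is already time-aligned with the reverberant input.

    early_len=None -> dry target: the anechoic signal itself.
    early_len=k    -> target keeps the first k RIR taps, i.e. the direct
                      path plus simulated early reflections.
    """
    n = len(dry)
    reverberant = convolve(dry, rir)[:n]
    if early_len is None:
        target = list(dry)
    else:
        target = convolve(dry, rir[:early_len])[:n]
    return reverberant, target


# Toy impulse as the "utterance" and a decaying 4-tap RIR.
dry = [1.0, 0.0, 0.0, 0.0]
rir = [1.0, 0.5, 0.25, 0.125]

wet, dry_target = make_training_pair(dry, rir)
_, early_target = make_training_pair(dry, rir, early_len=2)
```

With the dry target, the model is asked to remove every reflection; with the early-reflection target, the first reflections remain in what the model learns to output.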
Findings
Outperforms state-of-the-art SE methods in perceptual quality.
Maintains low hallucination across various challenging conditions.
Achieves studio-quality speech enhancement in adverse environments.
Abstract
Achieving high perceptual quality without hallucination remains a challenge in generative speech enhancement (SE). A representative approach, PASE, is robust to hallucination but has limited perceptual quality under adverse conditions. We propose StuPASE, built upon PASE to achieve studio-level quality while retaining its low-hallucination property. First, we show that finetuning PASE with dry targets rather than targets containing simulated early reflections substantially improves dereverberation. Second, to address performance limitations under strong additive noise, we replace the GAN-based generative module in PASE with a flow-matching module, enabling studio-quality generation even under highly challenging conditions. Experiments demonstrate that StuPASE consistently produces perceptually high-quality speech while maintaining low hallucination, outperforming state-of-the-art SE…
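For readers unfamiliar with flow matching, the sketch below shows the generic recipe on a toy vector: a straight-line path between a noise sample and a clean sample, a regression loss on the velocity along that path, and Euler integration at inference. It is a generic illustration, not the StuPASE module; what the paper's module is conditioned on and how it is parameterized are not stated in this summary. All function names are hypothetical, and `oracle_velocity` is a stand-in for a trained network that cheats by reading the clean target from `cond`.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight path; constant in t."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(velocity_fn, x0, x1, t, cond):
    """Mean squared error between predicted and target velocity at time t."""
    xt = interpolate(x0, x1, t)
    pred = velocity_fn(xt, t, cond)
    tgt = target_velocity(x0, x1)
    return sum((p - g) ** 2 for p, g in zip(pred, tgt)) / len(tgt)


def euler_sample(velocity_fn, x0, cond, steps=8):
    """Integrate dx/dt = v(x, t, cond) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # stays strictly below 1, so the oracle never divides by 0
        v = velocity_fn(x, t, cond)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def oracle_velocity(x, t, cond):
    """Toy stand-in for a trained network: points straight at cond."""
    return [(c - xi) / (1.0 - t) for c, xi in zip(cond, x)]


noise = [0.3, -1.2, 0.8]   # x0: sample from the prior
clean = [1.0, 0.0, -0.5]   # x1: toy "clean speech" feature vector

loss = flow_matching_loss(oracle_velocity, noise, clean, t=0.3, cond=clean)
generated = euler_sample(oracle_velocity, noise, cond=clean, steps=8)
```

In a real system the oracle would be replaced by a neural network trained to minimize this loss over random `t`, noise samples, and training pairs, with `cond` derived from the degraded input rather than the clean signal.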
Taxonomy
Topics: Speech and Audio Processing · Hearing Loss and Rehabilitation · Advanced Image Processing Techniques
