PASE: Leveraging the Phonological Prior of WavLM for Low-Hallucination Generative Speech Enhancement
Xiaobin Rong, Qinwen Hu, Mansur Yesilbursa, Kamil Wojcicki, Jing Lu

TL;DR
This paper introduces PASE, a generative speech enhancement framework that leverages the phonological prior of WavLM to reduce hallucinations and improve the perceptual quality of enhanced speech.
Contribution
The paper proposes a novel generative speech enhancement method that uses WavLM's phonological prior and dual-stream vocoder training to mitigate hallucinations and enhance speech quality.
Findings
PASE outperforms state-of-the-art models in perceptual quality.
Significantly reduces linguistic hallucinations compared to prior methods.
Produces fewer acoustic hallucinations while maintaining speech naturalness.
Abstract
Generative models have shown remarkable performance in speech enhancement (SE), achieving superior perceptual quality over traditional discriminative approaches. However, existing generative SE approaches often overlook the risk of hallucination under severe noise, leading to incorrect spoken content or inconsistent speaker characteristics, which we term linguistic and acoustic hallucinations, respectively. We argue that linguistic hallucination stems from models' failure to constrain valid phonological structures and it is a more fundamental challenge. While language models (LMs) are well-suited for capturing the underlying speech structure through modeling the distribution of discrete tokens, existing approaches are limited in learning from noise-corrupted representations, which can lead to contaminated priors and hallucinations. To overcome these limitations, we propose the…
Taxonomy
Topics: Speech and Audio Processing · Hearing Loss and Rehabilitation · Speech Recognition and Synthesis
