Aligning Generative Speech Enhancement with Perceptual Feedback
Haoyang Li, Nana Hou, Yuchen Hu, Jixun Yao, Sabato Marco Siniscalchi, Xuyi Zhuang, Deheng Ye, Wei Yang, Eng Siong Chng

TL;DR
This paper aligns LM-based speech enhancement with human perceptual preferences by applying Direct Preference Optimization (DPO) with UTMOS, a neural MOS predictor, as a proxy for human ratings. It reports relative gains of up to 56% in speech quality metrics on the Deep Noise Suppression Challenge 2020 test sets.
Contribution
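The authors present this as the first work to incorporate perceptual feedback into LM-based speech enhancement. It applies DPO with UTMOS scores as the preference signal, so training optimizes directly for perceptual quality rather than only token-level likelihood.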
Findings
Achieved up to 56% relative improvement in speech quality metrics on the Deep Noise Suppression Challenge 2020 test sets.
First integration of perceptual feedback in LM-based speech enhancement.
The approach is described as broadly applicable within LM-based SE frameworks.
Abstract
Language Model (LM)-based speech enhancement (SE) has recently emerged as a promising direction, but existing approaches predominantly rely on token-level likelihood objectives that weakly reflect human perception. This mismatch limits progress, as optimizing signal accuracy does not always improve naturalness or listening comfort. We address this gap by introducing a perceptually aligned LM-based SE approach. Our method applies Direct Preference Optimization (DPO) with UTMOS, a neural MOS predictor, as a proxy for human ratings, directly steering models toward perceptually preferred outputs. This design directly connects model training to perceptual quality and is broadly applicable within LM-based SE frameworks. On the Deep Noise Suppression Challenge 2020 test sets, our approach consistently improves speech quality metrics, achieving relative gains of up to 56%. To our knowledge,…
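The preference-optimization step described in the abstract can be sketched as follows. This is a minimal illustration of the standard DPO loss for a single preference pair, not the authors' implementation. The helper `select_preference_pair` and its rule (take the best- and worst-scored candidates) are assumptions made for illustration, and the scores are assumed to come from a MOS predictor such as UTMOS that was run beforehand; the abstract does not say how pairs are built.

```python
import math


def select_preference_pair(candidates, scores):
    """Pick (chosen, rejected) as the highest- and lowest-scored candidates.

    Hypothetical pairing rule: `scores` stands in for MOS-predictor outputs
    (e.g. UTMOS) computed elsewhere for each enhanced candidate.
    """
    if len(candidates) != len(scores) or len(candidates) < 2:
        raise ValueError("need at least two candidates with one score each")
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return candidates[order[-1]], candidates[order[0]]


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one pair of sequence log-probabilities.

    loss = -log(sigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l))))
    where w is the preferred output and l is the rejected one.
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Toy usage: three candidate token sequences with made-up quality scores.
chosen, rejected = select_preference_pair(["a", "b", "c"], [3.1, 4.2, 2.0])
loss = dpo_loss(-1.0, -3.0, -2.0, -2.0, beta=0.1)
```

The loss equals log 2 when the policy and the frozen reference model agree on both outputs, and it falls as the policy assigns relatively more probability to the preferred output than the reference does. The value `beta=0.1` is a commonly used default, not a setting taken from the paper.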
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis
