Investigating training objective for flow matching-based speech enhancement
Liusha Yang, Ziru Ge, Gui Zhang, Junan Zhang, Zhizheng Wu

TL;DR
This paper systematically studies flow matching for speech enhancement, comparing three training objectives and adding perceptual (PESQ) and signal-based (SI-SDR) objectives to improve convergence and speech quality.
Contribution
It analyzes three flow matching training objectives (velocity prediction, $x_1$ prediction, and preconditioned $x_1$ prediction) and adds PESQ- and SI-SDR-based objectives to improve convergence efficiency and speech quality.
Findings
Preconditioned $x_1$ prediction improves training stability.
Adding PESQ and SI-SDR training objectives improves convergence efficiency and speech quality.
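Flow matching is a more computationally efficient alternative to score matching and Schrödinger bridge approaches, and yields substantial improvements across evaluation metrics.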
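As a rough illustration of the signal-based metric named above, the following sketch computes SI-SDR with NumPy using its common definition (zero-mean signals, projection of the estimate onto the reference). The function name `si_sdr`, the `eps` stabilizer, and the synthetic test signals are assumptions made for this example; the summary does not say how the authors implement the metric or how they turn it into a training objective.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB (common definition).

    Hypothetical helper for illustration; not taken from the paper.
    """
    # Remove the mean so a DC offset does not affect the score.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to get the scaled target.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    # Everything not explained by the scaled reference counts as distortion.
    distortion = estimate - target
    ratio = (np.dot(target, target) + eps) / (np.dot(distortion, distortion) + eps)
    return 10.0 * np.log10(ratio)

# Synthetic stand-ins for clean and enhanced speech (assumed, 1 s at 16 kHz).
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
noise = rng.standard_normal(16000)
mild = clean + 0.1 * noise    # lightly degraded estimate
heavy = clean + 0.5 * noise   # more heavily degraded estimate

score_mild = si_sdr(mild, clean)
score_heavy = si_sdr(heavy, clean)
score_scaled = si_sdr(2.0 * mild, clean)  # rescaling should not change the score
```

If a quantity like this were used as a loss, one would typically minimize its negative, since higher SI-SDR is better; whether the paper does exactly that is not stated in this summary.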
Abstract
Speech enhancement (SE) aims to recover clean speech from noisy recordings. Although generative approaches such as score matching and Schrödinger bridge have shown strong effectiveness, they are often computationally expensive. Flow matching offers a more efficient alternative by directly learning a velocity field that maps noise to data. In this work, we present a systematic study of flow matching for SE under three training objectives: velocity prediction, $x_1$ prediction, and preconditioned $x_1$ prediction. We analyze their impact on training dynamics and overall performance. Moreover, by introducing perceptual (PESQ) and signal-based (SI-SDR) objectives, we further enhance convergence efficiency and speech quality, yielding substantial improvements across evaluation metrics.
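To make the difference between velocity prediction and $x_1$ prediction concrete, here is a minimal NumPy sketch. It assumes the common linear probability path $x_t = (1 - t)\,x_0 + t\,x_1$, with $x_0$ a source sample and $x_1$ the clean speech; the abstract does not state the path or the source distribution the authors use, so these choices, along with all function names, are assumptions. The preconditioned variant is not shown because the summary does not specify its form.

```python
import numpy as np

def interpolate(x0, x1, t):
    # Point on the assumed linear path between source x0 and data x1.
    return (1.0 - t) * x0 + t * x1

def velocity_target(x0, x1):
    # Regression target for velocity prediction on the linear path.
    return x1 - x0

def velocity_from_x1(x1_hat, xt, t, eps=1e-5):
    # Convert an x1 prediction into a velocity; the clamp avoids
    # division by zero as t approaches 1.
    return (x1_hat - xt) / np.maximum(1.0 - t, eps)

rng = np.random.default_rng(0)
x1 = rng.standard_normal(16000)  # stand-in for a clean speech signal
x0 = rng.standard_normal(16000)  # stand-in for a source (noise) sample
t = 0.3
xt = interpolate(x0, x1, t)

v = velocity_target(x0, x1)
# With a perfect x1 prediction, the implied velocity matches the target.
v_from_x1 = velocity_from_x1(x1, xt, t)

# Euler integration of the true velocity carries x0 to x1.
n_steps = 10
x_end = x0.copy()
for _ in range(n_steps):
    x_end = x_end + v / n_steps
```

Under this path the two targets carry the same information, but the factor $1/(1 - t)$ in the conversion grows as $t$ approaches 1, which is one plausible reason the choice of objective can affect training behavior.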
Taxonomy
Topics: Speech and Audio Processing · Hearing Loss and Rehabilitation · Speech Recognition and Synthesis
