Investigating training objective for flow matching-based speech enhancement

Liusha Yang; Ziru Ge; Gui Zhang; Junan Zhang; Zhizheng Wu

arXiv:2512.10382 · cs.SD · December 12, 2025

TL;DR

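This paper systematically studies flow matching for speech enhancement, comparing three training objectives (velocity prediction, $x_1$ prediction, and preconditioned $x_1$ prediction) and adding perceptual (PESQ) and signal-based (SI-SDR) objectives to improve convergence efficiency and speech quality.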

Contribution

The paper analyzes how the choice of flow matching training objective affects training dynamics and overall performance, and shows that adding PESQ and SI-SDR objectives further improves results across evaluation metrics.

Findings

01

Preconditioned $x_1$ prediction improves training stability.

02

Incorporating PESQ and SI-SDR as training objectives improves convergence efficiency and speech quality.

03

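Flow matching offers a more efficient alternative to score matching and Schrödinger bridge approaches, and the added objectives yield substantial improvements across evaluation metrics.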
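Finding 02 refers to SI-SDR, the scale-invariant signal-to-distortion ratio. As a rough illustration of what a signal-based objective of this kind measures, the sketch below computes SI-SDR in its usual form (project the estimate onto the reference, then compare the energy of the projection with the energy of the residual). The function names `si_sdr` and `si_sdr_loss` and the toy signals are assumptions made for this example; how the paper weights or combines this term with the flow matching loss is not stated on this page, and PESQ is not sketched here.

```python
import math
import random


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB for two equal-length 1-D signals (lists of floats)."""
    n = len(reference)
    # Remove the mean of each signal, as is conventional for SI-SDR.
    mean_e = sum(estimate) / n
    mean_r = sum(reference) / n
    e = [v - mean_e for v in estimate]
    r = [v - mean_r for v in reference]
    # Optimal scaling of the reference that best explains the estimate.
    alpha = sum(a * b for a, b in zip(e, r)) / (sum(b * b for b in r) + eps)
    target = [alpha * b for b in r]
    residual = [a - s for a, s in zip(e, target)]
    num = sum(s * s for s in target) + eps
    den = sum(d * d for d in residual) + eps
    return 10.0 * math.log10(num / den)


def si_sdr_loss(estimate, reference):
    """Negative SI-SDR, so that minimizing the loss maximizes SI-SDR."""
    return -si_sdr(estimate, reference)


# Toy signals (hypothetical): a sine wave with two levels of additive noise.
rng = random.Random(0)
clean = [math.sin(2.0 * math.pi * 5.0 * i / 800.0) for i in range(800)]
noise = [rng.gauss(0.0, 1.0) for _ in clean]
mild = [c + 0.1 * n for c, n in zip(clean, noise)]
heavy = [c + 0.5 * n for c, n in zip(clean, noise)]
```

On these toy signals, `si_sdr(mild, clean)` is higher than `si_sdr(heavy, clean)`, and rescaling the estimate leaves the score essentially unchanged, which is the property that makes the metric scale-invariant.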

Abstract

Speech enhancement (SE) aims to recover clean speech from noisy recordings. Although generative approaches such as score matching and Schrödinger bridge have shown strong effectiveness, they are often computationally expensive. Flow matching offers a more efficient alternative by directly learning a velocity field that maps noise to data. In this work, we present a systematic study of flow matching for SE under three training objectives: velocity prediction, $x_{1}$ prediction, and preconditioned $x_{1}$ prediction. We analyze their impact on training dynamics and overall performance. Moreover, by introducing perceptual (PESQ) and signal-based (SI-SDR) objectives, we further enhance convergence efficiency and speech quality, yielding substantial improvements across evaluation metrics.
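To make the three objectives named in the abstract concrete, here is a minimal sketch under stated assumptions. It assumes the common linear probability path $x_t = (1-t)\,x_0 + t\,x_1$, with $x_1$ the clean signal and $x_0$ a noise-perturbed version of the noisy recording; the abstract does not specify the path. The preconditioning coefficients `c_skip(t) = t` and `c_out(t) = 1 - t` are illustrative placeholders, not the paper's choice, and all function names and the `model(xt, t)` callable are hypothetical.

```python
import random


def interpolate(x0, x1, t):
    """Point on the assumed linear path: x_t = (1 - t) * x0 + t * x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def mse(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)


def velocity_loss(model, x0, x1, t):
    """Velocity prediction: regress the path's constant velocity x1 - x0."""
    xt = interpolate(x0, x1, t)
    target = [b - a for a, b in zip(x0, x1)]
    return mse(model(xt, t), target)


def x1_loss(model, x0, x1, t):
    """x1 prediction: regress the clean signal directly."""
    xt = interpolate(x0, x1, t)
    return mse(model(xt, t), x1)


def preconditioned_x1(model, xt, t):
    """x1 estimate as a skip connection plus a scaled network output.

    ASSUMPTION: c_skip(t) = t and c_out(t) = 1 - t are placeholders chosen
    so that the estimate equals x_t at t = 1; the paper may use others.
    """
    raw = model(xt, t)
    return [t * a + (1.0 - t) * f for a, f in zip(xt, raw)]


def preconditioned_x1_loss(model, x0, x1, t):
    xt = interpolate(x0, x1, t)
    return mse(preconditioned_x1(model, xt, t), x1)


def velocity_from_x1(x1_hat, xt, t, eps=1e-4):
    """Convert an x1 estimate back to a velocity for ODE sampling."""
    scale = 1.0 / max(1.0 - t, eps)
    return [(a - b) * scale for a, b in zip(x1_hat, xt)]


# Toy data (hypothetical): clean frame x1, noisy recording y, start point x0.
rng = random.Random(0)
x1 = [rng.uniform(-1.0, 1.0) for _ in range(16)]
y = [v + 0.3 * rng.gauss(0.0, 1.0) for v in x1]
x0 = [v + 0.5 * rng.gauss(0.0, 1.0) for v in y]
t = 0.3
xt = interpolate(x0, x1, t)
```

Under the linear path, an exact $x_1$ estimate maps back to the true velocity through `velocity_from_x1`, so the objectives differ in what the network regresses and how errors are weighted over $t$, not in the sampler they feed. The division by $1 - t$ is why the conversion is clamped near $t = 1$.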

Taxonomy

Topics: Speech and Audio Processing · Hearing Loss and Rehabilitation · Speech Recognition and Synthesis