ctPuLSE: Close-Talk, and Pseudo-Label Based Far-Field, Speech Enhancement

Zhong-Qiu Wang

arXiv:2407.19485 · eess.AS · September 25, 2025

TL;DR

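This paper introduces ctPuLSE, a training approach for far-field speech enhancement. A model trained on simulated data first enhances real close-talk recordings, and the resulting speech estimates serve as pseudo-labels for training far-field enhancement models on the paired real far-field mixtures, with the aim of better generalization to real recordings.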

Contribution

The paper proposes a method for training far-field speech enhancement models directly on real-recorded mixtures, using pseudo-labels obtained by enhancing the paired close-talk recordings.

Findings

01 ctPuLSE produces high-quality pseudo-labels.

02 Models trained with ctPuLSE generalize well to real data.

03 The approach improves over purely supervised training on simulated data.

Abstract

The current dominant approach for neural speech enhancement is via purely-supervised deep learning on simulated pairs of far-field noisy-reverberant speech (i.e., mixtures) and clean speech. The trained models, however, often exhibit limited generalizability to real-recorded mixtures. To deal with this, this paper investigates training enhancement models directly on real mixtures. However, a major difficulty challenging this approach is that, since the clean speech of real mixtures is unavailable, there lacks a good supervision for real mixtures. In this context, assuming that a training set consisting of real-recorded pairs of close-talk and far-field mixtures is available, we propose to address this difficulty via close-talk speech enhancement, where an enhancement model is first trained on simulated mixtures to enhance real-recorded close-talk mixtures and the estimated close-talk…
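The two-stage idea summarized above (enhance the close-talk recording, then use the estimate as the training target for the paired far-field recording) can be sketched in a few lines. This is only an illustrative sketch, not the paper's implementation: `toy_enhancer` is a hypothetical moving-average stand-in for an enhancement model pre-trained on simulated mixtures, `make_pseudo_labels`, `build_training_pairs`, and `l1_loss` are names invented here, signals are plain lists of samples, and the close-talk and far-field recordings are assumed to be already time-aligned.

```python
from typing import Callable, List, Sequence, Tuple

Signal = List[float]


def toy_enhancer(mixture: Sequence[float], kernel: int = 5) -> Signal:
    """Hypothetical stand-in for a model pre-trained on simulated mixtures.

    It only applies a moving average; a real system would use a neural
    enhancement model here.
    """
    half = kernel // 2
    out = []
    for i in range(len(mixture)):
        lo, hi = max(0, i - half), min(len(mixture), i + half + 1)
        out.append(sum(mixture[lo:hi]) / (hi - lo))
    return out


def make_pseudo_labels(
    close_talk_mixtures: Sequence[Sequence[float]],
    enhancer: Callable[[Sequence[float]], Signal],
) -> List[Signal]:
    """Stage 1: enhance each real close-talk mixture to get a pseudo-label."""
    return [enhancer(x) for x in close_talk_mixtures]


def build_training_pairs(
    far_field_mixtures: Sequence[Sequence[float]],
    pseudo_labels: Sequence[Signal],
) -> List[Tuple[Signal, Signal]]:
    """Stage 2 input: pair each far-field mixture with its pseudo-label.

    Assumes the two recordings are already time-aligned (an assumption of
    this sketch); both are trimmed to the shorter length.
    """
    if len(far_field_mixtures) != len(pseudo_labels):
        raise ValueError("need one pseudo-label per far-field mixture")
    pairs = []
    for far, label in zip(far_field_mixtures, pseudo_labels):
        n = min(len(far), len(label))
        pairs.append((list(far[:n]), list(label[:n])))
    return pairs


def l1_loss(estimate: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error between a model output and its pseudo-label."""
    return sum(abs(e - t) for e, t in zip(estimate, target)) / len(target)


if __name__ == "__main__":
    close_talk = [[0.0, 1.0, 0.0, 1.0, 0.0, 1.0]]
    far_field = [[0.2, 0.6, 0.1, 0.7, 0.3, 0.5, 0.4]]
    labels = make_pseudo_labels(close_talk, toy_enhancer)
    pairs = build_training_pairs(far_field, labels)
    far, label = pairs[0]
    # A far-field model would be trained to minimise this loss on its output;
    # here the unprocessed far-field signal is scored against the pseudo-label.
    print(l1_loss(far, label))
```

In this sketch the far-field model never sees clean speech for the real recordings; its only supervision is the enhanced close-talk signal, which is why the quality of the pseudo-labels matters for the final result.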

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis

Methods: Sparse Evolutionary Training