A Hybrid Continuity Loss to Reduce Over-Suppression for Time-domain Target Speaker Extraction
Zexu Pan, Meng Ge, Haizhou Li

TL;DR
This paper introduces a hybrid continuity loss for time-domain speaker extraction that reduces over-suppression of the target speech, improves downstream speech recognition accuracy, and maintains speech quality when the input mixture contains interfering speech and background noise.
Contribution
The paper proposes a hybrid continuity loss that combines a waveform-level loss (SI-SDR) with a multi-resolution delta spectrum loss in the frequency domain to address over-suppression in time-domain speaker extraction.
Findings
Reduces over-suppression artifacts in extracted speech.
Lowers word error rate in downstream speech recognition.
Maintains high speech quality while reducing over-suppression.
Abstract
A speaker extraction algorithm extracts the target speech from a speech mixture containing interfering speech and background noise. The extraction process sometimes over-suppresses the extracted target speech, which not only creates audible artifacts but also harms the performance of downstream automatic speech recognition algorithms. We propose a hybrid continuity loss function for time-domain speaker extraction algorithms to address the over-suppression problem. On top of the waveform-level loss used for superior signal quality, i.e., SI-SDR, we introduce a multi-resolution delta spectrum loss in the frequency domain to ensure the continuity of the extracted speech signal, thus alleviating over-suppression. We examine the hybrid continuity loss function using a time-domain audio-visual speaker extraction algorithm on the YouTube LRS2-BBC dataset. Experimental results…
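The abstract names the two ingredients of the loss (SI-SDR on the waveform, plus a multi-resolution delta spectrum term) but this excerpt does not give the formulas. The following NumPy sketch illustrates one plausible reading and is not the authors' implementation. The SI-SDR definition is the standard one; everything about the delta term is an assumption made for illustration: taking "delta" to mean the frame-to-frame difference of the magnitude spectrogram, comparing deltas with an L1 distance, the three STFT resolutions, and the `weight` parameter. All function names are hypothetical.

```python
import numpy as np


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB (higher is better)."""
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to obtain the scaled reference.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = scale * target
    noise = estimate - projection
    return 10.0 * np.log10((np.sum(projection**2) + eps) / (np.sum(noise**2) + eps))


def magnitude_stft(signal, n_fft, hop):
    """Hann-windowed magnitude spectrogram of shape (frames, n_fft // 2 + 1).

    Assumes len(signal) >= n_fft; trailing samples that do not fill a frame
    are dropped.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))


def delta_spectrum_loss(estimate, target, n_fft, hop):
    """Assumed form: L1 distance between frame-to-frame spectral differences."""
    delta_est = np.diff(magnitude_stft(estimate, n_fft, hop), axis=0)
    delta_tgt = np.diff(magnitude_stft(target, n_fft, hop), axis=0)
    return np.mean(np.abs(delta_est - delta_tgt))


def hybrid_continuity_loss(
    estimate,
    target,
    resolutions=((256, 64), (512, 128), (1024, 256)),  # assumed (n_fft, hop) pairs
    weight=1.0,  # assumed balance between the two terms
):
    """Negative SI-SDR plus the delta spectrum loss averaged over resolutions."""
    multi_res = np.mean(
        [delta_spectrum_loss(estimate, target, n_fft, hop) for n_fft, hop in resolutions]
    )
    return -si_sdr(estimate, target) + weight * multi_res
```

In this reading, the SI-SDR term rewards overall waveform fidelity, while the delta term penalizes estimates whose spectra change from frame to frame differently than the target's, which is one way a dropped-out (over-suppressed) segment could show up. A training version would use a differentiable framework rather than NumPy.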
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
