Strategies to Improve Robustness of Target Speech Extraction to Enrollment Variations
Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Takafumi Moriya, Naoki Makishima, Mana Ihori, Tomohiro Tanaka, Ryo Masumura

TL;DR
This paper introduces methods to enhance the robustness of target speech extraction systems against enrollment variations by focusing on worst-case performance and employing auxiliary speaker identification loss.
Contribution
It proposes a new evaluation metric for robustness (worst-enrollment SDR), introduces a training scheme that optimizes worst-case performance, and investigates an auxiliary speaker identification loss (SI-loss) to improve speaker discriminability.
Findings
Worst-enrollment SDR effectively measures robustness.
Training with difficult enrollments improves performance.
SI-loss enhances robustness by increasing speaker discriminability.
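The two training ideas above can be illustrated with a minimal sketch. It is not the authors' implementation: the rule for choosing a "difficult" enrollment (taking the candidate with the highest extraction loss), the weight `si_weight`, and all function names are assumptions made for illustration only.

```python
import math


def cross_entropy(logits, target_index):
    """Softmax cross-entropy for one example, computed stably via log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target_index]


def hardest_enrollment(losses_per_enrollment):
    """Index of the enrollment with the highest extraction loss.

    Assumption: "difficult" means "largest loss under the current model".
    """
    return max(range(len(losses_per_enrollment)),
               key=lambda i: losses_per_enrollment[i])


def combined_loss(extraction_loss, speaker_logits, speaker_id, si_weight=0.1):
    """Extraction loss plus a weighted auxiliary speaker identification loss.

    si_weight is a hypothetical hyperparameter, not a value from the paper.
    """
    return extraction_loss + si_weight * cross_entropy(speaker_logits, speaker_id)


# Hypothetical per-enrollment extraction losses for one mixture.
losses = [0.8, 1.7, 1.1]
idx = hardest_enrollment(losses)  # picks the enrollment the model handles worst
total = combined_loss(losses[idx], speaker_logits=[2.0, 0.5, -1.0], speaker_id=0)
```

In this sketch the training step would use the hardest enrollment for the update, and the auxiliary term pushes the speaker representation to separate speaker identities.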
Abstract
Target speech extraction is a technique to extract the target speaker's voice from mixture signals using a pre-recorded enrollment utterance that characterizes the voice characteristics of the target speaker. One major difficulty of target speech extraction lies in handling variability in "intra-speaker" characteristics, i.e., characteristics mismatch between target speech and an enrollment utterance. While most conventional approaches focus on improving average performance given a set of enrollment utterances, here we propose to guarantee the worst performance, which we believe is of great practical importance. In this work, we propose an evaluation metric called worst-enrollment source-to-distortion ratio (SDR) to quantitatively measure the robustness towards enrollment variations. We also introduce a novel training scheme that aims at directly optimizing the worst-case…
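To make the contrast between average and worst-case evaluation concrete, here is a small sketch. It assumes the worst-enrollment SDR of a mixture is the minimum SDR obtained over the available enrollment utterances of the target speaker; the plain SDR definition used here (no scaling or filtering of the estimate) and the function names are simplifications for illustration, not the paper's exact protocol.

```python
import math


def sdr_db(reference, estimate, eps=1e-8):
    """Plain signal-to-distortion ratio in dB between two equal-length signals."""
    signal_power = sum(r * r for r in reference)
    noise_power = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10((signal_power + eps) / (noise_power + eps))


def average_enrollment_sdr(reference, estimates_per_enrollment):
    """Mean SDR over the estimates produced with each enrollment utterance."""
    scores = [sdr_db(reference, est) for est in estimates_per_enrollment]
    return sum(scores) / len(scores)


def worst_enrollment_sdr(reference, estimates_per_enrollment):
    """Minimum SDR over enrollments (assumed definition of the metric)."""
    return min(sdr_db(reference, est) for est in estimates_per_enrollment)


# Toy example: one mixture, two enrollments giving different extraction quality.
reference = [1.0, -1.0, 1.0, -1.0]
estimates = [
    [0.9 * r for r in reference],  # good enrollment
    [0.5 * r for r in reference],  # poorly matched enrollment
]
print(average_enrollment_sdr(reference, estimates))
print(worst_enrollment_sdr(reference, estimates))
```

The average hides the poorly matched enrollment, while the minimum exposes it, which is why a worst-case metric is better suited to measuring robustness to enrollment variations.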
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
