Strategies to Improve Robustness of Target Speech Extraction to Enrollment Variations

Hiroshi Sato; Tsubasa Ochiai; Marc Delcroix; Keisuke Kinoshita; Takafumi Moriya; Naoki Makishima; Mana Ihori; Tomohiro Tanaka; Ryo Masumura

arXiv:2206.08174·eess.AS·June 17, 2022

TL;DR

This paper introduces methods that make target speech extraction more robust to enrollment variations by optimizing worst-case performance and by adding an auxiliary speaker identification loss.

Contribution

The paper proposes a new evaluation metric for robustness (worst-enrollment SDR) and a training scheme that optimizes worst-case performance, and it investigates the use of an auxiliary speaker identification loss (SI-loss) to improve speaker discriminability.

Findings

01 Worst-enrollment SDR effectively measures robustness.

02 Training with difficult enrollments improves performance.

03 SI-loss enhances robustness by increasing speaker discriminability.

Abstract

Target speech extraction is a technique for extracting the target speaker's voice from mixture signals using a pre-recorded enrollment utterance that characterizes the voice of the target speaker. One major difficulty of target speech extraction lies in handling variability in "intra-speaker" characteristics, i.e., a mismatch in characteristics between the target speech and the enrollment utterance. While most conventional approaches focus on improving average performance given a set of enrollment utterances, here we propose to guarantee the worst performance, which we believe is of great practical importance. In this work, we propose an evaluation metric called worst-enrollment source-to-distortion ratio (SDR) to quantitatively measure the robustness towards enrollment variations. We also introduce a novel training scheme that aims at directly optimizing the worst-case…
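To make the idea behind the worst-enrollment SDR concrete, the following is a minimal sketch and not the authors' implementation. It assumes that one mixture has been extracted several times, once per enrollment utterance of the same target speaker, and that SDR is computed with the simple energy-ratio form 10·log10(‖s‖² / ‖s − ŝ‖²); the paper may use a different SDR variant. The function names `sdr_db` and `enrollment_sdr_summary` and the enrollment labels are hypothetical.

```python
import math


def sdr_db(reference, estimate):
    """Energy-ratio SDR in dB (an assumed, simplified definition)."""
    signal_energy = sum(s * s for s in reference)
    error_energy = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    if signal_energy == 0.0:
        raise ValueError("reference signal has zero energy")
    if error_energy == 0.0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(signal_energy / error_energy)


def enrollment_sdr_summary(reference, estimates_by_enrollment):
    """Score one mixture under every enrollment and report worst and average.

    estimates_by_enrollment maps a (hypothetical) enrollment label to the
    signal extracted when that enrollment utterance was used.
    """
    scores = {
        name: sdr_db(reference, estimate)
        for name, estimate in estimates_by_enrollment.items()
    }
    worst_name = min(scores, key=scores.get)
    return {
        "scores": scores,
        "worst_enrollment": worst_name,
        "worst_sdr": scores[worst_name],  # the worst-case figure
        "average_sdr": sum(scores.values()) / len(scores),
    }


# Toy example: the same target signal extracted with two enrollments.
reference = [1.0, -1.0, 1.0, -1.0]
estimates = {
    "enroll_a": [0.9, -0.9, 0.9, -0.9],  # close to the reference
    "enroll_b": [0.5, -0.5, 0.5, -0.5],  # a harder enrollment
}
summary = enrollment_sdr_summary(reference, estimates)
print(summary["worst_enrollment"], summary["worst_sdr"], summary["average_sdr"])
```

In this toy case the average hides the weaker result from `enroll_b`, while the minimum exposes it, which is the motivation the abstract gives for reporting worst-case rather than average performance.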

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing