Human-in-the-loop Speaker Adaptation for DNN-based Multi-speaker TTS

Kenta Udagawa; Yuki Saito; Hiroshi Saruwatari

arXiv:2206.10256·cs.SD·June 22, 2022

Open Access

TL;DR

This paper introduces a human-in-the-loop approach to speaker adaptation for multi-speaker TTS. A user interactively explores the speaker-embedding space to estimate the target speaker's embedding, so no reference speech is required.

Contribution

The paper presents an interactive optimization framework in which a user finds a speaker embedding through sequential line searches, removing the need for reference speech.

Findings

01. Achieves comparable performance to traditional methods in objective evaluations.

02. Enables speaker adaptation without reference speech.

03. Uses a user-guided exploration method for embedding estimation.

Abstract

This paper proposes a human-in-the-loop speaker-adaptation method for multi-speaker text-to-speech. With a conventional speaker-adaptation method, a target speaker's embedding vector is extracted from his/her reference speech using a speaker encoder trained on a speaker-discriminative task. However, this method cannot obtain an embedding vector for the target speaker when the reference speech is unavailable. Our method is based on a human-in-the-loop optimization framework, which incorporates a user to explore the speaker-embedding space to find the target speaker's embedding. The proposed method uses a sequential line search algorithm that repeatedly asks a user to select a point on a line segment in the embedding space. To efficiently choose the best speech sample from multiple stimuli, we also developed a system in which a user can switch between multiple speakers' voices for each…
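
The sequential line search described in the abstract can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: the human listener is replaced by a hypothetical `simulated_user_choice` function that picks the candidate closest to a hidden target vector, each line segment is drawn in a random direction rather than chosen by an optimizer, and no TTS model is involved. The names `sequential_line_search`, `n_points`, and `half_length` are illustrative only.

```python
import numpy as np


def simulated_user_choice(candidates, target):
    """Stand-in for the human listener (assumption): return the index of
    the candidate embedding closest to a hidden target embedding."""
    distances = np.linalg.norm(candidates - target, axis=1)
    return int(np.argmin(distances))


def sequential_line_search(dim, choose, n_iters=50, n_points=11,
                           half_length=1.0, seed=0):
    """Repeatedly ask `choose` to pick a point on a line segment.

    Each segment passes through the current best point. Its direction is
    random here, which is a simplification made for this sketch.
    """
    rng = np.random.default_rng(seed)
    current = np.zeros(dim)  # start at the origin of the embedding space
    for _ in range(n_iters):
        direction = rng.standard_normal(dim)
        direction /= np.linalg.norm(direction)
        # Slider positions along the segment; t = 0 is the current point.
        ts = np.linspace(-half_length, half_length, n_points)
        candidates = current + ts[:, None] * direction
        current = candidates[choose(candidates)]
    return current


# Usage: search an 8-dimensional space for a hidden target embedding.
target = np.random.default_rng(1).standard_normal(8)
estimate = sequential_line_search(8, lambda c: simulated_user_choice(c, target))
print("initial distance:", np.linalg.norm(target))
print("final distance:  ", np.linalg.norm(target - estimate))
```

In an interactive system, each candidate would instead be synthesized as speech and the user would choose by listening, so the number of iterations and the number of points per segment trade search accuracy against listener effort.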

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Speech and dialogue systems