Towards zero-shot Text-based voice editing using acoustic context conditioning, utterance embeddings, and reference encoders

Jason Fong; Yun Wang; Prabhav Agrawal; Vimal Manohar; Jilong Wu; Thilo Köhler; Qing He

arXiv:2210.16045·cs.SD·October 31, 2022

Open Access

TL;DR

This paper presents a zero-shot text-based voice editing method that leverages acoustic context conditioning, utterance embeddings, and reference encoders to improve speaker identity and prosody consistency without model finetuning.

Contribution

It introduces a zero-shot voice editing approach that combines pretrained speaker verification embeddings with a jointly trained reference encoder, removing the need for costly finetuning on target-speaker data.

Findings

01 Utterance embeddings and reference encoders improve speaker identity continuity.

02 Subjective listening tests show improved prosody matching in zero-shot editing.

03 The method avoids finetuning, reducing computational cost and data privacy concerns.

Abstract

Text-based voice editing (TBVE) uses synthetic output from text-to-speech (TTS) systems to replace words in an original recording. Recent work has used neural models to produce edited speech that is similar to the original speech in terms of clarity, speaker identity, and prosody. However, one limitation of prior work is the usage of finetuning to optimise performance: this requires further model training on data from the target speaker, which is a costly process that may incorporate potentially sensitive data into server-side models. In contrast, this work focuses on the zero-shot approach which avoids finetuning altogether, and instead uses pretrained speaker verification embeddings together with a jointly trained reference encoder to encode utterance-level information that helps capture aspects such as speaker identity and prosody. Subjective listening tests find that both utterance…
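
As a rough illustration of the utterance-level conditioning the abstract describes, the sketch below pools a reference recording into one vector and attaches it, together with a speaker embedding, to every phone encoding. It is not the authors' architecture: the mean-pooling "reference encoder", the concatenation scheme, and the names `reference_encode` and `condition` are all assumptions made for this example, and a real system would use learned neural encoders.

```python
from typing import List

Vector = List[float]


def reference_encode(mel_frames: List[Vector]) -> Vector:
    """Toy stand-in for a reference encoder (assumption): mean-pool frames over time."""
    if not mel_frames:
        raise ValueError("need at least one frame")
    n_frames = len(mel_frames)
    dim = len(mel_frames[0])
    return [sum(frame[d] for frame in mel_frames) / n_frames for d in range(dim)]


def condition(
    phone_encodings: List[Vector],
    speaker_embedding: Vector,
    reference_embedding: Vector,
) -> List[Vector]:
    """Append the same utterance-level vector to every phone encoding.

    The utterance-level vector is the speaker embedding (pretrained in the
    paper's setting, hypothetical values here) concatenated with the
    reference embedding.
    """
    utterance_vector = speaker_embedding + reference_embedding
    return [encoding + utterance_vector for encoding in phone_encodings]


# Hypothetical inputs: two 2-dimensional "mel" frames, three 1-dimensional phone encodings.
reference = reference_encode([[1.0, 3.0], [3.0, 5.0]])
conditioned = condition([[0.1], [0.2], [0.3]], [9.0], reference)
```

Because the utterance-level vector is computed from the recording at inference time, no model weights change per speaker, which is the sense in which such conditioning is zero-shot.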

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing