Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration

Chuanxin Tang; Chong Luo; Zhiyuan Zhao; Dacheng Yin; Yucheng Zhao; Wenjun Zeng

arXiv:2109.05426·cs.SD·September 14, 2021

Open Access · 1 Repo

TL;DR

This paper introduces a one-stage, context-aware framework for text-based speech insertion: it generates natural speech that fits into an existing audio narration without any training data from the target speaker, and it outperforms recent zero-shot TTS engines in subjective tests.

Contribution

The proposed approach eliminates the need for target speaker training data by using a context-aware, transformer-based model with zero-shot duration prediction for seamless speech editing.

Findings

01. Achieves high-quality speech synthesis without target speaker data

02. Outperforms recent zero-shot TTS engines in subjective tests

03. Provides accurate zero-shot duration prediction for inserted text

Abstract

Given a piece of speech and its transcript text, text-based speech editing aims to generate speech that can be seamlessly inserted into the given speech by editing the transcript. Existing methods adopt a two-stage approach: synthesize the input text using a generic text-to-speech (TTS) engine and then transform the voice to the desired voice using voice conversion (VC). A major problem of this framework is that VC is a challenging problem which usually needs a moderate amount of parallel training data to work satisfactorily. In this paper, we propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the target speaker. In particular, we manage to perform accurate zero-shot duration prediction for the inserted text. The predicted duration is used to regulate both text embedding and speech embedding. Then, based on the aligned…
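The duration-regulation step mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes that durations are integer frame counts per phoneme, that text embeddings are expanded by repetition (as in common length-regulator designs), and that the speech side receives zero-valued placeholder frames for the inserted region. The names `regulate_text` and `regulate_speech` are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def regulate_text(phoneme_embeddings: Sequence[Vector],
                  durations: Sequence[int]) -> List[Vector]:
    """Expand phoneme-level embeddings to frame level.

    Each embedding is repeated durations[i] times, so the output has
    sum(durations) frames. (Assumed scheme, not confirmed by the paper.)
    """
    if len(phoneme_embeddings) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    frames: List[Vector] = []
    for embedding, duration in zip(phoneme_embeddings, durations):
        if duration < 0:
            raise ValueError("durations must be non-negative")
        # Copy the vector so frames do not alias each other.
        frames.extend(list(embedding) for _ in range(duration))
    return frames


def regulate_speech(context_frames: Sequence[Vector],
                    insert_at: int,
                    inserted_duration: int,
                    dim: int) -> List[Vector]:
    """Open a gap of placeholder frames in the context speech.

    inserted_duration zero vectors of size dim are placed at frame index
    insert_at, so the speech sequence lines up with the regulated text.
    (Zero placeholders are an assumption made for illustration.)
    """
    placeholder = [[0.0] * dim for _ in range(inserted_duration)]
    return (list(context_frames[:insert_at])
            + placeholder
            + list(context_frames[insert_at:]))


if __name__ == "__main__":
    # Three inserted phonemes with predicted durations of 2, 1, and 3 frames.
    text_frames = regulate_text([[0.1], [0.2], [0.3]], [2, 1, 3])
    speech_frames = regulate_speech([[1.0], [2.0]], insert_at=1,
                                    inserted_duration=len(text_frames), dim=1)
    print(len(text_frames), len(speech_frames))
```

Under these assumptions, both sequences share one frame axis after regulation, which is what lets a downstream model attend over aligned text and speech frames.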

Code & Models

Repositories

rishikksh20/Zero-Shot-TTS
pytorch

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing