CampNet: Context-Aware Mask Prediction for End-to-End Text-Based Speech Editing

Tao Wang; Jiangyan Yi; Ruibo Fu; Jianhua Tao; Zhengqi Wen

arXiv:2202.09950·cs.SD·March 23, 2022

TL;DR

CampNet introduces a context-aware mask prediction network for natural and flexible text-based speech editing, capable of handling unseen words and various editing operations with improved naturalness over existing methods.

Contribution

The paper presents a novel end-to-end speech editing model that predicts masked speech regions based on context, supporting deletion, insertion, and replacement operations with few-shot speaker adaptation.

Findings

01. CampNet outperforms TTS, manual editing, and VoCo in speech naturalness.

02. The model effectively handles unseen words during synthesis.

03. Speaker adaptation with one sentence enhances naturalness.

Abstract

A text-based speech editor lets users edit speech through intuitive cut, copy, and paste operations, which speeds up the editing process. However, the major drawback of current systems is that the edited speech often sounds unnatural because of the cut-copy-paste operations. In addition, it is not obvious how to synthesize speech for a new word that does not appear in the transcript. This paper proposes a novel end-to-end text-based speech editing method called the context-aware mask prediction network (CampNet). The model simulates the text-based speech editing process by randomly masking part of the speech and then predicting the masked region from the surrounding speech context. This addresses unnatural prosody in the edited region and lets the model synthesize speech for words unseen in the transcript. Secondly, for the possible operation of text-based speech editing, we design…
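The random-masking idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `mask_ratio` default, the zero mask value, and the L1 loss restricted to the masked span are all assumptions chosen for illustration, and acoustic frames are represented as plain lists of floats rather than real mel-spectrograms.

```python
import random


def mask_random_span(frames, mask_ratio=0.15, mask_value=0.0, rng=None):
    """Mask one contiguous span of acoustic frames (hypothetical helper).

    frames: non-empty list of per-frame feature vectors (lists of floats).
    Returns (masked_frames, start, end); frames[start:end] were replaced.
    """
    rng = rng or random.Random()
    n = len(frames)
    # Span length is an assumed fraction of the utterance, clamped to [1, n].
    span = min(n, max(1, int(n * mask_ratio)))
    start = rng.randint(0, n - span)  # randint is inclusive on both ends
    end = start + span
    masked = [list(f) for f in frames]  # copy so the input stays untouched
    for t in range(start, end):
        masked[t] = [mask_value] * len(frames[t])
    return masked, start, end


def masked_l1_loss(predicted, target, start, end):
    """Mean absolute error over the masked frames only (assumed objective)."""
    total, count = 0.0, 0
    for t in range(start, end):
        for p, y in zip(predicted[t], target[t]):
            total += abs(p - y)
            count += 1
    return total / count


# Toy utterance: 20 frames of 4 features, where frame t holds the value t.
frames = [[float(t)] * 4 for t in range(20)]
masked, start, end = mask_random_span(frames, mask_ratio=0.25,
                                      rng=random.Random(0))
# A real model would predict frames[start:end] from the text and the
# unmasked context; here the masked input stands in for an untrained guess.
loss = masked_l1_loss(masked, frames, start, end)
```

During training, a model in this setup would see the text together with `masked` and be scored only on how well it reconstructs the hidden span, which mirrors what an editor has to do at inference time when a region of speech is replaced.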

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems
