CampNet: Context-Aware Mask Prediction for End-to-End Text-Based Speech Editing
Tao Wang, Jiangyan Yi, Ruibo Fu, Jianhua Tao, Zhengqi Wen

TL;DR
CampNet is a context-aware mask prediction network for natural, flexible text-based speech editing. It handles unseen words and several editing operations, and improves naturalness over existing methods.
Contribution
The paper presents a novel end-to-end speech editing model that predicts masked speech regions based on context, supporting deletion, insertion, and replacement operations with few-shot speaker adaptation.
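One way to picture how the three operations could share a single mask-prediction model is to express each edit as "cut frames out, leave masked placeholder frames in". The sketch below is only an illustration under stated assumptions, not the authors' implementation: the function name `prepare_edit`, the zero mask value, and the idea that the caller already knows the frame boundaries and the length of the new region (`new_len`, for example from some duration estimate) are all hypothetical.

```python
import numpy as np

def prepare_edit(mel, start, end, new_len, mask_value=0.0):
    """Replace frames [start, end) of a (frames, bins) array with
    `new_len` masked placeholder frames for a model to fill in.

    Hypothetical mapping of the three operations:
      deletion    -> new_len == 0 (nothing left to predict)
      insertion   -> start == end (no original frames removed)
      replacement -> start < end and new_len > 0
    """
    if not (0 <= start <= end <= mel.shape[0]) or new_len < 0:
        raise ValueError("invalid edit region or length")
    # Placeholder frames that a mask-prediction model would fill in.
    placeholder = np.full((new_len, mel.shape[1]), mask_value, dtype=mel.dtype)
    edited = np.concatenate([mel[:start], placeholder, mel[end:]], axis=0)
    # Boolean frame mask marking the region to be predicted.
    to_predict = np.zeros(edited.shape[0], dtype=bool)
    to_predict[start:start + new_len] = True
    return edited, to_predict
```

With this framing, the unedited frames on both sides of the placeholder stay untouched and serve as the context from which the masked region is predicted.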
Findings
CampNet outperforms TTS, manual editing, and VoCo in speech naturalness.
The model effectively handles unseen words during synthesis.
Speaker adaptation with one sentence enhances naturalness.
Abstract
A text-based speech editor allows speech to be edited through intuitive cutting, copying, and pasting operations, which speeds up the editing process. However, the major drawback of current systems is that edited speech often sounds unnatural because of the cut-copy-paste operations. In addition, it is not obvious how to synthesize speech for a new word that does not appear in the transcript. This paper proposes a novel end-to-end text-based speech editing method called the context-aware mask prediction network (CampNet). The model simulates the text-based speech editing process by randomly masking part of the speech and then predicting the masked region from the speech context. It can resolve unnatural prosody in the edited region and synthesize speech for words unseen in the transcript. Secondly, for the possible operation of text-based speech editing, we design…
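The mask-then-predict training idea described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's code: the helper names, the single contiguous span, the 15% mask ratio, the zero mask value, and the L1 reconstruction loss restricted to masked frames are all assumptions chosen for simplicity.

```python
import numpy as np

def mask_random_span(mel, mask_ratio=0.15, mask_value=0.0, rng=None):
    """Mask one random contiguous span of frames in a (frames, bins) array.

    Returns the masked copy and a boolean frame mask. The ratio and the
    mask value are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_frames = mel.shape[0]
    span = max(1, int(round(n_frames * mask_ratio)))
    start = int(rng.integers(0, n_frames - span + 1))
    is_masked = np.zeros(n_frames, dtype=bool)
    is_masked[start:start + span] = True
    masked = mel.copy()
    masked[is_masked] = mask_value  # hide the span from the model
    return masked, is_masked

def masked_l1_loss(pred, target, is_masked):
    """Mean absolute error computed over the masked frames only."""
    return float(np.abs(pred[is_masked] - target[is_masked]).mean())
```

During training, a model would receive the masked spectrogram together with the text and be scored only on the hidden frames, which mirrors the editing situation where the surrounding speech is known and the edited region is not.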
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems
