EditSpeech: A Text Based Speech Editing System Using Partial Inference and Bidirectional Fusion

Daxin Tan; Liqun Deng; Yu Ting Yeung; Xin Jiang; Xiao Chen; Tan Lee

arXiv:2107.01554·eess.AS·October 11, 2021·1 cites

TL;DR

EditSpeech is a neural speech editing system that enables seamless word deletion, insertion, and replacement in speech utterances with minimal quality degradation, using partial inference and bidirectional fusion to maintain naturalness.

Contribution

The paper introduces partial inference and bidirectional fusion techniques to improve speech editing quality in neural TTS systems, ensuring smooth transitions and minimal distortion.

Findings

01. Outperforms baseline systems in spectral distortion metrics

02. Achieves higher subjective speech quality in evaluations

03. Effective in multi-speaker English and Chinese scenarios

Abstract

This paper presents the design, implementation and evaluation of a speech editing system, named EditSpeech, which allows a user to perform deletion, insertion and replacement of words in a given speech utterance, without causing audible degradation in speech quality and naturalness. The EditSpeech system is developed upon a neural text-to-speech (NTTS) synthesis framework. Partial inference and bidirectional fusion are proposed to effectively incorporate the contextual information related to the edited region and achieve smooth transition at both left and right boundaries. Distortion introduced to the unmodified parts of the utterance is alleviated. The EditSpeech system is developed and evaluated on English and Chinese in multi-speaker scenarios. Objective and subjective evaluations demonstrate that EditSpeech outperforms a few baseline systems in terms of low spectral distortion and…
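
The abstract does not spell out how bidirectional fusion joins the two contexts, so the following is only a minimal sketch under stated assumptions: a left-to-right decoder and a right-to-left decoder each predict the same number of spectral frames for the edited region, and the two predictions are joined at the frame where they agree most closely (smallest Euclidean distance). The function name `fuse_bidirectional`, the frame representation, and the minimum-distance criterion are illustrative choices, not details confirmed by the text above.

```python
import math
from typing import List, Sequence, Tuple

Frame = Sequence[float]  # one spectral frame, e.g. a vector of mel-band values


def fuse_bidirectional(
    forward: Sequence[Frame], backward: Sequence[Frame]
) -> Tuple[List[Frame], int]:
    """Join two time-aligned predictions of the edited region.

    Assumption (hypothetical, for illustration): `forward` was decoded
    left-to-right from the left context and `backward` right-to-left from
    the right context, and both are stored in ascending time order.
    """
    if len(forward) != len(backward) or not forward:
        raise ValueError("forward and backward must be non-empty and equally long")

    # Per-frame Euclidean distance between the two predictions.
    distances = [math.dist(f, b) for f, b in zip(forward, backward)]

    # Fusion point: the frame where the two decoders agree most closely.
    fusion_index = distances.index(min(distances))

    # Frames before the fusion point come from the forward pass (anchored to
    # the left boundary); the rest come from the backward pass (anchored to
    # the right boundary).
    fused = list(forward[:fusion_index]) + list(backward[fusion_index:])
    return fused, fusion_index


if __name__ == "__main__":
    fwd = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
    bwd = [[1.0, 1.0], [1.0, 1.0], [0.1, 0.0], [1.0, 1.0]]
    fused, idx = fuse_bidirectional(fwd, bwd)
    print(idx, fused)
```

Taking the early frames from the forward pass and the late frames from the backward pass keeps each boundary of the edited region adjacent to frames that were conditioned on that boundary's context, which is one plausible way to obtain a smooth transition on both sides.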

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing