DiffEditor: Enhancing Speech Editing with Semantic Enrichment and Acoustic Consistency

Yang Chen; Yuhang Jia; Shiwan Zhao; Ziyue Jiang; Haoran Li; Jiarong Kang; Yong Qin

arXiv:2409.12992 · cs.SD · September 23, 2024

TL;DR

DiffEditor is a speech editing model that improves intelligibility and acoustic consistency for out-of-domain (OOD) text through semantic enrichment of phoneme embeddings and a first-order smoothing loss.

Contribution

The paper introduces DiffEditor, which enhances speech editing performance in OOD scenarios through semantic enrichment and a novel smoothing loss for acoustic consistency.
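
As a rough illustration of the semantic enrichment idea, the sketch below adds each word's embedding to the embeddings of the phonemes that belong to that word. The listing does not say how the two are combined, so element-wise addition, the function name `enrich_phonemes`, and the `phoneme_to_word` alignment list are all assumptions made for this example, not the authors' implementation.

```python
from typing import List

Vector = List[float]


def enrich_phonemes(
    phoneme_emb: List[Vector],
    word_emb: List[Vector],
    phoneme_to_word: List[int],
) -> List[Vector]:
    """Add the parent word's embedding to every phoneme embedding.

    ASSUMPTION: element-wise addition is one simple way to "integrate"
    word-level semantics; the paper may use a different fusion.
    phoneme_to_word[i] is the index of the word that phoneme i belongs to.
    """
    if len(phoneme_emb) != len(phoneme_to_word):
        raise ValueError("one word index is required per phoneme")

    enriched = []
    for p_vec, w_idx in zip(phoneme_emb, phoneme_to_word):
        w_vec = word_emb[w_idx]
        if len(p_vec) != len(w_vec):
            raise ValueError("phoneme and word embeddings must share a dimension")
        enriched.append([p + w for p, w in zip(p_vec, w_vec)])
    return enriched


# Toy example: three phonemes, the first two belong to word 0, the last to word 1.
phonemes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
words = [[10.0, 10.0], [20.0, 20.0]]
enriched = enrich_phonemes(phonemes, words, [0, 0, 1])
```

In a real system the word embeddings would come from a pretrained language model and would typically be projected to the phoneme embedding dimension first; that projection is omitted here.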

Findings

01. Achieves state-of-the-art results in in-domain and OOD scenarios.

02. Improves speech intelligibility and fluency at editing boundaries.

03. Effectively maintains acoustic consistency during editing.

Abstract

As text-based speech editing becomes increasingly prevalent, the demand for unrestricted free-text editing continues to grow. However, existing speech editing techniques encounter significant challenges, particularly in maintaining intelligibility and acoustic consistency when dealing with out-of-domain (OOD) text. In this paper, we introduce DiffEditor, a novel speech editing model designed to enhance performance in OOD text scenarios through semantic enrichment and acoustic consistency. To improve the intelligibility of the edited speech, we enrich the semantic information of phoneme embeddings by integrating word embeddings extracted from a pretrained language model. Furthermore, we emphasize that interframe smoothing properties are critical for modeling acoustic consistency, and thus we propose a first-order loss function to promote smoother transitions at editing boundaries and…
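
The abstract is truncated before the first-order loss is defined, so the following is only a minimal sketch of one plausible reading: penalize the mismatch between frame-to-frame differences of the predicted and reference spectrograms. The L1 distance, the function names, and the list-of-frames representation are assumptions for illustration, not the paper's formulation.

```python
from typing import List

Frames = List[List[float]]  # T frames, each with D spectrogram bins


def first_order_diff(frames: Frames) -> Frames:
    """Return frame-to-frame differences x[t+1] - x[t] (T-1 rows)."""
    return [
        [nxt - cur for cur, nxt in zip(frame, next_frame)]
        for frame, next_frame in zip(frames, frames[1:])
    ]


def first_order_loss(pred: Frames, target: Frames) -> float:
    """Mean absolute error between the first-order differences.

    ASSUMPTION: L1 on differences is one way to encourage the predicted
    frames to change as smoothly as the reference frames do.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target need the same number of frames")

    total, count = 0.0, 0
    for d_pred, d_tgt in zip(first_order_diff(pred), first_order_diff(target)):
        for a, b in zip(d_pred, d_tgt):
            total += abs(a - b)
            count += 1
    return total / count if count else 0.0


# A constant offset leaves the differences unchanged, so the loss is zero.
offset_loss = first_order_loss([[0.0], [1.0], [2.0]], [[5.0], [6.0], [7.0]])
# A jump in the last predicted frame is penalized.
jump_loss = first_order_loss([[0.0], [1.0], [3.0]], [[0.0], [1.0], [2.0]])
```

Because the loss depends only on differences between adjacent frames, it is insensitive to a constant offset and would normally be used alongside a standard reconstruction loss rather than in place of one.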

Code & Models

Repositories

nku-hlt/diffeditor (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis