DiffEditor: Enhancing Speech Editing with Semantic Enrichment and Acoustic Consistency
Yang Chen, Yuhang Jia, Shiwan Zhao, Ziyue Jiang, Haoran Li, Jiarong Kang, Yong Qin

TL;DR
DiffEditor is a speech editing model that improves intelligibility and acoustic consistency on out-of-domain (OOD) text through semantic enrichment and a smoothing loss.
Contribution
The paper introduces DiffEditor, which enhances speech editing performance in OOD scenarios through semantic enrichment and a novel smoothing loss for acoustic consistency.
Findings
Achieves state-of-the-art results in in-domain and OOD scenarios.
Improves speech intelligibility and fluency at editing boundaries.
Effectively maintains acoustic consistency during editing.
Abstract
As text-based speech editing becomes increasingly prevalent, the demand for unrestricted free-text editing continues to grow. However, existing speech editing techniques encounter significant challenges, particularly in maintaining intelligibility and acoustic consistency when dealing with out-of-domain (OOD) text. In this paper, we introduce DiffEditor, a novel speech editing model designed to enhance performance in OOD text scenarios through semantic enrichment and acoustic consistency. To improve the intelligibility of the edited speech, we enrich the semantic information of phoneme embeddings by integrating word embeddings extracted from a pretrained language model. Furthermore, we emphasize that interframe smoothing properties are critical for modeling acoustic consistency, and thus we propose a first-order loss function to promote smoother transitions at editing boundaries and…
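To make the idea of a first-order loss concrete, here is a minimal sketch in plain Python. It is not the paper's formulation, which this summary does not give. It assumes the loss is the mean absolute error between the frame-to-frame differences of a predicted feature sequence and those of a reference sequence. The names `first_order_diff` and `first_order_loss` are hypothetical, and the frames are toy lists standing in for acoustic features such as mel-spectrogram frames.

```python
def first_order_diff(frames):
    """Return frame-to-frame differences for a T x D sequence (list of lists)."""
    return [
        [b - a for a, b in zip(prev, nxt)]
        for prev, nxt in zip(frames, frames[1:])
    ]


def first_order_loss(predicted, reference):
    """Mean absolute error between first-order differences.

    Assumed form for illustration only; the paper's exact loss may differ.
    """
    if len(predicted) != len(reference):
        raise ValueError("sequences must have the same number of frames")
    if len(predicted) < 2:
        return 0.0  # no adjacent frame pairs to compare

    pred_delta = first_order_diff(predicted)
    ref_delta = first_order_diff(reference)

    total = 0.0
    count = 0
    for p_row, r_row in zip(pred_delta, ref_delta):
        for p, r in zip(p_row, r_row):
            total += abs(p - r)
            count += 1
    return total / count if count else 0.0


# Toy example: one feature dimension, three frames.
reference = [[0.0], [1.0], [2.0]]
spiky = [[0.0], [3.0], [2.0]]     # abrupt jump at the middle frame
shifted = [[5.0], [6.0], [7.0]]   # same trajectory, constant offset

loss_spiky = first_order_loss(spiky, reference)
loss_shifted = first_order_loss(shifted, reference)
```

Under this assumed form the loss penalizes the abrupt jump in `spiky` but is zero for `shifted`, because a constant offset leaves the frame differences unchanged. A term like this would therefore normally sit alongside a reconstruction loss on the frames themselves rather than replace it.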
Taxonomy
Topics: Speech Recognition and Synthesis
