Towards Emotionally Consistent Text-Based Speech Editing: Introducing EmoCorrector and The ECD-TSE Dataset

Rui Liu; Pu Gao; Jiatian Xi; Berrak Sisman; Carlos Busso; Haizhou Li

arXiv:2505.20341·eess.AS·May 28, 2025



TL;DR

This paper introduces EmoCorrector, a post-correction method for text-based speech editing that improves emotional consistency, supported by a new dataset, ECD-TSE, and validated through experiments showing enhanced emotional expression.

Contribution

The paper presents EmoCorrector, a novel retrieval-augmented generation approach for emotional correction in TSE, and introduces the ECD-TSE dataset for benchmarking emotional consistency.

Findings

01. EmoCorrector significantly improves emotional expression in synthetic speech.

02. The ECD-TSE dataset enables comprehensive evaluation of emotional consistency.

03. Experiments show enhanced alignment of speech emotion with text edits.

Abstract

Text-based speech editing (TSE) modifies speech using only text, eliminating the need for re-recording. However, existing TSE methods mainly focus on the content accuracy and acoustic consistency of synthetic speech segments, and often overlook the emotional shifts or inconsistencies introduced by text changes. To address this issue, we propose EmoCorrector, a novel post-correction scheme for TSE. EmoCorrector leverages Retrieval-Augmented Generation (RAG) by extracting the edited text's emotional features, retrieving speech samples with matching emotions, and synthesizing speech that aligns with the desired emotion while preserving the speaker's identity and quality. To support the training and evaluation of emotional consistency modeling in TSE, we introduce the Emotion Correction Dataset for TSE (ECD-TSE), a pioneering benchmark. The prominent aspect of ECD-TSE is its inclusion of <text, speech> …
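The retrieval step described in the abstract (matching the edited text's emotional features against stored speech samples) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the three-dimensional emotion vectors, the clip names, the `retrieve_matching_speech` helper, and the use of cosine similarity are hypothetical and are not taken from the paper, which may use a different feature extractor and matching criterion.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_matching_speech(text_emotion, speech_bank, top_k=1):
    """Return the ids of the top_k clips whose emotion vectors best match the text.

    text_emotion: hypothetical emotion embedding of the edited text.
    speech_bank:  dict mapping clip id -> hypothetical emotion embedding.
    """
    scored = [
        (cosine_similarity(text_emotion, embedding), clip_id)
        for clip_id, embedding in speech_bank.items()
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [clip_id for _, clip_id in scored[:top_k]]


# Toy data (assumed): axes loosely stand for happy / sad / angry.
speech_bank = {
    "clip_happy": [0.9, 0.1, 0.0],
    "clip_sad": [0.1, 0.9, 0.0],
    "clip_angry": [0.0, 0.2, 0.9],
}
edited_text_emotion = [0.8, 0.2, 0.1]

best = retrieve_matching_speech(edited_text_emotion, speech_bank)
print(best)
```

In a full pipeline of the kind the abstract outlines, the retrieved clip would then serve as an emotional reference for the synthesis stage; that stage is not sketched here because the listing gives no details about it.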


Datasets

Gaphy/ECD-TSE
dataset · 8 downloads


Taxonomy

Topics: Speech Recognition and Synthesis

Methods: Focus