Edit Content, Preserve Acoustics: Imperceptible Text-Based Speech Editing via Self-Consistency Rewards

Yong Ren; Jiangyan Yi; Jianhua Tao; Zhengqi Wen; Tao Wang

arXiv:2602.00560·cs.SD·February 3, 2026

TL;DR

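This paper introduces a text-based speech editing framework that aims for imperceptible edits: content is edited in a semantic space, acoustics are reconstructed by a Flow Matching decoder, and the output is aligned with self-consistency rewards. It is reported to outperform state-of-the-art baselines in intelligibility and perceptual quality.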

Contribution

The proposed approach pairs editing in a stable semantic space with acoustic reconstruction by a Flow Matching decoder, and adds Self-Consistency Rewards Group Relative Policy Optimization for perceptual alignment.

Findings

01 Outperforms state-of-the-art baselines in intelligibility and perceptual quality

02 Achieves robust and seamless speech content modifications

03 Effectively preserves acoustics while editing semantics

Abstract

Imperceptible text-based speech editing allows users to modify spoken content by altering the transcript. It demands that modified segments fuse seamlessly with the surrounding context. Prevalent methods operating in the acoustic space suffer from inherent content-style entanglement, leading to generation instability and boundary artifacts. In this paper, we propose a novel framework grounded in the principle of "Edit Content, Preserve Acoustics". Our approach relies on two core components: (1) Structural Foundations, which decouples editing into a stable semantic space while delegating acoustic reconstruction to a Flow Matching decoder; and (2) Perceptual Alignment, which employs a novel Self-Consistency Rewards Group Relative Policy Optimization. By leveraging a pre-trained Text-to-Speech model as an implicit critic -- complemented by strict intelligibility and duration constraints --…
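
The abstract names Group Relative Policy Optimization with a critic score plus intelligibility and duration constraints, but the visible text does not say how these are combined. The following is a minimal sketch of the generic group-relative advantage step only, not the authors' implementation. The function names, the thresholds `max_wer` and `max_duration_dev`, the fixed `fail_reward`, and the idea of treating the constraints as a hard gate are all assumptions made for illustration.

```python
import math
from typing import List


def combined_reward(
    critic_score: float,
    wer: float,
    duration_ratio: float,
    max_wer: float = 0.1,           # hypothetical intelligibility threshold
    max_duration_dev: float = 0.2,  # hypothetical allowed deviation from ratio 1.0
    fail_reward: float = -1.0,      # hypothetical reward for rejected candidates
) -> float:
    """Gate a higher-is-better critic score by two constraints (assumed scheme)."""
    too_unintelligible = wer > max_wer
    wrong_duration = abs(duration_ratio - 1.0) > max_duration_dev
    if too_unintelligible or wrong_duration:
        return fail_reward
    return critic_score


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group of candidates sampled for the same edit.

    Each advantage is (reward - group mean) / (group std + eps), so a candidate is
    scored relative to its siblings rather than against a learned value function.
    """
    n = len(rewards)
    if n == 0:
        return []
    mean = sum(rewards) / n
    variance = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Three hypothetical candidates for one edit: (critic_score, wer, duration_ratio).
    candidates = [(0.8, 0.05, 1.0), (0.6, 0.02, 1.1), (0.9, 0.50, 1.0)]
    rewards = [combined_reward(c, w, d) for c, w, d in candidates]
    print(rewards)
    print(group_relative_advantages(rewards))
```

In this sketch the third candidate has the highest critic score but fails the intelligibility gate, so it receives the lowest advantage in its group. A soft penalty instead of a hard gate would be an equally plausible reading of "strict constraints".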

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis