AST: Adaptive, Seamless, and Training-Free Precise Speech Editing

Sihan Lv; Yechen Jin; Zhen Li; Jintao Chen; Jinshan Zhang; Ying Li; Jianwei Yin; Meng Xi

arXiv:2604.16056 · cs.SD · April 20, 2026

TL;DR

AST is a training-free speech editing framework that uses latent recomposition and adaptive guidance to achieve seamless, high-quality modifications while preserving speaker identity and temporal consistency.

Contribution

The paper introduces a training-free approach based on latent manipulation and adaptive guidance, along with a new dataset and evaluation metric for speech editing.

Findings

1. AST improves temporal consistency and speaker preservation over baselines.
2. AST reduces Word Error Rate by nearly 70% compared to previous methods.
3. AST achieves state-of-the-art results in speech editing quality and fidelity.
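
For readers unfamiliar with the metric behind the second finding, the following is a minimal sketch of how Word Error Rate and a relative reduction are commonly computed. It assumes the standard definition (word-level edit distance divided by reference length) and that the "nearly 70%" figure is a relative reduction; the summary does not state either. The function names and the example numbers are hypothetical and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` is lower than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")

# Illustrative numbers only: a WER falling from 10% to 3% is a 70% relative reduction.
reduction = relative_reduction(0.10, 0.03)
```

Under this reading, a 70% reduction describes the ratio between two error rates, not an absolute drop of 70 percentage points.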

Abstract

Text-based speech editing aims to modify specific segments while preserving speaker identity and acoustic context. Existing methods rely on task-specific training, which incurs high data costs and struggles with temporal fidelity in unedited regions. Meanwhile, adapting Text-to-Speech (TTS) models often faces a trade-off between editing quality and consistency. To address these issues, we propose AST, an Adaptive, Seamless, and Training-free precise speech editing framework. Leveraging a pre-trained autoregressive TTS model, AST introduces Latent Recomposition to selectively stitch preserved source segments with newly synthesized targets. Furthermore, AST extends this latent manipulation to enable precise style editing for specific speech segments. To prevent artifacts at these edit boundaries, the framework incorporates Adaptive Weak Fact Guidance (AWFG). AWFG dynamically modulates a…
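
The stitching idea behind Latent Recomposition, as the abstract describes it, can be sketched in a few lines. This is only an illustration under stated assumptions: latents are modeled as a plain sequence of frames, the edit region is a frame span `[start, end)`, and the newly synthesized segment is already available. The function `recompose_latents`, the frame layout, and the toy values are all hypothetical; the paper's actual procedure operates inside a pre-trained autoregressive TTS model and is not reproduced here.

```python
from typing import List, Sequence

Frame = List[float]  # one latent frame; the dimensionality here is hypothetical


def recompose_latents(source: Sequence[Frame],
                      new_segment: Sequence[Frame],
                      start: int,
                      end: int) -> List[Frame]:
    """Stitch preserved source frames around a newly synthesized segment.

    Frames in [start, end) of `source` are replaced by `new_segment`;
    frames outside that span are copied unchanged.
    """
    if not 0 <= start <= end <= len(source):
        raise ValueError("edit span must satisfy 0 <= start <= end <= len(source)")
    prefix = [list(frame) for frame in source[:start]]   # preserved left context
    middle = [list(frame) for frame in new_segment]      # synthesized target
    suffix = [list(frame) for frame in source[end:]]     # preserved right context
    return prefix + middle + suffix


# Toy example: 5 source frames, replace frames 1..2 with 3 new frames.
source = [[0.0], [1.0], [2.0], [3.0], [4.0]]
new_segment = [[9.0], [9.5], [9.9]]
edited = recompose_latents(source, new_segment, start=1, end=3)
```

The new segment may be longer or shorter than the span it replaces, so the edited sequence can change length while the frames outside the span stay identical. A plain concatenation like this does nothing about discontinuities at the two seams, which is the problem the abstract says AWFG is meant to address.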
