CosyEdit: Unlocking End-to-End Speech Editing Capability from Zero-Shot Text-to-Speech Models

Junyang Chen; Yuhang Jia; Hui Wang; Jiaming Zhou; Yaxin Han; Mengying Feng; Yong Qin

arXiv:2601.05329·cs.SD·January 12, 2026

Open Access

TL;DR

CosyEdit is an end-to-end speech editing model obtained by fine-tuning the zero-shot TTS model CosyVoice. On the RealEdit benchmark it outperforms several billion-parameter language model baselines and matches state-of-the-art cascade methods.

Contribution

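We introduce CosyEdit, an end-to-end speech editing model adapted from the zero-shot TTS model CosyVoice. It internalizes speech-text alignment, removing the reliance on explicit external temporal alignment, and is fine-tuned on only 250 hours of supervised data from the curated GigaEdit dataset.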

Findings

01. Outperforms several billion-parameter language model baselines.

02. Matches the performance of state-of-the-art cascade approaches.

03. Achieves high consistency between original and edited speech.

Abstract

Automatic speech editing aims to modify spoken content based on textual instructions, yet traditional cascade systems suffer from complex preprocessing pipelines and a reliance on explicit external temporal alignment. Addressing these limitations, we propose CosyEdit, an end-to-end speech editing model adapted from CosyVoice through task-specific fine-tuning and an optimized inference procedure, which internalizes speech-text alignment while ensuring high consistency between the speech before and after editing. By fine-tuning on only 250 hours of supervised data from our curated GigaEdit dataset, our 400M-parameter model achieves reliable speech editing performance. Experiments on the RealEdit benchmark indicate that CosyEdit not only outperforms several billion-parameter language model baselines but also matches the performance of state-of-the-art cascade approaches. These results…

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling