SSR-Speech: Towards Stable, Safe and Robust Zero-shot Text-based Speech Editing and Synthesis

Helin Wang; Meng Yu; Jiarui Hai; Chen Chen; Yuchen Hu; Rilin Chen; Najim Dehak; Dong Yu

arXiv:2409.07556·eess.AS·January 3, 2025

TL;DR

SSR-Speech is a novel neural codec model for stable, safe, and robust zero-shot speech editing and synthesis, incorporating watermarking and leveraging original speech segments for high-quality reconstruction.

Contribution

The paper introduces SSR-Speech, a Transformer-based model with classifier-free guidance and watermarking, and reports state-of-the-art results on the RealEdit speech editing task and the LibriTTS text-to-speech task.
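As a rough illustration of classifier-free guidance at the token-sampling step, the sketch below blends logits from a conditioned and an unconditioned forward pass. It is not the authors' implementation: the function names `cfg_logits` and `softmax`, the guidance scale value, and the blending convention `uncond + scale * (cond - uncond)` are assumptions chosen for illustration, and the paper may parameterize the guidance differently.

```python
import math


def cfg_logits(cond_logits, uncond_logits, scale):
    """Blend conditional and unconditional logits.

    Assumed convention (one of several in common use):
        guided = uncond + scale * (cond - uncond)
    so scale == 1.0 returns the conditional logits unchanged.
    """
    if len(cond_logits) != len(uncond_logits):
        raise ValueError("logit vectors must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy logits over a 3-token vocabulary (hypothetical values).
cond = [2.0, 1.0, 0.0]      # pass with the text/prompt condition
uncond = [1.0, 1.0, 1.0]    # pass with the condition dropped
guided = cfg_logits(cond, uncond, scale=1.5)
probs = softmax(guided)
```

With a scale above 1.0, the distribution moves further toward tokens that the condition favors, which is the usual intuition for why guidance can make autoregressive generation more stable.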

Findings

01. State-of-the-art performance in RealEdit speech editing
02. Superior quality in LibriTTS text-to-speech synthesis
03. Robustness to background noise and multi-span editing

Abstract

In this paper, we introduce SSR-Speech, a neural codec autoregressive model designed for stable, safe, and robust zero-shot text-based speech editing and text-to-speech synthesis. SSR-Speech is built on a Transformer decoder and incorporates classifier-free guidance to enhance the stability of the generation process. A watermark Encodec is proposed to embed frame-level watermarks into the edited regions of the speech so that the edited parts can be detected. In addition, the waveform reconstruction leverages the original unedited speech segments, providing superior recovery compared to the Encodec model. Our approach achieves state-of-the-art performance in the RealEdit speech editing task and the LibriTTS text-to-speech task, surpassing previous methods. Furthermore, SSR-Speech excels in multi-span speech editing and also demonstrates remarkable robustness to background sounds.…
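To make the idea of frame-level watermark detection concrete, the sketch below groups per-frame detection flags into time spans, including the multi-span case. It covers only this post-processing step, not the watermark Encodec itself. The function name `edited_spans`, the binary flag format, and the 50 frames-per-second rate are assumptions for illustration and are not taken from the paper.

```python
def edited_spans(frame_flags, frame_rate_hz=50.0):
    """Group consecutive flagged frames into (start_sec, end_sec) spans.

    frame_flags: iterable of truthy/falsy values, one per codec frame,
                 where truthy means "watermark detected in this frame".
    frame_rate_hz: assumed codec frame rate (hypothetical default).
    """
    flags = list(frame_flags)
    spans = []
    start = None  # index of the first frame of the span being built
    for i, flagged in enumerate(flags):
        if flagged and start is None:
            start = i
        elif not flagged and start is not None:
            spans.append((start / frame_rate_hz, i / frame_rate_hz))
            start = None
    if start is not None:
        # The final span runs to the end of the utterance.
        spans.append((start / frame_rate_hz, len(flags) / frame_rate_hz))
    return spans


# Hypothetical detector output for 8 frames with two edited regions.
flags = [0, 0, 1, 1, 1, 0, 1, 1]
spans = edited_spans(flags, frame_rate_hz=50.0)
```

Because the flags are per frame, the time resolution of the recovered spans is one frame, which is what lets a detector report where an utterance was edited and not only whether it was edited.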

Code & Models

Repositories

WangHelin1997/SSR-Speech (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing

Methods: Attention Is All You Need · Byte Pair Encoding · Absolute Position Encodings · Softmax · Label Smoothing · Layer Normalization · Dropout · Position-Wise Feed-Forward Layer · Residual Connection · Linear Layer