Speech Synthesis From Continuous Features Using Per-Token Latent Diffusion
Arnon Turetzky, Avihu Dekel, Nimrod Shabtay, Slava Shechtman, David Haws, Hagai Aronowitz, Ron Hoory, Yossi Adi

TL;DR
This paper introduces SALAD, a zero-shot text-to-speech model that applies a per-token diffusion process to continuous speech features, achieving superior intelligibility while matching the speech quality and speaker similarity of ground-truth audio.
Contribution
The paper proposes SALAD, a continuous feature-based diffusion model for zero-shot TTS, and provides a comprehensive comparison with discrete models and existing systems.
Findings
SALAD outperforms discrete variants in speech intelligibility
SALAD matches ground-truth speech quality and speaker similarity
Continuous modeling techniques can be more effective than discrete ones in TTS
Abstract
We present SALAD, a zero-shot TTS autoregressive model operating over continuous speech representations. SALAD utilizes a per-token diffusion process to refine and predict continuous representations for the next time step. We compare our approach against a discrete variant of SALAD as well as publicly available zero-shot TTS systems, and conduct a comprehensive analysis of discrete versus continuous modeling techniques. Our results show that SALAD achieves superior intelligibility while matching the speech quality and speaker similarity of ground-truth audio.
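To make the phrase "per-token diffusion process" concrete, the following is a minimal sketch of the general idea: an autoregressive loop in which each continuous token is drawn by running a short reverse-diffusion chain conditioned on a context vector. It is not the authors' implementation. The linear noise schedule, the standard DDPM ancestral sampling rule, and the names `make_schedule`, `toy_noise_predictor`, `sample_token`, and `generate` are all assumptions made for illustration. The noise predictor is a hand-written placeholder for a trained network, and a running mean of past tokens stands in for the autoregressive backbone.

```python
import math
import random

def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.02):
    """Linear DDPM noise schedule (assumed): returns (betas, alphas, alpha_bars)."""
    betas = [beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
             for i in range(num_steps)]
    alphas = [1.0 - b for b in betas]
    alpha_bars, running = [], 1.0
    for a in alphas:
        running *= a
        alpha_bars.append(running)
    return betas, alphas, alpha_bars

def toy_noise_predictor(x_t, t, context):
    """Hypothetical placeholder for a learned network eps_theta(x_t, t, context)."""
    return [0.1 * (x - c) for x, c in zip(x_t, context)]

def sample_token(context, eps_model, schedule, rng):
    """Draw one continuous token by DDPM ancestral sampling, conditioned on context."""
    betas, alphas, alpha_bars = schedule
    # Start from pure Gaussian noise with the same dimensionality as the context.
    x = [rng.gauss(0.0, 1.0) for _ in context]
    for t in reversed(range(len(betas))):
        eps = eps_model(x, t, context)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        # No noise is added at the final step (t == 0).
        sigma = math.sqrt(betas[t]) if t > 0 else 0.0
        x = [(xi - coef * ei) / math.sqrt(alphas[t]) + sigma * rng.gauss(0.0, 1.0)
             for xi, ei in zip(x, eps)]
    return x

def generate(num_tokens, dim, eps_model, schedule, seed=0):
    """Autoregressive loop: each new token is sampled given a summary of earlier ones."""
    rng = random.Random(seed)
    tokens = []
    context = [0.0] * dim
    for _ in range(num_tokens):
        tokens.append(sample_token(context, eps_model, schedule, rng))
        # Placeholder for the autoregressive backbone: running mean of tokens so far.
        context = [sum(col) / len(tokens) for col in zip(*tokens)]
    return tokens

if __name__ == "__main__":
    frames = generate(num_tokens=5, dim=8, eps_model=toy_noise_predictor,
                      schedule=make_schedule(num_steps=20))
    print(len(frames), len(frames[0]))
```

In a real system the context would come from a trained sequence model that also sees the text and speaker prompt, and the sampled vectors would be decoded to audio. Neither step is modeled here.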
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and dialogue systems
Methods: Diffusion · Latent Diffusion Model
