Speech Synthesis From Continuous Features Using Per-Token Latent Diffusion

Arnon Turetzky; Avihu Dekel; Nimrod Shabtay; Slava Shechtman; David Haws; Hagai Aronowitz; Ron Hoory; Yossi Adi

arXiv:2410.16048 · eess.AS · November 25, 2025

Open Access

TL;DR

This paper introduces SALAD, a zero-shot autoregressive text-to-speech model that runs a per-token diffusion process over continuous speech features, achieving superior intelligibility while matching the speech quality and speaker similarity of ground-truth audio.

Contribution

The paper proposes SALAD, an autoregressive zero-shot TTS model that applies per-token diffusion to continuous speech features, and compares it with a discrete variant of SALAD and with publicly available zero-shot TTS systems.

Findings

01  SALAD outperforms discrete variants in speech intelligibility

02  SALAD matches ground-truth speech quality and speaker similarity

03  Continuous modeling techniques can be more effective than discrete ones in TTS

Abstract

We present SALAD, a zero-shot TTS autoregressive model operating over continuous speech representations. SALAD utilizes a per-token diffusion process to refine and predict continuous representations for the next time step. We compare our approach against a discrete variant of SALAD as well as publicly available zero-shot TTS systems, and conduct a comprehensive analysis of discrete versus continuous modeling techniques. Our results show that SALAD achieves superior intelligibility while matching the speech quality and speaker similarity of ground-truth audio.
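
To make the idea of per-token diffusion inside an autoregressive loop more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes standard DDPM ancestral sampling with a linear noise schedule, and the names `toy_context` (standing in for the autoregressive backbone) and `toy_eps_predictor` (standing in for the learned denoising head) are hypothetical. To keep the example checkable, the toy noise predictor is the exact one for the degenerate case where the clean latent equals the conditioning vector `z`. A real system would use learned networks and text and speaker conditioning.

```python
import math
import random


def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.02):
    """Linear DDPM noise schedule (an assumption; the paper may differ)."""
    betas = [beta_start + (beta_end - beta_start) * i / (num_steps - 1)
             for i in range(num_steps)]
    alphas = [1.0 - b for b in betas]
    alpha_bars, running = [], 1.0
    for a in alphas:
        running *= a
        alpha_bars.append(running)
    return betas, alphas, alpha_bars


def toy_context(history):
    """Hypothetical stand-in for the autoregressive backbone: maps the
    latents seen so far to a conditioning vector z."""
    dim = len(history[0])
    return [math.tanh(sum(frame[i] for frame in history) / len(history))
            for i in range(dim)]


def toy_eps_predictor(x_t, t, z, alpha_bars):
    """Hypothetical stand-in for the learned denoising head. This is the
    exact noise predictor when the clean latent is assumed to equal z."""
    signal = math.sqrt(alpha_bars[t])
    noise = math.sqrt(1.0 - alpha_bars[t])
    return [(x - signal * zi) / noise for x, zi in zip(x_t, z)]


def sample_token(z, schedule, rng):
    """Run the reverse diffusion chain for a single token, conditioned on z."""
    betas, alphas, alpha_bars = schedule
    x = [rng.gauss(0.0, 1.0) for _ in z]  # start from pure Gaussian noise
    for t in reversed(range(len(betas))):
        eps_hat = toy_eps_predictor(x, t, z, alpha_bars)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        x = [(xi - coef * e) / math.sqrt(alphas[t])
             for xi, e in zip(x, eps_hat)]
        if t > 0:  # no noise is added on the final denoising step
            sigma = math.sqrt(betas[t])
            x = [xi + sigma * rng.gauss(0.0, 1.0) for xi in x]
    return x


def generate(prompt, num_new, schedule, rng):
    """Autoregressive loop: each new continuous latent is produced by its
    own diffusion chain, then appended to the history."""
    history = [list(frame) for frame in prompt]
    for _ in range(num_new):
        z = toy_context(history)
        history.append(sample_token(z, schedule, rng))
    return history


if __name__ == "__main__":
    rng = random.Random(0)
    prompt = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.4]]
    latents = generate(prompt, num_new=4, schedule=make_schedule(), rng=rng)
    print(len(latents), "frames of dimension", len(latents[0]))
```

The point to notice is structural: the outer loop advances one time step at a time, and the inner loop denoises only the latent for that step. In a full system, a separate decoder would turn the generated latents into audio.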

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems

Methods: Diffusion · Latent Diffusion Model