Efficient Speech Language Modeling via Energy Distance in Continuous Latent Space

Zhengrui Ma; Yang Feng; Chenze Shao; Fandong Meng; Jie Zhou; Min Zhang

arXiv:2505.13181 · cs.CL · October 27, 2025

TL;DR

SLED models speech autoregressively in a continuous latent space using an energy distance objective, which simplifies the modeling pipeline and keeps speech synthesis efficient.

Contribution

The paper presents a method for modeling continuous speech latents with an energy distance objective, avoiding the discretization errors of residual vector quantization and the hierarchical architectures used by prior speech language models.

Findings

01 Achieves strong zero-shot speech synthesis performance

02 Effective in streaming speech synthesis scenarios

03 Simplifies the speech modeling pipeline

Abstract

We introduce SLED, an alternative approach to speech language modeling by encoding speech waveforms into sequences of continuous latent representations and modeling them autoregressively using an energy distance objective. The energy distance offers an analytical measure of the distributional gap by contrasting simulated and target samples, enabling efficient training to capture the underlying continuous autoregressive distribution. By bypassing reliance on residual vector quantization, SLED avoids discretization errors and eliminates the need for the complicated hierarchical architectures common in existing speech language models. It simplifies the overall modeling pipeline while preserving the richness of speech information and maintaining inference efficiency. Empirical results demonstrate that SLED achieves strong performance in both zero-shot and streaming speech synthesis, showing…
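The abstract describes the energy distance as a measure of the distributional gap obtained by contrasting simulated and target samples. As a rough illustration of that idea only, the sketch below computes the standard sample estimate of the squared energy distance, 2·E‖X−Y‖ − E‖X−X′‖ − E‖Y−Y′‖, between two sets of vectors. The function name `energy_distance`, the array shapes, and the use of NumPy are assumptions made for this example; this is not the training objective as implemented in the paper's repository.

```python
import numpy as np


def energy_distance(x: np.ndarray, y: np.ndarray) -> float:
    """Sample estimate of the squared energy distance between two point sets.

    x: array of shape (n, d), e.g. samples simulated by a model (assumed layout).
    y: array of shape (m, d), e.g. target samples (assumed layout).
    """

    def mean_pairwise_norm(a: np.ndarray, b: np.ndarray) -> float:
        # Euclidean distance between every row of a and every row of b,
        # averaged over all (row of a, row of b) pairs.
        diff = a[:, None, :] - b[None, :, :]
        return float(np.linalg.norm(diff, axis=-1).mean())

    cross = mean_pairwise_norm(x, y)     # estimates E||X - Y||
    within_x = mean_pairwise_norm(x, x)  # estimates E||X - X'||
    within_y = mean_pairwise_norm(y, y)  # estimates E||Y - Y'||
    return 2.0 * cross - within_x - within_y


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    simulated = rng.normal(loc=0.0, size=(64, 8))
    target_same = rng.normal(loc=0.0, size=(64, 8))
    target_shifted = rng.normal(loc=3.0, size=(64, 8))

    # The estimate is small when both sets come from the same distribution
    # and grows as the two distributions move apart.
    print("same distribution:   ", energy_distance(simulated, target_same))
    print("shifted distribution:", energy_distance(simulated, target_shifted))
```

The estimate needs only samples from each side, not a density, which is what makes this kind of objective usable when the quantity being modeled is a continuous vector rather than a discrete token.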

Code & Models

Repositories

ictnlp/sled-tts (PyTorch, official)
