Llasa+: Free Lunch for Accelerated and Streaming Llama-Based Speech Synthesis

Wenjie Tian; Xinfa Zhu; Hanke Xie; Zhen Ye; Wei Xue; Lei Xie

arXiv:2508.06262·cs.SD·August 11, 2025

TL;DR

Llasa+ is a streaming TTS model built on Llasa that predicts multiple speech tokens per autoregressive step and verifies them with the frozen backbone, reaching a reported 1.48× speedup without loss of quality.

Contribution

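The paper introduces Llasa+, which adds two plug-and-play Multi-Token Prediction (MTP) modules after the frozen Llasa backbone, together with a verification algorithm that uses that backbone to validate the predicted tokens. This speeds up Llama-based TTS while maintaining quality.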

Findings

01 Achieves a 1.48× speedup in speech synthesis

02 Maintains high quality despite acceleration

03 Applicable to other LLM-based models

Abstract

Recent progress in text-to-speech (TTS) has achieved impressive naturalness and flexibility, especially with the development of large language model (LLM)-based approaches. However, existing autoregressive (AR) structures and large-scale models, such as Llasa, still face significant challenges in inference latency and streaming synthesis. To deal with the limitations, we introduce Llasa+, an accelerated and streaming TTS model built on Llasa. Specifically, to accelerate the generation process, we introduce two plug-and-play Multi-Token Prediction (MTP) modules following the frozen backbone. These modules allow the model to predict multiple tokens in one AR step. Additionally, to mitigate potential error propagation caused by inaccurate MTP, we design a novel verification algorithm that leverages the frozen backbone to validate the generated tokens, thus allowing Llasa+ to achieve…
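
The draft-and-verify idea in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the abstract is truncated before the verification algorithm is described, so the code below assumes a simple greedy rule (accept a drafted token only if it matches what the backbone would have produced, and stop at the first mismatch). `toy_backbone`, `make_toy_head`, `wrong_every`, and `VOCAB` are hypothetical stand-ins for a real model, and the sketch checks drafts within the same step for simplicity.

```python
from typing import Callable, List, Sequence, Tuple

VOCAB = 97  # toy vocabulary size (assumption, not from the paper)

Backbone = Callable[[Sequence[int]], int]
Head = Callable[[Sequence[int], Sequence[int]], int]


def toy_backbone(prefix: Sequence[int]) -> int:
    """Stand-in for the frozen backbone: greedy next token for a prefix."""
    return (31 * sum(prefix) + len(prefix)) % VOCAB


def make_toy_head(backbone: Backbone, wrong_every: int) -> Head:
    """Hypothetical MTP head that guesses one token further ahead.

    It is deliberately wrong whenever the context length is a multiple
    of `wrong_every`, to mimic inaccurate multi-token prediction.
    """
    def head(prefix: Sequence[int], draft: Sequence[int]) -> int:
        context = list(prefix) + list(draft)
        guess = backbone(context)
        if len(context) % wrong_every == 0:
            guess = (guess + 1) % VOCAB  # inject a wrong guess
        return guess
    return head


def generate_autoregressive(backbone: Backbone, prompt: Sequence[int],
                            n_new: int) -> List[int]:
    """Baseline: one token per step."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(backbone(tokens))
    return tokens


def generate_with_mtp(backbone: Backbone, heads: Sequence[Head],
                      prompt: Sequence[int], n_new: int) -> Tuple[List[int], int]:
    """Draft several tokens per step, keep only the verified prefix."""
    tokens = list(prompt)
    target = len(tokens) + n_new
    steps = 0
    while len(tokens) < target:
        steps += 1
        # The backbone's own next token is always trusted.
        draft = [backbone(tokens)]
        # Each head extends the draft by one extra token.
        for head in heads:
            draft.append(head(tokens, draft))
        # Verification: a real model would score the whole draft in one
        # batched forward pass; here the toy backbone is called per token.
        accepted = draft[:1]
        for tok in draft[1:]:
            if tok != backbone(tokens + accepted):
                break  # discard this token and everything drafted after it
            accepted.append(tok)
        tokens.extend(accepted)
    return tokens[:target], steps


prompt = [1, 2, 3]
reference = generate_autoregressive(toy_backbone, prompt, 20)
heads = [make_toy_head(toy_backbone, 4), make_toy_head(toy_backbone, 5)]
fast, steps = generate_with_mtp(toy_backbone, heads, prompt, 20)
print("identical output:", fast == reference, "| steps used:", steps)
```

Under this greedy acceptance rule every kept token equals the backbone's own choice, so the output matches plain autoregressive decoding while using fewer steps whenever some drafts are accepted. Two heads are used here to mirror the two MTP modules mentioned in the abstract; how the actual modules are trained and verified is not specified in the text above.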

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Infant Health and Development