Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis

Zhen Ye; Xinfa Zhu; Chi-Min Chan; Xinsheng Wang; Xu Tan; Jiahe Lei; Yi Peng; Haohe Liu; Yizhu Jin; Zheqi Dai; Hongzhan Lin; Jianyi Chen; Xingjian Du; Liumeng Xue; Yunlin Chen; Zhifei Li; Lei Xie; Qiuqiang Kong; Yike Guo; and Wei Xue

arXiv:2502.04128·eess.AS·February 25, 2025

TL;DR

This paper introduces Llasa, a scalable speech synthesis framework using a single Transformer and VQ codec, demonstrating improved naturalness, prosody, and expressiveness through scaled training and inference compute, with publicly available models and code.

Contribution

The work presents Llasa, a unified TTS model aligned with LLMs, and explores scaling compute at training and inference stages to enhance speech quality and expressiveness.

Findings

01. Scaling train-time compute improves speech naturalness.

02. Scaling inference-time compute enhances expressiveness and accuracy.

03. Public release of models and training code for reproducibility.

Abstract

Recent advances in text-based large language models (LLMs), particularly in the GPT series and the o1 model, have demonstrated the effectiveness of scaling both training-time and inference-time compute. However, current state-of-the-art TTS systems leveraging LLMs are often multi-stage, requiring separate models (e.g., diffusion models after LLM), complicating the decision of whether to scale a particular model during training or testing. This work makes the following contributions: First, we explore the scaling of train-time and inference-time compute for speech synthesis. Second, we propose a simple framework Llasa for speech synthesis that employs a single-layer vector quantizer (VQ) codec and a single Transformer architecture to fully align with standard LLMs such as Llama. Our experiments reveal that scaling train-time compute for Llasa consistently improves the naturalness of…
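The single-codec, single-Transformer design described in the abstract can be illustrated with a toy sketch. It assumes, purely for illustration, that each speech frame is mapped to the index of its nearest entry in one VQ codebook and that those indices are offset past the text vocabulary, so text and speech form a single token stream for ordinary next-token prediction. The codebook values, the vocabulary size, and the names `quantize`, `speech_to_token_ids`, and `build_sequence` are hypothetical and do not come from the paper or its code.

```python
# Illustrative sketch only: sizes, values, and function names are assumptions,
# not taken from the Llasa paper or the LLaSA_training repository.

TEXT_VOCAB_SIZE = 8  # hypothetical text vocabulary size

# Hypothetical single-layer VQ codebook: 4 entries, each a 2-dimensional vector.
CODEBOOK = [
    (0.0, 0.0),
    (1.0, 0.0),
    (0.0, 1.0),
    (1.0, 1.0),
]


def quantize(frame):
    """Return the index of the codebook entry nearest to one frame vector."""
    def sq_dist(entry):
        return sum((a - b) ** 2 for a, b in zip(frame, entry))

    return min(range(len(CODEBOOK)), key=lambda i: sq_dist(CODEBOOK[i]))


def speech_to_token_ids(frames):
    """Map frames to speech token ids placed after the text vocabulary."""
    return [TEXT_VOCAB_SIZE + quantize(frame) for frame in frames]


def build_sequence(text_ids, frames):
    """Concatenate text ids and speech ids into one next-token-prediction stream."""
    return list(text_ids) + speech_to_token_ids(frames)


frames = [(0.1, 0.2), (0.9, 0.1), (0.8, 0.9)]
sequence = build_sequence([3, 5], frames)
print(sequence)  # [3, 5, 8, 9, 11]
```

Because every frame yields exactly one token from one codebook, the resulting sequence has the same shape as a text-only sequence, which is what would let an unmodified decoder-only Transformer consume it under these assumptions.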

Code & Models

Repositories

zhenye234/LLaSA_training (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research

Methods: Attention Is All You Need · Discriminative Fine-Tuning · Cosine Annealing · Attention Dropout · Label Smoothing · Byte Pair Encoding · Layer Normalization · Residual Connection · Dense Connections