Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis
Zhen Ye, Xinfa Zhu, Chi-Min Chan, Xinsheng Wang, Xu Tan, Jiahe Lei, Yi Peng, Haohe Liu, Yizhu Jin, Zheqi Dai, Hongzhan Lin, Jianyi Chen, Xingjian Du, Liumeng Xue, Yunlin Chen, Zhifei Li, Lei Xie, Qiuqiang Kong, Yike Guo, and Wei Xue

TL;DR
This paper introduces Llasa, a scalable speech synthesis framework using a single Transformer and VQ codec, demonstrating improved naturalness, prosody, and expressiveness through scaled training and inference compute, with publicly available models and code.
Contribution
The work presents Llasa, a unified TTS model aligned with LLMs, and explores scaling compute at training and inference stages to enhance speech quality and expressiveness.
Findings
Scaling train-time compute improves speech naturalness.
Scaling inference-time compute enhances expressiveness and accuracy.
Public release of models and training code for reproducibility.
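The summary above does not say how the extra inference-time compute is spent. One generic way to trade compute for quality is best-of-N sampling: draw several candidates and keep the one a scoring function prefers. The sketch below illustrates only that general idea; `generate`, `score`, and `best_of_n` are hypothetical names, the "candidates" are plain numbers rather than speech, and nothing here is confirmed as the procedure used for Llasa.

```python
import random
from typing import Callable, Tuple


def best_of_n(
    generate: Callable[[random.Random], float],
    score: Callable[[float], float],
    n: int,
    seed: int = 0,
) -> Tuple[float, float]:
    """Draw n candidates and return (best_score, best_candidate).

    Hypothetical helper: in a real TTS system `generate` would sample an
    utterance and `score` would be some quality or accuracy measure.
    """
    rng = random.Random(seed)
    candidates = [generate(rng) for _ in range(n)]
    scored = [(score(c), c) for c in candidates]
    return max(scored, key=lambda pair: pair[0])


# Toy stand-ins (assumptions): a candidate is a random number in [0, 1),
# and its score is the number itself.
def toy_generate(rng: random.Random) -> float:
    return rng.random()


def toy_score(candidate: float) -> float:
    return candidate


if __name__ == "__main__":
    for n in (1, 4, 16):
        best_score, _ = best_of_n(toy_generate, toy_score, n)
        print(n, round(best_score, 3))
```

With a fixed seed, the candidates drawn for a smaller `n` are a prefix of those drawn for a larger `n`, so the best score never decreases as `n` grows. Cost grows linearly with `n`.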
Abstract
Recent advances in text-based large language models (LLMs), particularly in the GPT series and the o1 model, have demonstrated the effectiveness of scaling both training-time and inference-time compute. However, current state-of-the-art TTS systems leveraging LLMs are often multi-stage, requiring separate models (e.g., diffusion models after LLM), complicating the decision of whether to scale a particular model during training or testing. This work makes the following contributions: First, we explore the scaling of train-time and inference-time compute for speech synthesis. Second, we propose a simple framework Llasa for speech synthesis that employs a single-layer vector quantizer (VQ) codec and a single Transformer architecture to fully align with standard LLMs such as Llama. Our experiments reveal that scaling train-time compute for Llasa consistently improves the naturalness of…
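The abstract describes a single-layer vector quantizer (VQ) codec, which lets speech be represented as one stream of discrete tokens that a standard Transformer can model. As a rough illustration of what a single-codebook quantizer does, the sketch below maps each frame vector to the index of its nearest codebook entry. The codebook, the frame values, and the function names are all made up for the example; the actual codec is learned and far larger.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each frame to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(frame, codebook[k]))
        for frame in frames
    ]


def dequantize(token_ids: Sequence[int], codebook: Sequence[Vector]) -> List[List[float]]:
    """Look up the codebook vector for each token id."""
    return [list(codebook[i]) for i in token_ids]


# Toy 2-D codebook with three entries (illustrative values only).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
frames = [[0.1, -0.2], [0.9, 1.2], [-0.8, 0.7]]

if __name__ == "__main__":
    token_ids = quantize(frames, codebook)
    print(token_ids)  # [0, 1, 2]
    print(dequantize(token_ids, codebook))
```

Because there is only one codebook, each frame becomes exactly one integer, so the resulting sequence can be handled like text tokens.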
Code & Models
- 🤗 HKUSTAudio/xcodec2 (model): 66k downloads, 98 likes
- 🤗 HKUSTAudio/Llasa-1B (model): 11k downloads, 102 likes
- 🤗 HKUSTAudio/Llasa-3B (model): 371 downloads, 526 likes
- 🤗 HKUSTAudio/Llasa-8B (model): 127 downloads, 96 likes
- 🤗 HKUSTAudio/Llasa-1B-Multilingual (model): 1.0k downloads, 43 likes
- 🤗 GameRuiner/Llasa-1B (model): 3 downloads
- 🤗 HKUSTAudio/Llasa-1B-Preserve-TextChat (model): 5 downloads, 2 likes
- 🤗 HKUSTAudio/Llasa-3B-Preserve-TextChat (model): 2 downloads, 2 likes
- 🤗 HKUSTAudio/Llasa-1B-two-speakers-kore-puck (model): 9 downloads, 5 likes
- 🤗 HKUSTAudio/Llasa-1B-multi-speakers-genshin-zh-en-ja-ko (model): 2 downloads, 5 likes
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research
Methods: Attention Is All You Need · Discriminative Fine-Tuning · Cosine Annealing · Attention Dropout · Label Smoothing · Byte Pair Encoding · Layer Normalization · Residual Connection · Dense Connections
