GLM-TTS Technical Report

Jiayan Cui; Zhihan Yang; Naihan Li; Jiankun Tian; Xingyu Ma; Yi Zhang; Guangyu Chen; Runxuan Yang; Yuqing Cheng; Yizhi Zhou; Guochen Yu; Xiaotao Gu; Jie Tang

arXiv:2512.14291 · cs.SD · December 17, 2025

TL;DR

GLM-TTS is a high-fidelity, efficient, and controllable text-to-speech system that leverages a two-stage architecture and reinforcement learning to optimize speech quality and customization with limited training data.

Contribution

The paper introduces a novel two-stage TTS architecture with reinforcement learning and parameter-efficient customization for production-level speech synthesis.

Findings

01 Achieves state-of-the-art performance on open-source benchmarks
02 Utilizes only 100k hours of training data
03 Enables real-time, controllable speech synthesis

Abstract

This work proposes GLM-TTS, a production-level TTS system designed for efficiency, controllability, and high-fidelity speech generation. GLM-TTS follows a two-stage architecture, consisting of a text-to-token autoregressive model and a token-to-waveform diffusion model. With only 100k hours of training data, GLM-TTS achieves state-of-the-art performance on multiple open-source benchmarks. To meet production requirements, GLM-TTS improves speech quality through an optimized speech tokenizer with fundamental frequency constraints and a GRPO-based multi-reward reinforcement learning framework that jointly optimizes pronunciation, speaker similarity, and expressive prosody. In parallel, the system enables efficient and controllable deployment via parameter-efficient LoRA-based voice customization and a hybrid phoneme-text input scheme that provides precise pronunciation control. Our code is…
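The group-relative advantage at the core of GRPO can be illustrated with a minimal sketch. It assumes that the per-sample rewards (pronunciation, speaker similarity, prosody) are combined by a weighted sum before normalization within a group of candidates generated for the same input. The abstract does not state how GLM-TTS combines its rewards, so the weights, the reward names, the example values, and the helper functions `combine_rewards` and `group_relative_advantages` are all hypothetical and serve only as illustration.

```python
from statistics import mean, pstdev

# Hypothetical weights for the three reward signals named in the abstract.
# The actual GLM-TTS weighting is not given in the source.
WEIGHTS = {"pronunciation": 0.5, "similarity": 0.3, "prosody": 0.2}


def combine_rewards(rewards, weights):
    """Collapse several reward signals into one scalar via a weighted sum."""
    return sum(weights[name] * rewards[name] for name in weights)


def group_relative_advantages(scores, eps=1e-8):
    """Normalize each score against the mean and std of its own group.

    This is the group-relative baseline used by GRPO: no learned value
    function, only statistics of the candidates sampled for one prompt.
    """
    mu = mean(scores)
    sigma = pstdev(scores)
    return [(s - mu) / (sigma + eps) for s in scores]


# Illustrative rewards for four candidate utterances of the same text.
# All numbers are made up for the example.
group = [
    {"pronunciation": 0.98, "similarity": 0.80, "prosody": 0.60},
    {"pronunciation": 0.90, "similarity": 0.85, "prosody": 0.70},
    {"pronunciation": 0.70, "similarity": 0.60, "prosody": 0.50},
    {"pronunciation": 0.95, "similarity": 0.75, "prosody": 0.90},
]

scores = [combine_rewards(r, WEIGHTS) for r in group]
advantages = group_relative_advantages(scores)

for score, adv in zip(scores, advantages):
    print(f"score={score:.3f}  advantage={adv:+.3f}")
```

Candidates that score above their group's mean receive a positive advantage and are reinforced, while those below the mean are penalized. If every candidate in a group scores the same, all advantages are zero and the group contributes no gradient signal.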

Code & Models

Models

zai-org/GLM-TTS (Hugging Face model · 1.2k downloads · 329 likes)

Taxonomy

Topics: Speech Recognition and Synthesis · Voice and Speech Disorders · Phonetics and Phonology Research