MOSS-TTS Technical Report
Yitian Gong, Botian Jiang, Yiwei Zhao, Yucheng Yuan, Kuangwei Chen, Yaozhou Jiang, Cheng Chang, Dong Hong, Mingshu Chen, Ruixiao Li, Yiyang Zhang, Yang Gao, Hanfu Chen, Ke Chen, Songlin Wang, Xiaogui Yang, Yuqian Zhang, Kexin Huang, ZhengYuan Lin, Kang Yu, Ziqi Chen, Jin Wang

TL;DR
MOSS-TTS is a scalable speech generation foundation model built on discrete audio tokens and autoregressive modeling. It supports multilingual synthesis, zero-shot voice cloning, and token-level control over duration and pronunciation.
Contribution
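Introduces MOSS-TTS, a scalable speech synthesis model built on a new audio tokenizer (MOSS-Audio-Tokenizer) and released as two generators: MOSS-TTS, aimed at simplicity and long-context control, and MOSS-TTS-Local-Transformer, aimed at modeling efficiency and a shorter time to first audio.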
Findings
- Supports zero-shot voice cloning
- Enables token-level duration and pronunciation control
- Provides stable long-form speech generation
Abstract
This technical report presents MOSS-TTS, a speech generation foundation model built on a scalable recipe: discrete audio tokens, autoregressive modeling, and large-scale pretraining. Built on MOSS-Audio-Tokenizer, a causal Transformer tokenizer that compresses 24 kHz audio to 12.5 fps with variable-bitrate RVQ and unified semantic-acoustic representations, we release two complementary generators: MOSS-TTS, which emphasizes structural simplicity, scalability, and long-context/control-oriented deployment, and MOSS-TTS-Local-Transformer, which introduces a frame-local autoregressive module for higher modeling efficiency, stronger speaker preservation, and a shorter time to first audio. Across multilingual and open-domain settings, MOSS-TTS supports zero-shot voice cloning, token-level duration control, phoneme-/pinyin-level pronunciation control, smooth code-switching, and stable long-form…
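The tokenizer figures in the abstract (24 kHz audio, 12.5 frames per second) imply some simple arithmetic, sketched below. The sample rate and frame rate come from the abstract. The function names, the number of RVQ codebooks, and the codebook size are illustrative assumptions, since the abstract gives no values for them.

```python
import math

SAMPLE_RATE_HZ = 24_000  # from the abstract
FRAME_RATE_FPS = 12.5    # from the abstract


def samples_per_frame(sample_rate_hz=SAMPLE_RATE_HZ, frame_rate_fps=FRAME_RATE_FPS):
    """Number of waveform samples summarized by one token frame."""
    return sample_rate_hz / frame_rate_fps


def num_frames(duration_s, frame_rate_fps=FRAME_RATE_FPS):
    """Token frames needed to cover a clip of the given duration (rounded up)."""
    return math.ceil(duration_s * frame_rate_fps)


def bitrate_bps(num_codebooks, codebook_size, frame_rate_fps=FRAME_RATE_FPS):
    """Bitrate of an RVQ stream that keeps `num_codebooks` layers per frame.

    Both arguments are hypothetical here: the abstract only says the RVQ is
    variable-bitrate, which corresponds to varying `num_codebooks`.
    """
    return frame_rate_fps * num_codebooks * math.log2(codebook_size)


if __name__ == "__main__":
    print(samples_per_frame())   # samples per frame at 24 kHz and 12.5 fps
    print(num_frames(60))        # frames for a one-minute clip
    print(bitrate_bps(8, 1024))  # assumed 8 layers of 1024 entries each
```

Under these assumptions, each frame covers 1920 samples (80 ms), so a one-minute clip is only 750 frames long. That short sequence length is consistent with the abstract's emphasis on long-context and long-form generation. Keeping fewer RVQ layers lowers the bitrate proportionally.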
Code & Models
- 🤗 OpenMOSS-Team/MOSS-TTS · model · 77k dl · ♡ 359
- 🤗 OpenMOSS-Team/MOSS-TTS-Realtime · model · 70k dl · ♡ 72
- 🤗 OpenMOSS-Team/MOSS-TTS-GGUF · model · 2.9k dl · ♡ 14
- 🤗 OpenMOSS-Team/MOSS-TTS-Local-Transformer · model · 48k dl · ♡ 23
- 🤗 mlx-community/MOSS-TTS-8B-8bit · model · 38 dl · ♡ 1
- 🤗 ToSee-Norway/MOSS-TTS-Norwegian-LoRA · model · 13 dl
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Face recognition and analysis
