MOSS-VoiceGenerator: Create Realistic Voices with Natural Language Descriptions

Kexin Huang; Liwei Fan; Botian Jiang; Yaozhou Jiang; Qian Tu; Jie Zhu; Yuqian Zhang; Yiwei Zhao; Chenchen Yang; Zhaoye Fei; Shimin Li; Xiaogui Yang; Qinyuan Cheng; Xipeng Qiu

arXiv:2603.28086·cs.SD·March 31, 2026


TL;DR

MOSS-VoiceGenerator is an open-source model that creates realistic, expressive voices from natural language descriptions by training on cinematic speech data, outperforming existing models in naturalness and instruction-following.

Contribution

The paper introduces an approach to voice design from text prompts that trains on large-scale expressive speech data, improving the naturalness and versatility of generated voices.

Findings

01. Subjective studies show superior naturalness over existing models.

02. Training on cinematic content improves perceptual realism.

03. The model effectively follows natural language instructions for voice creation.

Abstract

Voice design from natural language aims to generate speaker timbres directly from free-form textual descriptions, allowing users to create voices tailored to specific roles, personalities, and emotions. Such controllable voice creation benefits a wide range of downstream applications, including storytelling, game dubbing, role-play agents, and conversational assistants, which makes it a significant task for modern Text-to-Speech models. However, existing models are largely trained on carefully recorded studio data, which produces speech that is clean and well articulated yet lacks the lived-in qualities of real human voices. To address this limitation, we present MOSS-VoiceGenerator, an open-source instruction-driven voice generation model that creates new timbres directly from natural language prompts. Motivated by the hypothesis that exposure to real-world acoustic variation produces…
