MOSS-TTSD: Text to Spoken Dialogue Generation

Yuqian Zhang; Donghua Yu; Zhengyuan Lin; Botian Jiang; Mingshu Chen; Yaozhou Jiang; Yiwei Zhao; Yiyang Zhang; Yucheng Yuan; Hanfu Chen; Kexin Huang; Jun Zhan; Cheng Chang; Zhaoye Fei; Shimin Li; Xiaogui Yang; Qinyuan Cheng; Xipeng Qiu

arXiv:2603.19739 · cs.SD · March 23, 2026

TL;DR

MOSS-TTSD is a novel spoken dialogue generation model that produces expressive, multi-party conversations with long-term coherence, multi-language support, and zero-shot voice cloning, addressing key challenges in dialogue synthesis.

Contribution

The paper introduces MOSS-TTSD, a new model for long-form, multi-party spoken dialogue generation with enhanced context modeling and a novel evaluation framework, TTSD-eval.

Findings

01. Outperforms existing models in dialogue synthesis quality.
02. Supports up to 60 minutes of single-pass synthesis and dialogues with up to 5 speakers.
03. Performs zero-shot voice cloning from a short reference audio clip.

Abstract

Spoken dialogue generation is crucial for applications like podcasts, dynamic commentary, and entertainment content, but poses significant challenges compared to single-utterance text-to-speech (TTS). Key requirements include accurate turn-taking, cross-turn acoustic consistency, and long-form stability, which current models often fail to address due to a lack of dialogue context modeling. To bridge this gap, we present MOSS-TTSD, a spoken dialogue synthesis model designed for expressive, multi-party conversational speech across multiple languages. With enhanced long-context modeling, MOSS-TTSD generates long-form spoken conversations from dialogue scripts with explicit speaker tags, supporting up to 60 minutes of single-pass synthesis, multi-party dialogue with up to 5 speakers, and zero-shot voice cloning from a short reference audio clip. The model supports various mainstream…
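
As an illustration of the input described in the abstract (a dialogue script with explicit speaker tags and at most 5 speakers), the following is a minimal sketch of how such a script might be split into turns before synthesis. The `[S1]`, `[S2]`, ... tag syntax, the `parse_script` helper, and the `MAX_SPEAKERS` constant are assumptions made for this example; the summary above does not specify the exact script format or any API.

```python
import re
from typing import List, Tuple

# Speaker limit taken from the abstract ("up to 5 speakers").
MAX_SPEAKERS = 5

# Assumed (hypothetical) tag syntax: "[S1] text [S2] text ...".
TAG = re.compile(r"\[S(\d+)\]")


def parse_script(script: str) -> List[Tuple[int, str]]:
    """Split a tagged dialogue script into (speaker_id, utterance) turns."""
    # With one capture group, re.split returns:
    # [text_before_first_tag, id1, text1, id2, text2, ...]
    parts = TAG.split(script)
    if parts[0].strip():
        raise ValueError("text found before the first speaker tag")

    turns = []
    for speaker_id, text in zip(parts[1::2], parts[2::2]):
        text = text.strip()
        if text:  # skip tags that carry no utterance
            turns.append((int(speaker_id), text))

    speakers = {speaker_id for speaker_id, _ in turns}
    if len(speakers) > MAX_SPEAKERS:
        raise ValueError(
            f"{len(speakers)} speakers found, limit is {MAX_SPEAKERS}"
        )
    return turns


if __name__ == "__main__":
    demo = "[S1] Welcome back to the show. [S2] Glad to be here. [S1] Let's begin."
    for speaker_id, utterance in parse_script(demo):
        print(f"speaker {speaker_id}: {utterance}")
```

Validating the speaker count up front keeps an over-long cast from reaching the synthesis step, where the stated 5-speaker limit would apply.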

Taxonomy

Topics: Speech and dialogue systems · Topic Modeling · Speech Recognition and Synthesis