JoyVoice: Long-Context Conditioning for Anthropomorphic Multi-Speaker Conversational Synthesis

Fan Yu; Tao Wang; You Wu; Lin Zhu; Wei Deng; Weisheng Han; Wenchao Wang; Lin Hu; Xiangyu Liang; Xiaodong He; Yankun Huang; Yu Gu; Yuan Liu; Yuxuan Wang; Zhangyu Xiao; Ziteng Wang; Boya Dong; Feng Dang; Jinming Chen; Jingdong Li; Jun Wang; Yechen Jin; Yuan Zhang; Zhengyan Sheng; Xin Wang

arXiv:2512.19090 · cs.SD · December 23, 2025

TL;DR

JoyVoice is a multi-speaker conversational speech synthesis model that supports boundary-free, long-form, multilingual generation and zero-shot voice cloning with superior naturalness and prosody, built on a unified end-to-end Transformer architecture.

Contribution

Introduces JoyVoice, a unified end-to-end transformer-based model for flexible, boundary-free multi-speaker conversational speech synthesis with a novel MM-Tokenizer and robust data processing.

Findings

01. Achieves state-of-the-art multilingual generation and zero-shot voice cloning.

02. Demonstrates superior prosodic continuity and naturalness in long-form speech.

03. Outperforms existing models on Seed-TTS-Eval and multi-speaker tasks.

Abstract

Large speech generation models are evolving from single-speaker, short-sentence synthesis to multi-speaker, long-conversation generation. Current long-form speech generation models are predominantly constrained to dyadic, turn-based interactions. To address this, we introduce JoyVoice, a novel anthropomorphic foundation model designed for flexible, boundary-free synthesis of up to eight speakers. Unlike conventional cascaded systems, JoyVoice employs a unified E2E-Transformer-DiT architecture that uses autoregressive hidden representations directly as diffusion inputs, enabling holistic end-to-end optimization. We further propose an MM-Tokenizer operating at a low bitrate of 12.5 Hz, which integrates multitask semantic and MMSE losses to effectively model both semantic and acoustic information. Additionally, the model incorporates robust text front-end processing via large-scale…
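To give a feel for what the 12.5 Hz figure quoted in the abstract implies for long-form synthesis, the following is a minimal arithmetic sketch. It assumes the rate means one token per frame at 12.5 frames per second, which the abstract does not spell out; the function names are hypothetical and are not part of JoyVoice.

```python
import math

# Token rate quoted in the abstract. Treating it as "one token per frame"
# is an assumption made for this illustration only.
TOKEN_RATE_HZ = 12.5


def frame_duration_ms(rate_hz: float = TOKEN_RATE_HZ) -> float:
    """Milliseconds of audio covered by a single token at the given rate."""
    return 1000.0 / rate_hz


def tokens_for_duration(seconds: float, rate_hz: float = TOKEN_RATE_HZ) -> int:
    """Number of tokens needed to cover `seconds` of audio (rounded up)."""
    return math.ceil(seconds * rate_hz)


if __name__ == "__main__":
    print(frame_duration_ms())        # milliseconds of audio per token
    print(tokens_for_duration(60))    # tokens for one minute of audio
    print(tokens_for_duration(600))   # tokens for a ten-minute conversation
```

Under this assumption each token spans 80 ms of audio, so a ten-minute conversation corresponds to 7,500 tokens, which suggests why a low token rate matters for long-context conditioning.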

Taxonomy

Topics: Speech Recognition and Synthesis · Voice and Speech Disorders · Topic Modeling