SoulX-Podcast: Towards Realistic Long-form Podcasts with Dialectal and Paralinguistic Diversity

Hanke Xie; Haopeng Lin; Wenxiao Cao; Dake Guo; Wenjie Tian; Jun Wu; Hanlin Wen; Ruixuan Shang; Hongmei Liu; Zhiqi Jiang; Yuepeng Jiang; Wenxi Chen; Ruiqi Yan; Jiale Qian; Yichao Yan; Shunshun Yin; Ming Tao; Xie Chen; Lei Xie; and Xinsheng Wang

arXiv:2510.23541 · eess.AS · October 29, 2025

TL;DR

SoulX-Podcast is a multi-speaker, multi-turn speech synthesis system that generates natural, dialectally diverse podcast dialogues with stable speaker identity and adaptive prosody, advancing the realism of conversational TTS.

Contribution

The paper introduces a system that produces realistic multi-speaker, multi-turn dialogues with dialectal and paralinguistic diversity, and it reports state-of-the-art results in conversational speech synthesis.

Findings

01. Continuously produces over 90 minutes of stable multi-speaker dialogue.

02. Achieves state-of-the-art performance in both monologue and conversational TTS.

03. Supports Mandarin and English, as well as several Chinese dialects, including Sichuanese, Henanese, and Cantonese.

Abstract

Recent advances in text-to-speech (TTS) synthesis have significantly improved speech expressiveness and naturalness. However, most existing systems are tailored for single-speaker synthesis and fall short in generating coherent multi-speaker conversational speech. This technical report presents SoulX-Podcast, a system designed for podcast-style multi-turn, multi-speaker dialogic speech generation, while also achieving state-of-the-art performance in conventional TTS tasks. To meet the higher naturalness demands of multi-turn spoken dialogue, SoulX-Podcast integrates a range of paralinguistic controls and supports both Mandarin and English, as well as several Chinese dialects, including Sichuanese, Henanese, and Cantonese, enabling more personalized podcast-style speech generation. Experimental results demonstrate that SoulX-Podcast can continuously produce over 90 minutes of…
