Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation
Haorui He, Zengqiang Shang, Chaoren Wang, Xuyuan Li, Yicheng Gu, Hua Hua, Liwei Liu, Chen Yang, Jiaqi Li, Peiyang Shi, Yuancheng Wang, Kai Chen, Pengyuan Zhang, Zhizheng Wu

TL;DR
This paper introduces Emilia, a large, diverse, multilingual speech dataset derived from real-world sources. Speech generation models trained on it sound more natural and spontaneous, and produce more realistic and diverse speech, than models trained on traditional datasets.
Contribution
The paper presents Emilia-Pipe, an open-source pipeline for extracting high-quality, spontaneous speech data from in-the-wild sources, and uses it to construct the Emilia and Emilia-Large datasets, significantly expanding the resources available for speech generation.
Findings
Models trained on Emilia produce more human-like, spontaneous speech.
Models trained on Emilia match models trained on traditional datasets in intelligibility.
Scaling dataset size improves speech generation quality.
Abstract
Recent advancements in speech generation have been driven by large-scale training datasets. However, current models struggle to capture the spontaneity and variability inherent in real-world human speech, as they are primarily trained on audio-book datasets limited to formal, read-aloud speaking styles. To address this limitation, we introduce Emilia-Pipe, an open-source preprocessing pipeline designed to extract high-quality training data from valuable yet under-explored in-the-wild sources that capture spontaneous human speech in real-world contexts. Using Emilia-Pipe, we construct Emilia, which comprises over 101k hours of speech across six languages: English, Chinese, German, French, Japanese, and Korean. Furthermore, we expand Emilia to Emilia-Large, a dataset exceeding 216k hours, making it one of the largest open-source speech generation resources available. Extensive experiments…
Code & Models
- 🤗 nineninesix/kani-tts-400m-0.3-pt · model · 293 dl · ♡ 12
- 🤗 nineninesix/kani-tts-400m-ky · model · 134 dl · ♡ 4
- 🤗 nineninesix/kani-tts-400m-en · model · 14k dl · ♡ 39
- 🤗 nineninesix/kani-tts-400m-ar · model · 826 dl · ♡ 5
- 🤗 nineninesix/kani-tts-400m-es · model · 219 dl · ♡ 1
- 🤗 nineninesix/kani-tts-400m-de · model · 190 dl · ♡ 2
- 🤗 nineninesix/kani-tts-400m-zh · model · 67 dl · ♡ 1
- 🤗 nineninesix/kani-tts-400m-ko · model · 30 dl · ♡ 6
- 🤗 nineninesix/kani-tts-400m-ky-kani · model · 8 dl · ♡ 1
- 🤗 Mungert/kani-tts-400m-en-GGUF · model · 110 dl
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and dialogue systems
