Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation

Haorui He; Zengqiang Shang; Chaoren Wang; Xuyuan Li; Yicheng Gu; Hua Hua; Liwei Liu; Chen Yang; Jiaqi Li; Peiyang Shi; Yuancheng Wang; Kai Chen; Pengyuan Zhang; Zhizheng Wu

arXiv:2501.15907 · cs.SD · October 9, 2025

TL;DR

This paper introduces Emilia, a large, diverse, multilingual speech dataset derived from real-world sources. Speech generation models trained on it produce more natural and spontaneous speech than models trained on traditional audiobook datasets.

Contribution

The paper presents Emilia-Pipe, an open-source preprocessing pipeline for extracting high-quality, spontaneous speech data from in-the-wild sources, and uses it to construct the Emilia (over 101k hours) and Emilia-Large (over 216k hours) datasets, expanding the open resources available for speech generation.

Findings

01 Models trained on Emilia produce more human-like, spontaneous speech.

02 Models trained on Emilia match models trained on traditional datasets in intelligibility.

03 Scaling dataset size improves speech generation quality.

Abstract

Recent advancements in speech generation have been driven by large-scale training datasets. However, current models struggle to capture the spontaneity and variability inherent in real-world human speech, as they are primarily trained on audio-book datasets limited to formal, read-aloud speaking styles. To address this limitation, we introduce Emilia-Pipe, an open-source preprocessing pipeline designed to extract high-quality training data from valuable yet under-explored in-the-wild sources that capture spontaneous human speech in real-world contexts. Using Emilia-Pipe, we construct Emilia, which comprises over 101k hours of speech across six languages: English, Chinese, German, French, Japanese, and Korean. Furthermore, we expand Emilia to Emilia-Large, a dataset exceeding 216k hours, making it one of the largest open-source speech generation resources available. Extensive experiments…

Code & Models

Repositories

open-mmlab/amphion (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems