LEMAS: A 150K-Hour Large-scale Extensible Multilingual Audio Suite with Generative Speech Models

Zhiyuan Zhao; Lijian Lin; Ye Zhu; Kai Xie; Yunfei Liu; Yu Li

arXiv:2601.04233·cs.SD·January 9, 2026

Open Access · 2 Models · 2 Datasets

TL;DR

LEMAS is a comprehensive, large-scale multilingual speech dataset with 150,000 hours of annotated audio, enabling robust generative speech models for synthesis and editing across ten languages.

Contribution

The paper introduces the LEMAS-Dataset, which the authors describe as, to their knowledge, the largest open-source multilingual speech corpus with word-level timestamps, and demonstrates its effectiveness by training two benchmark models on it: one for speech synthesis and one for speech editing.

Findings

01

Models trained on LEMAS achieve high-quality multilingual synthesis.

02

The dataset enables effective speech editing with natural transitions.

03

Accent-adversarial training improves cross-lingual synthesis stability.

Abstract

We present the LEMAS-Dataset, which, to our knowledge, is currently the largest open-source multilingual speech corpus with word-level timestamps. Covering over 150,000 hours across 10 major languages, LEMAS-Dataset is constructed via an efficient data processing pipeline that ensures high-quality data and annotations. To validate the effectiveness of LEMAS-Dataset across diverse generative paradigms, we train two benchmark models with distinct architectures and task specializations on this dataset. LEMAS-TTS, built upon a non-autoregressive flow-matching framework, leverages the dataset's massive scale and linguistic diversity to achieve robust zero-shot multilingual synthesis. Our proposed accent-adversarial training and CTC loss mitigate cross-lingual accent issues, enhancing synthesis stability. Complementarily, LEMAS-Edit employs an autoregressive decoder-only architecture that…
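
To illustrate why word-level timestamps matter for speech editing, here is a minimal Python sketch that maps a range of words to the audio samples an editing model would need to regenerate. The `Word` record, the `edit_span` helper, the 16 kHz sample rate, and the example utterance are all assumptions made for illustration; they are not the LEMAS-Dataset schema or the LEMAS-Edit method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """Hypothetical word-level annotation: text plus start/end time in seconds."""
    text: str
    start: float
    end: float


def edit_span(words, first, last, sample_rate=16000):
    """Return (start, end) sample indices covering words[first] through words[last].

    The sample rate is an assumed value, not one taken from the paper.
    """
    if not 0 <= first <= last < len(words):
        raise IndexError("word range out of bounds")
    start = round(words[first].start * sample_rate)
    end = round(words[last].end * sample_rate)
    return start, end


# Made-up annotation for a 1.5 s utterance.
utterance = [
    Word("the", 0.00, 0.25),
    Word("quick", 0.25, 0.75),
    Word("fox", 0.75, 1.50),
]

# Samples to regenerate if an editor replaces the word "quick".
print(edit_span(utterance, 1, 1))  # (4000, 12000)
```

With only utterance-level transcripts, the boundaries of the region to replace would have to be estimated by a separate alignment step; word-level timestamps make that lookup direct.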

Taxonomy

Topics: Speech Recognition and Synthesis · Generative Adversarial Networks and Image Synthesis · Natural Language Processing Techniques