Raon-OpenTTS: Open Models and Data for Robust Text-to-Speech
Semin Kim, Seungjun Chung, Taehong Moon, Sangheon Lee, Minyoung Ahn, Keon Lee, Nam Soo Kim, Jaewoong Cho, Ludwig Schmidt, Kangwook Lee, Dongmin Park

TL;DR
Raon-OpenTTS releases a large-scale open dataset and a family of TTS models whose speech quality and robustness are competitive with state-of-the-art closed-data models, supporting reproducible TTS research.
Contribution
The paper presents Raon-OpenTTS-Pool, a large open speech dataset, and Raon-OpenTTS, a series of diffusion transformer (DiT)-based TTS models that perform competitively with state-of-the-art closed-data models using publicly available data.
Findings
Raon-OpenTTS-1B achieves 1.78% WER and 0.749 SIM on Seed-TTS-Eval.
Raon-OpenTTS-1B ranks second in WER and first in SIM among open-weight TTS models.
Raon-OpenTTS outperforms previous models on robustness benchmarks.
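The two metrics quoted above can be illustrated with a minimal sketch. It assumes WER is the word-level edit distance between a reference text and a transcript of the synthesized audio, divided by the reference length, and that SIM is the cosine similarity between two speaker-embedding vectors. The ASR and speaker-embedding models that Seed-TTS-Eval uses to produce those transcripts and embeddings are not reproduced here; the function names `word_error_rate` and `cosine_similarity` and the toy inputs are illustrative only.

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a, b) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy inputs: one substitution and one deletion over six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    # Toy two-dimensional "embeddings" pointing in similar directions.
    print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))
```

A benchmark-level score such as 1.78% WER would come from aggregating this kind of per-utterance measurement over the whole test set; the exact text normalization and aggregation used by the benchmark are not stated in this summary.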
Abstract
Recent advances in text-to-speech (TTS) models show impressive speech naturalness and quality, yet the role of large-scale open data in driving this progress remains underexplored. In this work, we introduce Raon-OpenTTS, an open TTS model that performs competitively with state-of-the-art closed-data TTS models, and Raon-OpenTTS-Pool, a large-scale open dataset for reproducible TTS training. Raon-OpenTTS-Pool consists of 615K hours of 240M speech segments aggregated from publicly available English speech corpora and web-sourced recordings. With a model-based filtering pipeline applied to Raon-OpenTTS-Pool, we derive Raon-OpenTTS-Core, a curated, high-quality subset of 510K hours and 194M speech segments. Using Raon-OpenTTS-Core, we train Raon-OpenTTS, a series of diffusion transformer (DiT)-based TTS models from 0.3B to 1B parameters. On multiple benchmarks, Raon-OpenTTS-1B shows…
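The dataset sizes quoted in the abstract imply a few derived quantities that are easy to check. The sketch below uses only the rounded figures given above (615K hours and 240M segments for the pool, 510K hours and 194M segments for the core subset), so the results are approximate; the constant and function names are illustrative, not taken from the paper.

```python
# Rounded figures as quoted in the abstract.
POOL_HOURS = 615_000
POOL_SEGMENTS = 240_000_000
CORE_HOURS = 510_000
CORE_SEGMENTS = 194_000_000


def mean_segment_seconds(hours: float, segments: int) -> float:
    """Average segment duration in seconds."""
    return hours * 3600 / segments


def retention(core: float, pool: float) -> float:
    """Fraction of the pool that survives filtering."""
    return core / pool


if __name__ == "__main__":
    print(f"Pool mean segment: {mean_segment_seconds(POOL_HOURS, POOL_SEGMENTS):.1f} s")  # 9.2 s
    print(f"Core mean segment: {mean_segment_seconds(CORE_HOURS, CORE_SEGMENTS):.1f} s")  # 9.5 s
    print(f"Hours retained:    {retention(CORE_HOURS, POOL_HOURS):.1%}")        # 82.9%
    print(f"Segments retained: {retention(CORE_SEGMENTS, POOL_SEGMENTS):.1%}")  # 80.8%
```

Under these rounded numbers, filtering keeps roughly five sixths of the audio, and the retained segments are slightly longer on average than those in the full pool.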
