TTSOps: A Closed-Loop Corpus Optimization Framework for Training Multi-Speaker TTS Models from Dark Data

Kentaro Seki; Shinnosuke Takamichi; Takaaki Saeki; Hiroshi Saruwatari

arXiv:2506.15614 · cs.SD · November 12, 2025

TL;DR

TTSOps is an automated, closed-loop framework that builds multi-speaker TTS systems from noisy web data. It selects and cleanses training data dynamically according to each sample's impact on the model, which improves the naturalness and speaker diversity of the synthesized speech.

Contribution

It introduces a novel data-centric pipeline that jointly optimizes data selection and cleansing for training robust multi-speaker TTS models from dark data.

Findings

1. Outperforms baseline methods in naturalness of speech
2. Increases speaker diversity in synthesized speech
3. Effective in noisy, uncurated web data environments

Abstract

This paper presents TTSOps, a fully automated closed-loop framework for constructing multi-speaker text-to-speech (TTS) systems from noisy, uncurated web-scale speech data, often referred to as "dark data," such as online videos. Conventional TTS training pipelines require well-curated corpora with high acoustic quality and accurate text-speech alignment, which severely limits scalability, speaker diversity, and real-world applicability. While recent studies have proposed acoustic-quality-based data selection techniques, they often overlook two critical aspects: (1) the inherent robustness of modern TTS models to noise, and (2) the potential contribution of perceptually low-quality yet informative samples. To address these issues, TTSOps introduces a data-centric training pipeline that integrates three core components: (1) automated data collection from dark data sources, (2)…

Taxonomy

Topics: Topic Modeling · Speech Recognition and Synthesis