BASPRO: a balanced script producer for speech corpus collection based on the genetic algorithm
Yu-Wen Chen, Hsin-Min Wang, Yu Tsao

TL;DR
This paper introduces BASPRO, a genetic algorithm-based system for automatically generating phonetically balanced Chinese speech scripts, improving speech processing model performance by providing more representative training data.
Contribution
BASPRO is the first system to automatically produce phonetically balanced Chinese speech scripts using genetic algorithms, enhancing speech corpus quality for model training.
Findings
The generated script, TMNews (400 ten-character sentences), covers 84% of the syllables used in the real world.
Its syllable distribution has a cosine similarity of 0.96 with the real-world syllable distribution.
Models trained on the designed corpus outperform those trained on random data.
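The two script-level figures above (syllable coverage and cosine similarity) can be illustrated with a minimal Python sketch. It assumes coverage means the fraction of reference syllable types that occur at least once in the script; this summary does not give the paper's exact definition, and the function name `syllable_stats` and its inputs are hypothetical.

```python
import math
from collections import Counter


def syllable_stats(script_syllables, reference_counts):
    """Return (coverage, cosine similarity) of a script against reference counts.

    script_syllables: iterable of syllable strings found in the script.
    reference_counts: dict mapping syllable -> count in a reference corpus.
    Both inputs are illustrative assumptions, not the paper's data format.
    """
    script_counts = Counter(script_syllables)
    vocab = sorted(reference_counts)

    # Coverage (assumed definition): share of reference syllable types
    # that appear at least once in the script.
    covered = sum(1 for s in vocab if script_counts[s] > 0)
    coverage = covered / len(vocab)

    # Cosine similarity between the two count vectors over the reference vocabulary.
    dot = sum(script_counts[s] * reference_counts[s] for s in vocab)
    norm_script = math.sqrt(sum(script_counts[s] ** 2 for s in vocab))
    norm_ref = math.sqrt(sum(reference_counts[s] ** 2 for s in vocab))
    cosine = dot / (norm_script * norm_ref) if norm_script and norm_ref else 0.0
    return coverage, cosine
```

Cosine similarity depends only on the shape of the two distributions, not on their size, so a 400-sentence script can be compared directly with a much larger reference corpus.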
Abstract
The performance of speech-processing models is heavily influenced by the speech corpus that is used for training and evaluation. In this study, we propose the BAlanced Script PROducer (BASPRO) system, which can automatically construct a phonetically balanced and rich set of Chinese sentences for collecting Mandarin Chinese speech data. First, we used pretrained natural language processing systems to extract ten-character candidate sentences from a large corpus of Chinese news texts. Then, we applied a genetic algorithm-based method to select 20 phonetically balanced sentence sets, each containing 20 sentences, from the candidate sentences. Using BASPRO, we obtained a recording script called TMNews, which contains 400 ten-character sentences. TMNews covers 84% of the syllables used in the real world. Moreover, the syllable distribution has 0.96 cosine similarity to the real-world syllable…
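The genetic-algorithm selection step described in the abstract can be sketched roughly as follows. This is a generic genetic algorithm written under stated assumptions, not BASPRO's actual procedure: the fitness function (cosine similarity to a target syllable distribution), the crossover and mutation operators, and all names and parameters (`select_sentences`, `pop_size`, `mutation_rate`, and so on) are hypothetical. It also selects a single sentence set, whereas the paper selects 20 sets of 20 sentences, and it assumes each candidate sentence has already been converted to a list of syllables.

```python
import math
import random
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two Counters of syllable counts."""
    dot = sum(a[k] * b[k] for k in set(a) | set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def fitness(indices, candidates, target):
    """Score a sentence set by how closely its syllable counts match the target."""
    counts = Counter()
    for i in indices:
        counts.update(candidates[i])
    return cosine(counts, Counter(target))


def select_sentences(candidates, target, k, pop_size=30, generations=50,
                     mutation_rate=0.2, seed=0):
    """Pick k distinct candidate indices with a simple GA (requires k <= len(candidates))."""
    rng = random.Random(seed)
    n = len(candidates)
    # Each individual is a list of k distinct sentence indices.
    population = [rng.sample(range(n), k) for _ in range(pop_size)]

    def score(ind):
        return fitness(ind, candidates, target)

    for _ in range(generations):
        # Truncation selection: the better half survives unchanged (elitism).
        population.sort(key=score, reverse=True)
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            p1, p2 = rng.sample(parents, 2)
            # Crossover: draw k indices from the union of both parents.
            child = rng.sample(sorted(set(p1) | set(p2)), k)
            # Mutation: swap one index for a sentence not yet in the set.
            if rng.random() < mutation_rate:
                unused = [i for i in range(n) if i not in child]
                if unused:
                    child[rng.randrange(k)] = rng.choice(unused)
            children.append(child)
        population = parents + children
    return max(population, key=score)
```

Because the better half of each generation is carried over unchanged, the best fitness in the population never decreases from one generation to the next. A real system would likely add further constraints, such as avoiding sentence overlap between sets.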
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
