AutoPrep: An Automatic Preprocessing Framework for In-the-Wild Speech Data
Jianwei Yu, Hangting Chen, Yanyao Bian, Xiang Li, Yi Luo, Jinchuan Tian, Mengyang Liu, Jiayi Jiang, Shuai Wang

TL;DR
AutoPrep is an automated framework that preprocesses in-the-wild speech data to enhance speech quality, generate speaker labels, and produce transcriptions, making large-scale speech datasets easier to use in speech technology applications.
Contribution
The paper introduces AutoPrep, a novel automatic preprocessing framework for in-the-wild speech data, addressing noise, segmentation, and labeling challenges without manual annotation.
Findings
AutoPrep achieves comparable speech quality scores to open-source TTS datasets.
The framework enables high speaker similarity in TTS systems trained on preprocessed data.
AutoPrep effectively automates data cleaning and annotation for large-scale speech datasets.
Abstract
Recently, the utilization of extensive open-sourced text data has significantly advanced the performance of text-based large language models (LLMs). However, the use of in-the-wild large-scale speech data in the speech technology community remains constrained. One reason for this limitation is that a considerable amount of the publicly available speech data is compromised by background noise, speech overlapping, lack of speech segmentation information, missing speaker labels, and incomplete transcriptions, which can largely hinder their usefulness. On the other hand, human annotation of speech data is both time-consuming and costly. To address this issue, we introduce an automatic in-the-wild speech data preprocessing framework (AutoPrep) in this paper, which is designed to enhance speech quality, generate speaker labels, and produce transcriptions automatically. The proposed AutoPrep…
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
