FASA: a Flexible and Automatic Speech Aligner for Extracting High-quality Aligned Children Speech Data
Dancheng Liu, Jinjun Xiong

TL;DR
FASA is a new automatic speech aligner designed to extract high-quality aligned children's speech data from noisy datasets, significantly improving data quality and aiding children's ASR development.
Contribution
The paper introduces FASA, a flexible and automatic forced-alignment tool specifically tailored for children's speech, addressing limitations of existing tools and enhancing data quality.
Findings
The paper reports that FASA improves data quality by a factor of 13.6 relative to human annotations.
FASA effectively extracts high-quality aligned children's speech data from noisy datasets.
Application to the CHILDES dataset demonstrates FASA's practical utility.
Abstract
Automatic Speech Recognition (ASR) for adult speech has recently made significant progress by employing deep neural network (DNN) models, but progress on children's speech remains unsatisfactory because of its distinct characteristics. DNN models pre-trained on adult data often struggle to generalize to children's speech with fine-tuning because high-quality aligned children's speech data is lacking. When generating datasets, human annotation is not scalable, and existing forced-alignment tools are not usable because they make impractical assumptions about the quality of the input transcriptions. To address these challenges, we propose a new forced-alignment tool, FASA, a flexible and automatic speech aligner that extracts high-quality aligned children's speech data from many of the existing noisy children's speech datasets. We demonstrate its usage on the CHILDES dataset…
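The abstract says the input transcriptions can be noisy, but this summary does not describe how FASA handles that. As a rough illustration of the general problem only, and not of FASA's actual pipeline, the sketch below keeps a transcript segment only when it roughly agrees with an independent ASR hypothesis for the same audio, measured by word error rate. The function names, the 0.2 threshold, and the example sentences are all hypothetical.

```python
from typing import List, Tuple


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # prev[j] is the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


def keep_consistent_segments(
    pairs: List[Tuple[str, str]], max_wer: float = 0.2
) -> List[Tuple[str, str]]:
    """Keep (transcript, asr_hypothesis) pairs whose WER is at most max_wer.

    max_wer is an illustrative threshold, not a value taken from the paper.
    """
    return [(t, h) for t, h in pairs if word_error_rate(t, h) <= max_wer]


# Hypothetical segments: the second transcript does not match its audio.
segments = [
    ("i want the red ball", "i want the red ball"),
    ("look at the doggy", "we went to school today"),
]
clean = keep_consistent_segments(segments)
```

A filter of this kind trades recall for precision: a stricter threshold discards more usable audio but leaves fewer mismatched pairs in the extracted set.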
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Speech and Audio Processing
