The NTNU System at the Interspeech 2020 Non-Native Children's Speech ASR Challenge
Tien-Hong Lo, Fu-An Chao, Shi-Yan Weng, Berlin Chen

TL;DR
This paper presents the NTNU ASR system for the Interspeech 2020 Non-Native Children's Speech ASR Challenge. The system combines CNN-TDNNF acoustic models, data augmentation, and RNN language model rescoring to achieve competitive results.
Contribution
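The paper introduces an ASR system for non-native children's speech that works under closed-track, limited-data conditions. It is built on CNN-TDNNF acoustic models, utterance- and word-level speed perturbation, spectrogram augmentation, data cleansing, and RNN language model rescoring.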
Findings
The system achieved a word error rate (WER) of 17.59%, ranking second in the challenge.
Data augmentation strategies improved recognition accuracy.
The system outperformed the baseline and proved robust in challenging conditions.
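The findings are reported as word error rate. As a minimal sketch of how that metric is conventionally computed (word-level edit distance divided by the reference length), the following may help. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; this is not the challenge's official scoring tool, which may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / number of reference words.

    Hypothetical helper for illustration; tokenizes on whitespace only.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, this ratio can exceed 1.0 when the hypothesis is much longer than the reference.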
Abstract
This paper describes the NTNU ASR system participating in the Interspeech 2020 Non-Native Children's Speech ASR Challenge supported by the SIG-CHILD group of ISCA. This ASR shared task is made much more challenging due to the coexisting diversity of non-native and children speaking characteristics. In the setting of closed-track evaluation, all participants were restricted to develop their systems merely based on the speech and text corpora provided by the organizer. To work around this under-resourced issue, we built our ASR system on top of CNN-TDNNF-based acoustic models, meanwhile harnessing the synergistic power of various data augmentation strategies, including both utterance- and word-level speed perturbation and spectrogram augmentation, alongside a simple yet effective data-cleansing approach. All variants of our ASR system employed an RNN-based language model to rescore the…
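The two augmentation families named in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: `speed_perturb` resamples a waveform by linear interpolation at the utterance level, and `mask_spectrogram` zeroes one random time band and one random frequency band in the spirit of spectrogram augmentation. The function names, the mask widths, and the factors 0.9 and 1.1 are assumptions. Word-level speed perturbation would additionally need word boundaries from an alignment, which is not shown here.

```python
import random


def speed_perturb(samples, factor):
    """Resample so the audio plays `factor` times faster (hypothetical helper)."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    if len(samples) < 2:
        return list(samples)
    last = len(samples) - 1
    n_out = max(1, int(round(len(samples) / factor)))
    out = []
    for k in range(n_out):
        pos = min(k * factor, last)  # fractional read position in the input
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append((1 - frac) * samples[lo] + frac * samples[hi])
    return out


def mask_spectrogram(spec, max_time=2, max_freq=2, rng=None):
    """Zero one random time band and one random frequency band.

    `spec` is a list of frames, each a list of frequency bins (spec[t][f]).
    The input is left unmodified; a masked copy is returned.
    """
    rng = rng or random.Random()
    out = [row[:] for row in spec]
    n_t, n_f = len(out), len(out[0])
    t_width = rng.randint(0, min(max_time, n_t))
    t_start = rng.randint(0, n_t - t_width)
    f_width = rng.randint(0, min(max_freq, n_f))
    f_start = rng.randint(0, n_f - f_width)
    for t in range(n_t):
        for f in range(n_f):
            in_time_band = t_start <= t < t_start + t_width
            in_freq_band = f_start <= f < f_start + f_width
            if in_time_band or in_freq_band:
                out[t][f] = 0.0
    return out


if __name__ == "__main__":
    wave = [0.0] * 100
    # Assumed factors: slower (0.9) and faster (1.1) copies of the same utterance.
    print(len(speed_perturb(wave, 0.9)), len(speed_perturb(wave, 1.1)))
```

Each perturbed copy keeps the original transcript, so the amount of training audio grows without new labelled data, which matters in a closed-track setting.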
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and Audio Processing
