End-to-end acoustic modelling for phone recognition of young readers
Lucile Gelin, Morgane Daniel, Julien Pinquier, Thomas Pellegrini

TL;DR
This paper explores end-to-end acoustic models for phone recognition of young children's read speech, demonstrating that transfer learning with a Transformer+CTC model significantly improves accuracy despite limited child speech data, and analyzing model performance on different reading tasks.
Contribution
It introduces an effective transfer learning approach for child speech recognition using Transformer+CTC, achieving state-of-the-art results with limited data and analyzing model behavior on reading tasks.
Findings
Transformer+CTC achieves a 28.1% phone error rate (PER), outperforming a DNN-HMM baseline by 6.6% relative
Transfer learning enhances model performance with limited child speech data
CTC constrains attention to be monotonic, aiding mistake detection
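
PER is conventionally computed as the edit distance between the reference and recognized phone sequences, divided by the number of reference phones, and a gain over a baseline is often reported as a relative reduction. The following is a minimal sketch of both computations, not the paper's scoring tool; the function names, the toy phone sequences, and the example rates are hypothetical and chosen only for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference phone
                            curr[j - 1] + 1,      # insertion of a hypothesis phone
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def phone_error_rate(refs, hyps):
    """PER in percent over a set of utterances (lists of phone lists)."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return 100.0 * errors / total


def relative_reduction(baseline_per, new_per):
    """Relative PER reduction in percent with respect to a baseline."""
    return 100.0 * (baseline_per - new_per) / baseline_per


if __name__ == "__main__":
    # Hypothetical utterance: the recognizer drops the final phone.
    ref = ["b", "o", "Z", "u", "R"]
    hyp = ["b", "o", "Z", "u"]
    print(phone_error_rate([ref], [hyp]))
    # Hypothetical rates, not figures from the paper.
    print(relative_reduction(30.0, 27.0))
```

With this definition, a relative reduction is always smaller in absolute PER points than the same percentage would suggest, which is why stating "relative" or "absolute" alongside a gain matters.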
Abstract
Automatic recognition systems for child speech lag behind those dedicated to adult speech in terms of performance. This gap is due to the high acoustic and linguistic variability of child speech, caused by children's physical development, as well as to the lack of available child speech data. Young readers' speech additionally displays peculiarities, such as a slow reading rate and the presence of reading mistakes, that make the task harder. This work attempts to tackle the main challenges in phone acoustic modelling for young child speech with limited data, and to improve the understanding of the strengths and weaknesses of a wide selection of model architectures in this domain. We find that transfer learning techniques are highly effective on end-to-end architectures for adult-to-child adaptation with a small amount of child speech data. Through transfer learning, a Transformer model…
Taxonomy
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Attention Is All You Need · Layer Normalization · Residual Connection · Dropout · Adam · Label Smoothing
