Speaker- and Age-Invariant Training for Child Acoustic Modeling Using Adversarial Multi-Task Learning
Mostafa Shahin, Beena Ahmed, and Julien Epps

TL;DR
This paper introduces an adversarial multi-task learning approach to develop child speech acoustic models that are invariant to speaker and age variations, improving speech recognition accuracy.
Contribution
It proposes a novel adversarial multi-task training method in which one shared generator network feeds three discrimination networks (phoneme, age, and speaker) to handle the high variability of child speech.
Findings
Achieved a 13% reduction in word error rate (WER) on the OGI speech corpus
Demonstrated effectiveness of adversarial multi-task learning for speaker and age invariance
Improved robustness of child speech recognition systems
Abstract
One of the major challenges in acoustic modelling of child speech is the rapid changes that occur in the children's articulators as they grow up, their differing growth rates and the subsequent high variability in the same age group. These high acoustic variations along with the scarcity of child speech corpora have impeded the development of a reliable speech recognition system for children. In this paper, a speaker- and age-invariant training approach based on adversarial multi-task learning is proposed. The system consists of one generator shared network that learns to generate speaker- and age-invariant features connected to three discrimination networks, for phoneme, age, and speaker. The generator network is trained to minimize the phoneme-discrimination loss and maximize the speaker- and age-discrimination losses in an adversarial multi-task learning fashion. The generator…
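The training objective described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the weights `lam_spk` and `lam_age`, and the toy posteriors are all assumptions made for illustration, and the excerpt does not say how the losses are weighted or optimised. The sketch only shows the sign convention: the generator is rewarded for a low phoneme loss and for high speaker and age losses, while the speaker and age discriminators try to keep their own losses low.

```python
import math


def cross_entropy(probs, target):
    """Negative log-likelihood of the target class under a probability vector."""
    return -math.log(probs[target])


def generator_objective(phoneme_loss, speaker_loss, age_loss,
                        lam_spk=1.0, lam_age=1.0):
    """Quantity the shared generator would minimise (hypothetical weighting).

    Minimising this lowers the phoneme loss while raising the speaker and
    age losses, which is the adversarial part of the multi-task setup.
    """
    return phoneme_loss - lam_spk * speaker_loss - lam_age * age_loss


def discriminator_objective(speaker_loss, age_loss):
    """Quantity the speaker and age discriminators would minimise."""
    return speaker_loss + age_loss


# Toy posteriors for a single frame (illustrative values only).
phoneme_loss = cross_entropy([0.7, 0.2, 0.1], 0)           # 3 phoneme classes
speaker_loss = cross_entropy([0.25, 0.25, 0.25, 0.25], 2)  # 4 speakers
age_loss = cross_entropy([0.5, 0.5], 1)                    # 2 age groups

gen = generator_objective(phoneme_loss, speaker_loss, age_loss)
disc = discriminator_objective(speaker_loss, age_loss)
```

In this toy frame the speaker and age posteriors are uniform, which is the outcome the generator is pushing for: features from which neither speaker identity nor age group can be predicted better than chance. In a neural implementation, a sign flip of this kind is often realised with a gradient reversal layer between the generator and the adversarial branches; whether the paper does so is not stated in this excerpt.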
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
