Using multi-task learning to improve the performance of acoustic-to-word and conventional hybrid models
Thai-Son Nguyen, Sebastian Stueker, Alex Waibel

TL;DR
This paper introduces a multi-task learning approach that trains acoustic-to-word (A2W) and conventional hybrid speech recognition models jointly in a shared network. Joint training makes A2W training more stable without pre-training initialization, and it further improves the hybrid model when combined with sequence-level optimization.
Contribution
The paper proposes a multi-task training framework that stabilizes acoustic-to-word (A2W) model training, removing the need for pre-training initialization, and improves hybrid model performance.
Findings
Multi-task training improves A2W model stability and accuracy.
Joint training with sequence-level optimization further improves hybrid model performance.
The multi-task A2W model shows a significant improvement over a baseline model, without pre-training initialization.
Abstract
Acoustic-to-word (A2W) models that allow direct mapping from acoustic signals to word sequences are an appealing approach to end-to-end automatic speech recognition due to their simplicity. However, prior works have shown that modelling A2W typically encounters issues of data sparsity that prevent training such a model directly. So far, pre-training initialization is the only approach proposed to deal with this issue. In this work, we propose to build a shared neural network and optimize A2W and conventional hybrid models in a multi-task manner. Our results show that training an A2W model is much more stable with our multi-task model without pre-training initialization, and results in a significant improvement compared to a baseline model. Experiments also reveal that the performance of a hybrid acoustic model can be further improved when jointly training with a sequence-level…
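The idea of a shared network optimized for two tasks at once can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function and parameter names (`multitask_loss`, `shared_w`, `word_w`, `state_w`, `alpha`) are hypothetical, the interpolation weight `alpha` does not come from the paper, and the per-frame cross-entropy used for both output layers is a simplification chosen for brevity rather than the paper's actual training criteria.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the target class."""
    return -math.log(softmax(logits)[target])


def linear(weights, x):
    """Matrix-vector product; `weights` is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def multitask_loss(x, shared_w, word_w, state_w,
                   word_target, state_target, alpha=0.5):
    """Combine two task losses computed from one shared hidden layer.

    Returns (combined_loss, word_loss, state_loss).
    """
    # Shared layer: both tasks read the same hidden representation,
    # so gradients from either loss would update `shared_w`.
    hidden = [math.tanh(h) for h in linear(shared_w, x)]

    # Task-specific output layers (hypothetical): one over word units,
    # one over the hybrid model's state targets.
    word_loss = cross_entropy(linear(word_w, hidden), word_target)
    state_loss = cross_entropy(linear(state_w, hidden), state_target)

    # Interpolated objective; `alpha` is an assumed weighting.
    combined = alpha * word_loss + (1.0 - alpha) * state_loss
    return combined, word_loss, state_loss
```

With all weights set to zero, each output layer is uniform, so a 4-class word layer contributes log 4 and a 3-class state layer contributes log 3. That makes the combined value easy to check by hand before plugging in real parameters.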
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
