Hierarchical Multitask Learning for CTC-based Speech Recognition
Kalpesh Krishna, Shubham Toshniwal, Karen Livescu

TL;DR
This paper investigates hierarchical multitask learning in CTC-based speech recognition, demonstrating that combining it with pretraining yields significant improvements in word error rates, especially in high-data scenarios.
Contribution
The paper applies hierarchical multitask learning, previously shown to help encoder-decoder models, to CTC-based speech recognition. It shows that the approach is effective compared to standard multitask training, especially when combined with pretraining.
Findings
Hierarchical multitask learning improves WER over standard multitask training.
Combining pretraining with hierarchical multitask learning yields a 3.4% absolute WER reduction.
Effectiveness varies with data resource levels, favoring hierarchical methods in high-resource settings.
Abstract
Previous work has shown that neural encoder-decoder speech recognition can be improved with hierarchical multitask learning, where auxiliary tasks are added at intermediate layers of a deep encoder. We explore the effect of hierarchical multitask learning in the context of connectionist temporal classification (CTC)-based speech recognition, and investigate several aspects of this approach. Consistent with previous work, we observe performance improvements on telephone conversational speech recognition (specifically the Eval2000 test sets) when training a subword-level CTC model with an auxiliary phone loss at an intermediate layer. We analyze the effects of a number of experimental variables (like interpolation constant and position of the auxiliary loss function), performance in lower-resource settings, and the relationship between pretraining and multitask learning. We observe that…
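The training objective described in the abstract (a subword-level CTC loss at the top of the encoder plus an auxiliary phone-level CTC loss at an intermediate layer, weighted by an interpolation constant) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the pure-Python forward algorithm, and the specific interpolation form (1 - lam) * main + lam * aux are assumptions made here for clarity, and the encoder that would produce the per-frame log-probabilities is omitted.

```python
import math

NEG_INF = float("-inf")


def logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) over the given values."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ctc_nll(log_probs, target, blank=0):
    """Negative log-likelihood of `target` under CTC (forward algorithm).

    log_probs: list of T frames, each a list of per-label log-probabilities.
    target:    non-empty list of label ids, without blanks.
    """
    # Interleave blanks: [a, b] -> [blank, a, blank, b, blank]
    ext = [blank]
    for label in target:
        ext += [label, blank]
    size = len(ext)

    # A valid alignment starts in the leading blank or the first label.
    alpha = [NEG_INF] * size
    alpha[0] = log_probs[0][ext[0]]
    alpha[1] = log_probs[0][ext[1]]

    for t in range(1, len(log_probs)):
        new = [NEG_INF] * size
        for s in range(size):
            terms = [alpha[s]]              # stay on the same symbol
            if s >= 1:
                terms.append(alpha[s - 1])  # advance by one position
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[s - 2])
            new[s] = logsumexp(*terms) + log_probs[t][ext[s]]
        alpha = new

    # A valid alignment ends in the last label or the trailing blank.
    return -logsumexp(alpha[-1], alpha[-2])


def hierarchical_multitask_loss(top_log_probs, subword_target,
                                mid_log_probs, phone_target, lam=0.5):
    """Interpolate the main and auxiliary CTC losses (assumed form).

    top_log_probs: per-frame log-probs over subwords from the final layer.
    mid_log_probs: per-frame log-probs over phones from an intermediate layer.
    lam:           interpolation constant (hypothetical name) in [0, 1].
    """
    main = ctc_nll(top_log_probs, subword_target)
    aux = ctc_nll(mid_log_probs, phone_target)
    return (1.0 - lam) * main + lam * aux
```

In this sketch, setting lam to 0 recovers plain subword-level CTC training. Moving the auxiliary loss to a different encoder layer corresponds to changing which layer's outputs are passed as mid_log_probs.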
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and Audio Processing
