Streaming end-to-end speech recognition with jointly trained neural feature enhancement
Chanwoo Kim, Abhinav Garg, Dhananjaya Gowda, Seongkyu Mun, and Changwoo Han

TL;DR
This paper introduces a streaming end-to-end speech recognition model that is jointly trained with enhancement layers, using curriculum-inspired training strategies to improve accuracy in noisy conditions.
Contribution
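It proposes two training strategies, Gradual Application of Enhanced Features (GAEF) and Gradual Reduction of Enhanced Loss (GREL), for training MoCha-based models on noisy speech, addressing the sensitivity of MoCha training to example difficulty and hyper-parameters.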
Findings
Significant accuracy improvements over conventional methods.
Effective handling of noisy and far-field speech conditions.
Enhanced model robustness through curriculum-inspired training.
Abstract
In this paper, we present a streaming end-to-end speech recognition model based on Monotonic Chunkwise Attention (MoCha) jointly trained with enhancement layers. Even though the MoCha attention enables streaming speech recognition with recognition accuracy comparable to a full attention-based approach, training this model is sensitive to various factors such as the difficulty of training examples, hyper-parameters, and so on. Because of these issues, speech recognition accuracy of a MoCha-based model for clean speech drops significantly when a multi-style training approach is applied. Inspired by Curriculum Learning [1], we introduce two training strategies: Gradual Application of Enhanced Features (GAEF) and Gradual Reduction of Enhanced Loss (GREL). With GAEF, the model is initially trained using clean features. Subsequently, the portion of outputs from the enhancement layers…
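The two strategies named in the abstract can be illustrated with a minimal sketch. The abstract is truncated here, so the details are assumptions rather than the authors' method: the linear ramp, the per-example random choice between clean and enhanced features, and the names `linear_ramp`, `gaef_select`, `grel_total_loss`, `start_step`, `end_step`, and `initial_weight` are all hypothetical, chosen only to show what "gradual application" and "gradual reduction" could look like in a training loop.

```python
import random


def linear_ramp(step, start_step, end_step):
    """Rise linearly from 0.0 to 1.0 between start_step and end_step (assumed schedule)."""
    if step <= start_step:
        return 0.0
    if step >= end_step:
        return 1.0
    return (step - start_step) / (end_step - start_step)


def gaef_select(clean_feat, enhanced_feat, step, start_step, end_step, rng=random):
    """GAEF-style choice (assumption): pick the enhancement-layer output with a
    probability that grows over training, so early steps see only clean features."""
    p_enhanced = linear_ramp(step, start_step, end_step)
    return enhanced_feat if rng.random() < p_enhanced else clean_feat


def grel_total_loss(asr_loss, enhancement_loss, step, start_step, end_step,
                    initial_weight=1.0):
    """GREL-style objective (assumption): the weight on the enhancement loss
    decays linearly from initial_weight to zero."""
    weight = initial_weight * (1.0 - linear_ramp(step, start_step, end_step))
    return asr_loss + weight * enhancement_loss
```

Under these assumptions, a training loop would call `gaef_select` once per example to decide which features feed the recognizer, and `grel_total_loss` once per step to combine the recognition and enhancement losses. The actual schedules and mixing rule are defined in the paper itself.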
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
