Streaming end-to-end speech recognition with jointly trained neural feature enhancement

Chanwoo Kim; Abhinav Garg; Dhananjaya Gowda; Seongkyu Mun; Changwoo Han

arXiv:2105.01254 · cs.SD · May 5, 2021

TL;DR

This paper introduces a streaming end-to-end speech recognition model trained jointly with enhancement layers, using curriculum-inspired training strategies to improve accuracy in noisy conditions.

Contribution

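The paper proposes two training strategies, Gradual Application of Enhanced Features (GAEF) and Gradual Reduction of Enhanced Loss (GREL), for training MoCha-based models on noisy speech. Both target the sensitivity of MoCha training to the difficulty of training examples and to hyper-parameters.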

Findings

01. Significant accuracy improvements over conventional methods.

02. Effective handling of noisy and far-field speech conditions.

03. Enhanced model robustness through curriculum-inspired training.

Abstract

In this paper, we present a streaming end-to-end speech recognition model based on Monotonic Chunkwise Attention (MoCha) jointly trained with enhancement layers. Even though the MoCha attention enables streaming speech recognition with recognition accuracy comparable to a full attention-based approach, training this model is sensitive to various factors such as the difficulty of training examples, hyper-parameters, and so on. Because of these issues, speech recognition accuracy of a MoCha-based model for clean speech drops significantly when a multi-style training approach is applied. Inspired by Curriculum Learning [1], we introduce two training strategies: Gradual Application of Enhanced Features (GAEF) and Gradual Reduction of Enhanced Loss (GREL). With GAEF, the model is initially trained using clean features. Subsequently, the portion of outputs from the enhancement layers…
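The abstract is cut off before it specifies how the two schedules evolve, so the following is only a minimal sketch of what curriculum-style schedules of this kind could look like, not the authors' method. The function names (`gaef_portion`, `select_features`, `grel_weight`), the linear ramp shapes, the per-utterance random selection, and the parameters `warmup_steps`, `ramp_steps`, and `decay_steps` are all assumptions introduced for illustration.

```python
import random


def gaef_portion(step, warmup_steps, ramp_steps):
    """Assumed GAEF-style schedule: fraction of inputs taken from the
    enhancement layers. Clean features only during warmup, then a
    linear ramp up to 1.0 (the linear shape is an assumption)."""
    if step < warmup_steps:
        return 0.0
    if ramp_steps <= 0:
        return 1.0
    return min(1.0, (step - warmup_steps) / ramp_steps)


def select_features(clean, enhanced, portion, rng):
    """For each utterance, use the enhanced features with probability
    `portion`, otherwise the clean features (hypothetical mixing rule)."""
    return [e if rng.random() < portion else c
            for c, e in zip(clean, enhanced)]


def grel_weight(step, decay_steps, initial_weight=1.0):
    """Assumed GREL-style schedule: weight on the enhancement loss,
    decayed linearly from `initial_weight` to 0.0 over `decay_steps`."""
    return max(0.0, initial_weight * (1.0 - step / decay_steps))


if __name__ == "__main__":
    rng = random.Random(0)
    clean = ["clean_0", "clean_1", "clean_2", "clean_3"]
    enhanced = ["enh_0", "enh_1", "enh_2", "enh_3"]
    for step in (0, 150, 300, 1000):
        p = gaef_portion(step, warmup_steps=100, ramp_steps=400)
        w = grel_weight(step, decay_steps=500)
        batch = select_features(clean, enhanced, p, rng)
        print(step, round(p, 3), round(w, 3), batch)
```

In a training loop, a total loss might be formed as the recognition loss plus `grel_weight(step, ...)` times an enhancement loss, with the recognizer fed the output of `select_features`; the actual formulation should be taken from the full paper.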

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing