Attention-based gated scaling adaptative acoustic model for CTC-based speech recognition
Fenglin Ding, Wu Guo, Lirong Dai, Jun Du

TL;DR
This paper introduces an attention-based gated scaling (AGS) method for CTC-based speech recognition. It needs no additional speaker information and reaches a 7.94% character error rate (CER) on Mandarin AISHELL-1, which the authors report as the best end-to-end result on that dataset to their knowledge.
Contribution
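The paper presents an attention-based gated scaling (AGS) scheme that improves deep feature learning in CTC acoustic models: the output of each hidden layer is scaled by a gate derived from the lower layer through attention. The AGS layer is trained jointly with the main network, with no second-pass training and no additional speaker information such as speaker codes.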
Findings
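Achieved a 7.94% character error rate (CER) on the Mandarin AISHELL-1 dataset.
According to the authors, this is, to the best of their knowledge, the best accuracy reported on AISHELL-1 with an end-to-end framework.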
Demonstrated effectiveness of attention-based gating in acoustic modeling.
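For readers unfamiliar with the metric, CER is conventionally the character-level edit distance between hypothesis and reference, summed over all utterances and divided by the total number of reference characters. The sketch below illustrates that convention only; the function names and the example strings are hypothetical, and the scoring tooling the authors actually used is not stated in this summary.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, using two DP rows."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(refs, hyps):
    """Corpus-level CER: total character errors / total reference characters."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return errors / total


# Hypothetical utterance: one substituted character out of six.
example_cer = cer(["今天天气很好"], ["今天天汽很好"])
```

Under this definition, a 7.94% CER corresponds to roughly 8 character errors (substitutions, insertions, or deletions) per 100 reference characters.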
Abstract
In this paper, we propose a novel adaptive technique that uses an attention-based gated scaling (AGS) scheme to improve deep feature learning for connectionist temporal classification (CTC) acoustic modeling. In AGS, the outputs of each hidden layer of the main network are scaled by an auxiliary gate matrix extracted from the lower layer by using attention mechanisms. Furthermore, the auxiliary AGS layer and the main network are jointly trained without requiring second-pass model training or additional speaker information, such as speaker code. On the Mandarin AISHELL-1 dataset, the proposed AGS yields a 7.94% character error rate (CER). To the best of our knowledge, this result is the best recognition accuracy achieved on this dataset by using an end-to-end framework.
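The gating idea in the abstract can be illustrated with a rough NumPy sketch of one gated layer. The abstract does not give the exact formulation, so everything concrete here is an assumption made for illustration: the feed-forward main layer with ReLU, scaled dot-product self-attention over time, the sigmoid gate, and all names (`ags_layer`, `init_params`, `Wq`, `Wk`, `Wg`). It should not be read as the authors' architecture.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_params(rng, d_in, d_out, d_att):
    """Random weights for one hypothetical gated layer."""
    scale = 0.1
    return {
        "W": rng.normal(0, scale, (d_in, d_out)),   # main layer weights
        "b": np.zeros(d_out),
        "Wq": rng.normal(0, scale, (d_in, d_att)),  # attention query projection
        "Wk": rng.normal(0, scale, (d_in, d_att)),  # attention key projection
        "Wg": rng.normal(0, scale, (d_in, d_out)),  # gate projection
        "bg": np.zeros(d_out),
    }


def ags_layer(h_lower, p):
    """Scale a hidden layer's output by a gate computed from the lower layer.

    h_lower: (T, d_in) lower-layer outputs for T frames.
    Returns (gated_output, gate), both of shape (T, d_out).
    """
    # Main network: an ordinary hidden layer (assumed feed-forward + ReLU).
    h_main = np.maximum(0.0, h_lower @ p["W"] + p["b"])

    # Auxiliary branch: attend over the lower layer's frames (assumed
    # scaled dot-product self-attention), then map the context to a gate.
    q = h_lower @ p["Wq"]
    k = h_lower @ p["Wk"]
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (T, T)
    context = attn @ h_lower                                 # (T, d_in)
    gate = sigmoid(context @ p["Wg"] + p["bg"])              # values in (0, 1)

    # Gated scaling: element-wise product of gate matrix and layer output.
    return gate * h_main, gate


rng = np.random.default_rng(0)
params = init_params(rng, d_in=40, d_out=64, d_att=32)
frames = rng.normal(size=(100, 40))  # 100 frames of 40-dim features
gated, gate = ags_layer(frames, params)
```

Because the gate is an ordinary differentiable function of the lower layer, it can in principle be trained together with the main network by backpropagating the CTC loss through both branches, which is consistent with the joint training described in the abstract.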
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
