Attention-based gated scaling adaptative acoustic model for ctc-based speech recognition

Fenglin Ding; Wu Guo; Lirong Dai; Jun Du

arXiv:1912.13307·eess.AS·January 1, 2020

Open Access

TL;DR

This paper introduces an attention-based gated scaling (AGS) method for CTC-based speech recognition that improves accuracy without additional speaker information, reaching a 7.94% character error rate (CER) on the Mandarin AISHELL-1 dataset.

Contribution

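The paper presents a novel attention-based gated scaling (AGS) scheme that improves deep feature learning in CTC acoustic models by scaling each hidden layer's output with a gate derived from the lower layer through attention. The gating layer is trained jointly with the main network, without second-pass model training or additional speaker information.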

Findings

01

Achieved a 7.94% character error rate (CER) on the Mandarin AISHELL-1 dataset.

02

To the authors' knowledge, the best recognition accuracy reported on AISHELL-1 with an end-to-end framework.

03

Demonstrated effectiveness of attention-based gating in acoustic modeling.
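
For readers unfamiliar with the metric, CER is conventionally the character-level edit distance between the reference and the hypothesis, divided by the reference length. The sketch below illustrates that standard definition only; the function names `edit_distance` and `cer` are hypothetical, and the scoring script actually used by the authors is not described on this page.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitution, insertion, deletion all cost 1)."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference symbols to match an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by the reference length."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# One substituted character in a six-character Mandarin reference.
example_cer = cer("今天天气很好", "今天天汽很好")
```

Here `example_cer` is 1/6, since a single substitution is needed to turn the hypothesis into the six-character reference. A corpus-level figure such as the one reported would normally sum the edit distances and reference lengths over all test utterances before dividing.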

Abstract

In this paper, we propose a novel adaptive technique that uses an attention-based gated scaling (AGS) scheme to improve deep feature learning for connectionist temporal classification (CTC) acoustic modeling. In AGS, the outputs of each hidden layer of the main network are scaled by an auxiliary gate matrix extracted from the lower layer by using attention mechanisms. Furthermore, the auxiliary AGS layer and the main network are jointly trained without requiring second-pass model training or additional speaker information, such as speaker code. On the Mandarin AISHELL-1 dataset, the proposed AGS yields a 7.94% character error rate (CER). To the best of our knowledge, this result is the best recognition accuracy achieved on this dataset by using an end-to-end framework.
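
The gating idea in the abstract can be illustrated with a small NumPy forward pass. This is a minimal sketch under stated assumptions, not the authors' architecture: the abstract does not give the attention form, so scaled dot-product self-attention over time, a sigmoid gate, a `tanh` feed-forward layer standing in for the main network, and all names (`attention_gate`, `ags_layer`, `w_q`, `w_k`, `w_v`, `w_g`) are assumptions chosen for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def attention_gate(lower, w_q, w_k, w_v, w_g):
    """Build a gate matrix (T, d_out) from the lower layer's output (T, d_in).

    Assumption: self-attention over time frames, then a linear map and a sigmoid.
    """
    q, k, v = lower @ w_q, lower @ w_k, lower @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])      # (T, T) frame-to-frame scores
    context = softmax(scores, axis=-1) @ v       # (T, d_att) attended summary
    return sigmoid(context @ w_g)                # gate values in (0, 1)


def ags_layer(lower, w_main, gate_params):
    """Scale a main-network hidden layer elementwise by the attention-derived gate."""
    hidden = np.tanh(lower @ w_main)             # stand-in for the main hidden layer
    gate = attention_gate(lower, *gate_params)
    return hidden * gate, gate


# Toy dimensions: 5 frames, 3 input features, 8 attention units, 4 hidden units.
rng = np.random.default_rng(0)
T, d_in, d_att, d_out = 5, 3, 8, 4
lower = rng.standard_normal((T, d_in))
w_main = 0.1 * rng.standard_normal((d_in, d_out))
gate_params = (
    0.1 * rng.standard_normal((d_in, d_att)),    # w_q
    0.1 * rng.standard_normal((d_in, d_att)),    # w_k
    0.1 * rng.standard_normal((d_in, d_att)),    # w_v
    0.1 * rng.standard_normal((d_att, d_out)),   # w_g
)
scaled, gate = ags_layer(lower, w_main, gate_params)
```

Because the gate is computed from the same lower-layer output that feeds the main layer, both sets of weights sit in one computation graph, which is consistent with the joint training described in the abstract: a single CTC loss on the final output could update the main and gating parameters together, with no separate second pass.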

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing