End-to-end Speech Recognition with Adaptive Computation Steps
Mohan Li, Min Liu, Masanori Hattori

TL;DR
This paper introduces the Adaptive Computation Steps (ACS) algorithm for end-to-end speech recognition, which lets a model decide dynamically how many frames to process before predicting each linguistic output. It applies to both online and offline recognition and reports lower CER than an attention-based model on AIShell-1.
Contribution
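The paper proposes the ACS algorithm, which lets an encoder-decoder speech recognition model decide dynamically how many frames to process per output. Unlike attention-based models, it produces alignments independently at the encoder side, which supports online recognition, and a small change to the decoding stage lets predictions exploit bidirectional contexts.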
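The idea of deciding how many frames to consume per output can be illustrated with a toy sketch. This is not the authors' method: the abstract only says that alignments come from the correlation between adjacent frames, and gives no formula. The sketch below assumes, purely for illustration, that each frame carries a halting probability and that an output is emitted once the accumulated probability reaches a threshold, a rule borrowed from Adaptive Computation Time. The names `segment_frames`, `halting_probs`, and `threshold` are hypothetical.

```python
def segment_frames(halting_probs, threshold=1.0):
    """Group frame indices into segments, one segment per predicted output.

    ASSUMPTION: accumulate-and-threshold halting, used here only as an
    illustration; the summarized paper may define the rule differently.
    halting_probs: per-frame values in [0, 1] (hypothetical encoder outputs).
    Returns a list of (first_frame, last_frame) index pairs, inclusive.
    """
    segments = []
    start = 0   # first frame of the segment currently being accumulated
    acc = 0.0   # accumulated halting probability for the current segment
    for t, p in enumerate(halting_probs):
        acc += p
        if acc >= threshold:
            # Enough acoustic evidence: close the segment at frame t.
            segments.append((start, t))
            start = t + 1
            acc = 0.0
    # Trailing frames that never reach the threshold are left unassigned.
    return segments


if __name__ == "__main__":
    print(segment_frames([0.2, 0.3, 0.6, 0.5, 0.5, 0.1]))
```

Because each segment is closed using only frames seen so far, a rule of this kind can run left to right as audio arrives, which is the property that makes online recognition possible.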
Findings
Achieves 31.2% CER in the online setting, versus 32.4% for the attention-based model
Attains 18.7% CER in the offline setting, versus 22.0% for the attention-based model
Results are reported on the Mandarin speech corpus AIShell-1
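For readers unfamiliar with the metric, CER (character error rate) is conventionally the character-level edit distance between the reference and the hypothesis, divided by the number of reference characters. The following is a minimal sketch of that standard definition; the function names `edit_distance` and `cer` and the example strings are illustrative, and this is not the scoring script used in the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))  # distances for the empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference characters
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    if len(ref) == 0:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One substituted character out of six reference characters.
    print(cer("今天天气很好", "今天天汽很好"))
```

Under this definition, a CER of 31.2% means roughly 31 character edits per 100 reference characters.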
Abstract
In this paper, we present Adaptive Computation Steps (ACS) algorithm, which enables end-to-end speech recognition models to dynamically decide how many frames should be processed to predict a linguistic output. The model that applies ACS algorithm follows the encoder-decoder framework, while unlike the attention-based models, it produces alignments independently at the encoder side using the correlation between adjacent frames. Thus, predictions can be made as soon as sufficient acoustic information is received, which makes the model applicable in online cases. Besides, a small change is made to the decoding stage of the encoder-decoder framework, which allows the prediction to exploit bidirectional contexts. We verify the ACS algorithm on a Mandarin speech corpus AIShell-1, and it achieves a 31.2% CER in the online occasion, compared to the 32.4% CER of the attention-based model. To…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
