An Online Attention-based Model for Speech Recognition
Ruchao Fan, Pan Zhou, Wei Chen, Jia Jia, Gang Liu

TL;DR
This paper introduces an online attention-based speech recognition model that balances real-time processing with high accuracy by combining latency-controlled bidirectional encoders and adaptive monotonic chunk-wise attention.
Contribution
The authors propose a novel online LAS model using latency-controlled bidirectional encoders and adaptive monotonic attention, reducing delay and enabling real-time speech recognition.
Findings
The online model shows only a 3.5% relative performance reduction compared with the offline LAS baseline.
Successfully developed an online LAS model suitable for real-time speech recognition.
Demonstrated effectiveness on an internal Mandarin speech corpus.
Abstract
Attention-based end-to-end models such as Listen, Attend and Spell (LAS) simplify the whole pipeline of traditional automatic speech recognition (ASR) systems and have become popular in the field of speech recognition. In previous work, researchers have shown that such architectures can achieve results comparable to state-of-the-art ASR systems, especially when using a bidirectional encoder and a global soft attention (GSA) mechanism. However, the bidirectional encoder and GSA are two obstacles to real-time speech recognition. In this work, we aim to make the LAS baseline streamable by removing these two obstacles. On the encoder side, we use a latency-controlled (LC) bidirectional structure to reduce the delay of forward computation. Meanwhile, an adaptive monotonic chunk-wise attention (AMoChA) mechanism is proposed to replace GSA for the calculation of the attention weight distribution. Furthermore, we…
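Illustration
The latency-controlled encoder mentioned in the abstract can be pictured as processing the utterance in fixed-size chunks, each with a small window of future frames as right context, so the backward direction never has to wait for the whole utterance. The following is a minimal sketch of that chunking step only, not the authors' implementation; the function name `lc_chunks` and the parameters `chunk_size` and `right_context` are hypothetical, and the values used are not taken from the paper.

```python
from typing import List, Tuple


def lc_chunks(num_frames: int, chunk_size: int, right_context: int) -> List[Tuple[int, int, int]]:
    """Split an utterance into latency-controlled chunks (illustrative only).

    Returns (start, end, lookahead_end) index triples. Frames [start, end)
    are the frames the chunk produces outputs for; frames [end, lookahead_end)
    are future frames that serve only as context for the backward direction.
    All names and sizes here are assumptions for illustration.
    """
    if num_frames < 0 or chunk_size <= 0 or right_context < 0:
        raise ValueError("num_frames and right_context must be >= 0, chunk_size > 0")
    chunks = []
    for start in range(0, num_frames, chunk_size):
        end = min(start + chunk_size, num_frames)
        # The lookahead is clipped at the end of the utterance.
        lookahead_end = min(end + right_context, num_frames)
        chunks.append((start, end, lookahead_end))
    return chunks


if __name__ == "__main__":
    # Hypothetical sizes: 10 frames, chunks of 4 frames, 2 lookahead frames.
    for start, end, lookahead_end in lc_chunks(10, 4, 2):
        print(f"outputs for frames [{start}, {end}), context up to frame {lookahead_end}")
```

Under this scheme the encoder delay is bounded by roughly one chunk plus its lookahead window rather than by the full utterance length, which is the property that makes a bidirectional encoder usable in a streaming setting.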
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Topic Modeling
