Online Automatic Speech Recognition with Listen, Attend and Spell Model
Roger Hsiao, Dogan Can, Tim Ng, Ruchir Travadi, Arnab Ghoshal

TL;DR
This paper analyzes the limitations of attention-based LAS models in online speech recognition and proposes a simple technique to achieve fully online recognition with high accuracy and low latency, validated in a large-scale deployment.
Contribution
It introduces a novel, simple technique that enables fully online LAS speech recognition while meeting accuracy and latency targets, validated through a production deployment.
Findings
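Online character error rate (CER) on the Mandarin dictation task is within 4% relative of an offline LAS model
Operates at 12% lower latency than a conventional neural network hidden Markov model hybrid of comparable accuracy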
First large-scale deployment of a fully online LAS model
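To make "within 4% relative" concrete, the sketch below computes a character error rate from a character-level edit distance and then the relative gap between two CER values. The function names, the sample sentences, and the CER figures (5.0% and 5.2%) are hypothetical illustrations, not numbers or code from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (two-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (cost 0 on a match)
            ))
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by the reference length."""
    return edit_distance(ref, hyp) / len(ref)


def relative_increase(baseline, candidate):
    """Relative change of candidate with respect to baseline."""
    return (candidate - baseline) / baseline


# Hypothetical example: one substituted character in a six-character reference.
print(cer("今天天气很好", "今天天汽很好"))

# Hypothetical CERs: offline 5.0%, online 5.2%, i.e. about 4% relative degradation.
print(relative_increase(0.050, 0.052))
```

A relative figure depends on the baseline: the same 0.2-point absolute gap would be a 2% relative difference against a 10% baseline CER.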
Abstract
The Listen, Attend and Spell (LAS) model and other attention-based automatic speech recognition (ASR) models have known limitations when operated in a fully online mode. In this paper, we analyze the online operation of LAS models to demonstrate that these limitations stem from the handling of silence regions and the reliability of the online attention mechanism at the edge of input buffers. We propose a novel and simple technique that can achieve fully online recognition while meeting accuracy and latency targets. For the Mandarin dictation task, our proposed approach can achieve a character error rate in online operation that is within 4% relative to an offline LAS model. The proposed online LAS model operates at 12% lower latency relative to a conventional neural network hidden Markov model hybrid of comparable accuracy. We have validated the proposed method through a production scale…
