An Online Attention-based Model for Speech Recognition

Ruchao Fan; Pan Zhou; Wei Chen; Jia Jia; Gang Liu

arXiv:1811.05247 · cs.CL · April 26, 2019 · 6 cites

TL;DR

This paper introduces an online attention-based speech recognition model that balances real-time processing with high accuracy by combining latency-controlled bidirectional encoders and adaptive monotonic chunk-wise attention.

Contribution

The authors propose an online Listen, Attend and Spell (LAS) model that pairs a latency-controlled bidirectional encoder with adaptive monotonic chunk-wise attention (AMoChA), reducing delay and enabling real-time speech recognition.
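The latency-controlled encoder idea can be illustrated with a minimal sketch, assuming the usual chunked formulation: frames are processed in fixed-size chunks, and the backward direction sees only a bounded number of future frames past each chunk. The function name `lc_chunks` and the parameters `chunk_size` and `right_context` are hypothetical; the page does not state the paper's actual configuration.

```python
from typing import List, Tuple


def lc_chunks(num_frames: int, chunk_size: int, right_context: int) -> List[Tuple[int, int, int]]:
    """Split frames into latency-controlled windows.

    Returns (start, end, context_end) tuples with exclusive upper bounds:
    frames [start, end) are the chunk whose outputs are emitted, and
    frames [end, context_end) are the limited future context that the
    backward direction is allowed to see.
    """
    if chunk_size <= 0 or right_context < 0:
        raise ValueError("chunk_size must be positive and right_context non-negative")
    windows = []
    for start in range(0, num_frames, chunk_size):
        end = min(start + chunk_size, num_frames)
        # Look-ahead is capped, so delay is bounded by the chunk and
        # context sizes rather than by the utterance length.
        context_end = min(end + right_context, num_frames)
        windows.append((start, end, context_end))
    return windows


# Hypothetical sizes, chosen only for illustration.
print(lc_chunks(10, 4, 2))  # [(0, 4, 6), (4, 8, 10), (8, 10, 10)]
```

Under these assumptions, the encoder never waits for more than `chunk_size + right_context` frames before it can produce output for a chunk, which is what separates it from a fully bidirectional encoder.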

Findings

01

Achieved only a 3.5% relative performance reduction compared with the offline LAS baseline.

02

Successfully developed an online LAS model suitable for real-time speech recognition.

03

Demonstrated effectiveness on an internal Mandarin speech corpus.

Abstract

Attention-based end-to-end models such as Listen, Attend and Spell (LAS) simplify the whole pipeline of traditional automatic speech recognition (ASR) systems and have become popular in the field of speech recognition. Previous work has shown that such architectures can achieve results comparable to state-of-the-art ASR systems, especially when using a bidirectional encoder and a global soft attention (GSA) mechanism. However, the bidirectional encoder and GSA are two obstacles to real-time speech recognition. In this work, we aim to make the LAS baseline streamable by removing these two obstacles. On the encoder side, we use a latency-controlled (LC) bidirectional structure to reduce the delay of forward computation. Meanwhile, an adaptive monotonic chunk-wise attention (AMoChA) mechanism is proposed to replace GSA for the calculation of the attention weight distribution. Furthermore, we…
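As a rough illustration of how chunk-wise attention avoids attending over the whole utterance, here is a minimal sketch of one decoding step. It assumes the general test-time recipe for monotonic chunk-wise attention: scan forward from the previous stopping point, stop at the first frame whose selection probability reaches 0.5, then apply a softmax over a short window ending there. The abstract does not give these details, so the function `chunkwise_attend`, its arguments, and the fixed `width` are hypothetical. In AMoChA the width would be adapted per step, which is not modeled here.

```python
import math
from typing import List, Optional, Tuple


def chunkwise_attend(
    select_energies: List[float],
    chunk_energies: List[float],
    prev_pos: int,
    width: int,
) -> Optional[Tuple[int, int, List[float]]]:
    """One decoding step of a chunk-wise monotonic attention sketch.

    Returns (stop_pos, window_start, weights), where weights is a softmax
    over chunk_energies[window_start : stop_pos + 1], or None if no frame
    is selected at or after prev_pos.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    for j in range(prev_pos, len(select_energies)):
        # sigmoid(e) >= 0.5 is equivalent to e >= 0, so no exp is needed.
        if select_energies[j] >= 0.0:
            window_start = max(0, j - width + 1)
            window = chunk_energies[window_start : j + 1]
            peak = max(window)  # subtract the max for numerical stability
            exps = [math.exp(e - peak) for e in window]
            total = sum(exps)
            return j, window_start, [x / total for x in exps]
    return None


# Hypothetical energies, for illustration only.
step = chunkwise_attend([-2.0, -1.0, 0.5, 3.0], [0.0, 0.0, 0.0, 0.0], prev_pos=0, width=2)
print(step)  # (2, 1, [0.5, 0.5])
```

Because each step only reads frames up to the stopping point, the decoder can run while audio is still arriving, whereas global soft attention needs every encoder output before it can produce the first token.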

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Topic Modeling