Low Latency End-to-End Streaming Speech Recognition with a Scout Network
Chengyi Wang, Yu Wu, Shujie Liu, Jinyu Li, Liang Lu, Guoli Ye, Ming Zhou

TL;DR
This paper introduces a low-latency streaming Transformer-based speech recognition model using a scout network to detect word boundaries without future context, significantly reducing latency while maintaining high accuracy.
Contribution
A novel streaming Transformer architecture with a scout network for boundary detection, enabling low-latency recognition without future frame look-ahead.
Findings
Achieves 2.7/6.4 WER on the Librispeech test-clean/test-other sets.
Operates with only 639 ms latency.
Outperforms previous streaming models in accuracy and speed.
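For readers unfamiliar with the metric in the findings above, word error rate (WER) is conventionally the word-level edit distance between reference and hypothesis, divided by the number of reference words; the 2.7/6.4 figures are presumably this ratio expressed in percent. The function below is a minimal illustrative sketch of that standard definition. It is not the paper's scoring script, and the name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: real scoring tools also normalize text
    (casing, punctuation) before comparing.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over 6 reference words.
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"WER = {100 * wer:.1f}%")
```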
Abstract
The attention-based Transformer model has achieved promising results for speech recognition (SR) in the offline mode. However, in the streaming mode, the Transformer model usually incurs significant latency to maintain its recognition accuracy when applying a fixed-length look-ahead window in each encoder layer. In this paper, we propose a novel low-latency streaming approach for Transformer models, which consists of a scout network and a recognition network. The scout network detects the whole word boundary without seeing any future frames, while the recognition network predicts the next subword by utilizing the information from all the frames before the predicted boundary. Our model achieves the best performance (2.7/6.4 WER) with only 639 ms latency on the test-clean and test-other data sets of Librispeech.
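The division of labour described in the abstract can be sketched as a simple loop. This is a toy illustration under stated assumptions, not the authors' implementation: the scout network is stood in for by a precomputed list of per-frame boundary probabilities, the recognition network by an arbitrary callback, and the names `detect_boundaries`, `streaming_decode`, and the `threshold` parameter are all hypothetical. It also emits tokens once per detected boundary, which simplifies the subword-level prediction the abstract describes.

```python
from typing import Callable, List, Sequence


def detect_boundaries(scout_probs: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Return exclusive end indices of frames flagged as word boundaries.

    Each decision uses only the probability at frame t, standing in for a
    scout network that never looks at future frames (assumed interface).
    """
    return [t + 1 for t, p in enumerate(scout_probs) if p >= threshold]


def streaming_decode(
    frames: Sequence[object],
    scout_probs: Sequence[float],
    recognize: Callable[[Sequence[object], List[str]], List[str]],
    threshold: float = 0.5,
) -> List[str]:
    """Emit tokens each time the scout signals a boundary.

    `recognize` is a hypothetical stand-in for the recognition network: it
    receives every frame before the predicted boundary plus the tokens
    emitted so far, and returns the newly predicted tokens.
    """
    tokens: List[str] = []
    for end in detect_boundaries(scout_probs, threshold):
        visible = frames[:end]  # no frame past the predicted boundary is used
        tokens.extend(recognize(visible, tokens))
    return tokens


if __name__ == "__main__":
    frames = list(range(10))
    scout_probs = [0.1, 0.2, 0.9, 0.1, 0.1, 0.8, 0.1, 0.1, 0.1, 0.95]
    # Dummy recognizer: labels each word by how many frames were visible.
    dummy = lambda visible, history: [f"w{len(visible)}"]
    print(streaming_decode(frames, scout_probs, dummy))
```

In this sketch the latency is governed by how soon after a word ends the boundary is flagged, rather than by a fixed look-ahead window in each encoder layer.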
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
