Minimum Latency Training Strategies for Streaming Sequence-to-Sequence ASR
Hirofumi Inaguma, Yashesh Gaur, Liang Lu, Jinyu Li, Yifan Gong

TL;DR
This paper introduces training strategies leveraging external alignments to reduce latency in streaming sequence-to-sequence speech recognition models, improving real-time performance without sacrificing accuracy.
Contribution
It proposes novel training methods using external alignments, including multi-task learning, pre-training, alignment path pruning, and latency loss minimization, to reduce inference latency in streaming ASR models.
Findings
Significant latency reduction demonstrated on the Cortana voice search task.
Improved recognition accuracy in certain scenarios due to latency reduction techniques.
Analysis provided on the behavior of streaming S2S models with proposed strategies.
Abstract
Recently, a few novel streaming attention-based sequence-to-sequence (S2S) models have been proposed to perform online speech recognition with linear-time decoding complexity. However, in these models, the decisions to generate tokens are delayed compared to the actual acoustic boundaries since their unidirectional encoders lack future information. This leads to an inevitable latency during inference. To alleviate this issue and reduce latency, we propose several strategies during training by leveraging external hard alignments extracted from the hybrid model. We investigate utilizing the alignments in both the encoder and the decoder. On the encoder side, (1) multi-task learning and (2) pre-training with the framewise classification task are studied. On the decoder side, we (3) remove inappropriate alignment paths beyond an acceptable latency during the alignment marginalization, and…
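To make strategy (3) more concrete, here is a minimal sketch of how alignment paths beyond an acceptable latency might be masked out using token boundaries from a hard alignment. It is an illustration under stated assumptions, not the paper's implementation: the function names `latency_mask` and `emission_delays`, the parameter `max_delay`, and the choice to leave all frames before the boundary allowed are hypothetical, since the abstract does not specify the exact pruning rule.

```python
from typing import List


def latency_mask(boundaries: List[int], num_frames: int, max_delay: int) -> List[List[bool]]:
    """Build a token-by-frame mask of allowed emission positions.

    boundaries[i] is the reference frame index of token i, taken from an
    external hard alignment (assumed to be given). mask[i][j] is True when
    token i may be emitted at frame j, i.e. no later than max_delay frames
    after its reference boundary. Allowing every earlier frame is an
    assumption made for this sketch.
    """
    mask = []
    for boundary in boundaries:
        limit = boundary + max_delay
        mask.append([frame <= limit for frame in range(num_frames)])
    return mask


def emission_delays(emitted: List[int], boundaries: List[int]) -> List[int]:
    """Per-token delay in frames: emission frame minus reference boundary."""
    return [e - b for e, b in zip(emitted, boundaries)]


if __name__ == "__main__":
    # Two tokens with reference boundaries at frames 2 and 5, 8 frames total,
    # and at most 1 frame of tolerated delay.
    for row in latency_mask([2, 5], num_frames=8, max_delay=1):
        print(row)
    print(emission_delays([4, 5], [2, 5]))
```

In a training loop, such a mask would typically be applied to the alignment probabilities before marginalization, so that paths emitting a token too late contribute nothing to the objective. How the mask is combined with the model's attention mechanism depends on the specific streaming S2S architecture.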
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
