Minimum Latency Training Strategies for Streaming Sequence-to-Sequence ASR

Hirofumi Inaguma; Yashesh Gaur; Liang Lu; Jinyu Li; Yifan Gong

arXiv:2004.05009 · cs.CL · May 18, 2020 · 1 citation

TL;DR

This paper introduces training strategies leveraging external alignments to reduce latency in streaming sequence-to-sequence speech recognition models, improving real-time performance without sacrificing accuracy.

Contribution

It proposes novel training methods using external alignments, including multi-task learning, pre-training, alignment path pruning, and latency loss minimization, to reduce inference latency in streaming ASR models.

Findings

01. Significant latency reduction demonstrated on the Cortana voice search task.

02. In certain scenarios, the latency reduction techniques also improved recognition accuracy.

03. An analysis of how streaming S2S models behave under the proposed strategies.

Abstract

Recently, a few novel streaming attention-based sequence-to-sequence (S2S) models have been proposed to perform online speech recognition with linear-time decoding complexity. However, in these models, the decisions to generate tokens are delayed relative to the actual acoustic boundaries because their unidirectional encoders lack future information. This leads to unavoidable latency during inference. To alleviate this issue and reduce latency, we propose several training strategies that leverage external hard alignments extracted from the hybrid model. We investigate using the alignments in both the encoder and the decoder. On the encoder side, (1) multi-task learning and (2) pre-training with the framewise classification task are studied. On the decoder side, we (3) remove inappropriate alignment paths beyond an acceptable latency during the alignment marginalization, and…
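Strategy (3) in the abstract, removing alignment paths that exceed an acceptable latency, can be pictured as a mask over (token, frame) positions. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the `max_delay` parameter, and the choice to constrain only late emissions are all hypothetical, and the per-token boundaries stand in for the external hard alignments mentioned in the abstract.

```python
from typing import List


def allowed_emission_mask(
    boundaries: List[int], num_frames: int, max_delay: int
) -> List[List[bool]]:
    """Build mask[i][t], True when token i may be emitted at frame t.

    boundaries[i] is the reference end frame of token i (assumed to come
    from an external hard alignment). Emissions later than
    boundaries[i] + max_delay are disallowed; earlier frames are left
    unconstrained (an assumption of this sketch).
    """
    mask = []
    for boundary in boundaries:
        latest = min(boundary + max_delay, num_frames - 1)
        mask.append([t <= latest for t in range(num_frames)])
    return mask


def prune_emission_probs(
    probs: List[List[float]], mask: List[List[bool]]
) -> List[List[float]]:
    """Zero out emission probabilities at disallowed (token, frame) cells.

    Any marginalization over alignment paths that multiplies these values
    would then give zero weight to paths that emit a token too late.
    """
    return [
        [p if keep else 0.0 for p, keep in zip(row, row_mask)]
        for row, row_mask in zip(probs, mask)
    ]


if __name__ == "__main__":
    # Two tokens whose reference boundaries are frames 2 and 5 (toy values).
    mask = allowed_emission_mask([2, 5], num_frames=8, max_delay=1)
    probs = [[0.1] * 8, [0.2] * 8]
    for row in prune_emission_probs(probs, mask):
        print(row)
```

A smaller `max_delay` removes more paths and so pushes the model toward earlier token emission, at the cost of a tighter constraint on what it can learn from the data.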

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing