Delay-penalized transducer for low-latency streaming ASR

Wei Kang; Zengwei Yao; Fangjun Kuang; Liyong Guo; Xiaoyu Yang; Long Lin; Piotr Żelasko; Daniel Povey

arXiv:2211.00490·eess.AS·November 2, 2022·1 cites

Open Access · 1 Repo

TL;DR

This paper introduces a simple delay-penalized transducer method for streaming ASR that reduces symbol delay while largely preserving recognition accuracy. Its delay-accuracy trade-off is similar to that of FastEmit, but with a clearer theoretical justification.

Contribution

It proposes a novel delay penalty technique for transducer models that balances delay and accuracy without relying on external alignments.

Findings

01 Significantly reduces symbol delay in streaming ASR models.

02 Achieves similar delay-accuracy trade-off to FastEmit with better theoretical justification.

03 Applicable to both Conformer and LSTM models.

Abstract

In streaming automatic speech recognition (ASR), it is desirable to reduce latency as much as possible with minimal impact on recognition accuracy. Although a few existing methods are able to achieve this goal, they are difficult to implement due to their dependency on external alignments. In this paper, we propose a simple way to penalize symbol delay in transducer models, so that we can balance the trade-off between symbol delay and accuracy for streaming models without external alignments. Specifically, our method adds a small constant times (T/2 - t), where T is the number of frames and t is the current frame, to all the non-blank log-probabilities (after normalization) that are fed into the two-dimensional transducer recursion. For both streaming Conformer models and unidirectional long short-term memory (LSTM) models, experimental results show that it can significantly…
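The penalty described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the k2 implementation: the function name `apply_delay_penalty`, the parameters `penalty_scale` and `blank_id`, the nested-list layout `log_probs[t][u][v]`, and zero-based frame indexing for t are all assumptions made for illustration.

```python
import math


def log_softmax(row):
    """Normalize a list of logits into log-probabilities."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def apply_delay_penalty(log_probs, penalty_scale, blank_id=0):
    """Add penalty_scale * (T/2 - t) to every non-blank log-probability.

    log_probs[t][u][v] is assumed to hold normalized log-probabilities
    over the vocabulary at frame t and label position u. The frame index
    t is assumed to be zero-based. Returns a new nested list.
    """
    T = len(log_probs)
    penalized = []
    for t, frame in enumerate(log_probs):
        # Positive for early frames, negative for late frames.
        offset = penalty_scale * (T / 2 - t)
        penalized.append([
            [lp if v == blank_id else lp + offset for v, lp in enumerate(cell)]
            for cell in frame
        ])
    return penalized


# Toy lattice: T = 4 frames, 2 label positions, vocabulary of 3 (blank = 0).
logits = [[[0.0, 1.0, 2.0]] * 2 for _ in range(4)]
log_probs = [[log_softmax(cell) for cell in frame] for frame in logits]
penalized = apply_delay_penalty(log_probs, penalty_scale=0.01)
```

Because the offset is positive in the first half of the utterance and negative in the second, paths that emit non-blank symbols earlier receive a higher total score. The penalized values would then be fed into the transducer recursion in place of the original log-probabilities.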

Code & Models

Repositories

k2-fsa/k2 · PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing