Dynamic Latency for CTC-Based Streaming Automatic Speech Recognition With Emformer

Jingyu Sun; Guiping Zhong; Dinghao Zhou; Baoxiang Li

arXiv:2203.15613·cs.SD·March 30, 2022

Open Access

TL;DR

This paper combines an efficient augment memory transformer block with a dynamic latency training method for CTC-based streaming speech recognition. Key/value caching reduces computation, and the trained model supports both low- and high-latency inference.

Contribution

The paper proposes a dynamic latency training method, used together with an augment memory transformer encoder, so that a streaming ASR model performs better and supports low- and high-latency inference simultaneously.

Findings

01 Achieved 6.0% relative WER reduction on test-clean

02 Reduced computational complexity via caching mechanisms

03 Supported low and high latency inference simultaneously
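
One way to picture the third finding is chunk-based attention masking with a chunk size that changes from batch to batch during training. The sketch below is illustrative only and is not taken from the paper: the function names, the candidate chunk sizes, and the uniform sampling are all assumptions. It builds a mask in which each frame sees its own chunk plus a fixed number of chunks to the left, and draws a new chunk size for each training batch.

```python
import random


def chunk_attention_mask(num_frames, chunk_size, left_chunks):
    """Return mask[i][j] == True when frame i may attend to frame j.

    Each frame sees every frame in its own chunk plus `left_chunks`
    chunks of history. Larger chunks mean more look-ahead inside the
    chunk, and therefore higher latency.
    """
    mask = []
    for i in range(num_frames):
        chunk = i // chunk_size
        start = max(0, (chunk - left_chunks) * chunk_size)
        end = min(num_frames, (chunk + 1) * chunk_size)
        mask.append([start <= j < end for j in range(num_frames)])
    return mask


def sample_chunk_size(rng, choices=(4, 8, 16, 32)):
    """Pick a chunk size for one training batch (hypothetical choices)."""
    return rng.choice(choices)


rng = random.Random(0)
chunk_size = sample_chunk_size(rng)
mask = chunk_attention_mask(num_frames=8, chunk_size=chunk_size, left_chunks=1)
```

Training under many chunk sizes is what would let one set of weights run with a small chunk (low latency) or a large chunk (high latency) at inference time. The sampling scheme the authors actually use may differ.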

Abstract

Streaming automatic speech recognition models frequently perform worse than non-streaming models because they lack future context. To improve the performance of the streaming model and reduce its computational complexity, this paper employs a frame-level model that uses an efficient augment memory transformer block and a dynamic latency training method. The long-range history context is stored in the augment memory bank to complement the limited history context used in the encoder. Keys and values are cached and reused for the next chunk to reduce computation. A dynamic latency training method is then proposed to obtain better performance and to support low- and high-latency inference simultaneously. The experiments are conducted on the benchmark 960h LibriSpeech data set. With an…
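
The key/value caching step described in the abstract can be sketched as follows. This is a minimal single-head illustration in plain Python under stated assumptions, not the authors' implementation: `ChunkCache`, `attend`, and `max_left_frames` are hypothetical names, the keys and values are passed in directly rather than projected from encoder inputs, and the augment memory bank is omitted.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention for one query vector (single head)."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


class ChunkCache:
    """Keeps keys/values of recent frames so later chunks can reuse them."""

    def __init__(self, max_left_frames):
        self.max_left_frames = max_left_frames  # assumed history limit
        self.keys = []
        self.values = []

    def forward_chunk(self, queries, keys, values):
        # Attend over the cached history plus the current chunk.
        all_keys = self.keys + keys
        all_values = self.values + values
        outputs = [attend(q, all_keys, all_values) for q in queries]
        # Retain only the most recent frames for the next chunk.
        if self.max_left_frames > 0:
            self.keys = all_keys[-self.max_left_frames:]
            self.values = all_values[-self.max_left_frames:]
        else:
            self.keys, self.values = [], []
        return outputs
```

Because the history keys and values are stored rather than recomputed, each new chunk only pays for its own frames, which is the source of the computational saving the abstract mentions.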

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Advanced Data Compression Techniques

Methods: Attention Is All You Need · Linear Layer · Residual Connection · Softmax · Dropout · Position-Wise Feed-Forward Layer · Dense Connections · Byte Pair Encoding · Label Smoothing · Multi-Head Attention