Dynamic Latency for CTC-Based Streaming Automatic Speech Recognition With Emformer
Jingyu Sun, Guiping Zhong, Dinghao Zhou, Baoxiang Li

TL;DR
This paper introduces a dynamic latency training method and an augment memory transformer block for CTC-based streaming speech recognition. It reports a 6.0% relative WER reduction on LibriSpeech test-clean and caches keys and values across chunks to reduce computation.
Contribution
The paper combines a dynamic latency training approach with augment memory transformer blocks so that a single streaming ASR model supports both low- and high-latency inference.
Findings
Achieved a 6.0% relative WER reduction on LibriSpeech test-clean
Reduced computational complexity by caching keys and values for reuse in the next chunk
Supported low- and high-latency inference simultaneously
Abstract
Streaming automatic speech recognition models frequently perform worse than non-streaming models because they lack future context. To improve the performance of the streaming model and reduce its computational complexity, this paper employs a frame-level model that uses an efficient augment memory transformer block and a dynamic latency training method. Long-range history context is stored in the augment memory bank to complement the limited history context used in the encoder. Keys and values are cached and reused for the next chunk to reduce computation. A dynamic latency training method is then proposed to obtain better performance and to support low- and high-latency inference simultaneously. The experiments are conducted on the 960h LibriSpeech benchmark data set. With an…
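The two mechanisms named in the abstract, chunk-wise attention with limited history and key/value caching, can be illustrated with a minimal sketch. This is not the authors' implementation: the function and class names (chunk_attention_mask, sample_chunk_size, KeyValueCache) are hypothetical, the augment memory bank is omitted, and sampling one chunk size per training batch is only an assumption about how dynamic latency training might be realized, since the abstract does not give the details.

```python
import random
from collections import deque


def chunk_attention_mask(num_frames, chunk_size, left_chunks):
    """Return mask[i][j] == True when frame i may attend to frame j.

    Each frame sees every frame in its own chunk plus `left_chunks`
    preceding chunks; it never sees a later chunk.
    """
    mask = []
    for i in range(num_frames):
        chunk_idx = i // chunk_size
        start = max(0, (chunk_idx - left_chunks) * chunk_size)
        end = min(num_frames, (chunk_idx + 1) * chunk_size)
        mask.append([start <= j < end for j in range(num_frames)])
    return mask


def sample_chunk_size(candidates, rng):
    """Assumed training-time step: draw one chunk size per batch so a
    single model is exposed to several latencies."""
    return rng.choice(candidates)


class KeyValueCache:
    """Holds keys/values of the most recent `max_frames` frames so the
    next chunk can reuse them instead of recomputing them."""

    def __init__(self, max_frames):
        self.keys = deque(maxlen=max_frames)
        self.values = deque(maxlen=max_frames)

    def extend(self, new_keys, new_values):
        # Older entries fall off the left once max_frames is exceeded.
        self.keys.extend(new_keys)
        self.values.extend(new_values)

    def context(self):
        return list(self.keys), list(self.values)


if __name__ == "__main__":
    rng = random.Random(0)
    size = sample_chunk_size([4, 8, 16], rng)
    mask = chunk_attention_mask(num_frames=32, chunk_size=size, left_chunks=1)
    print(size, sum(mask[-1]))  # chunk size and visible frames for the last frame
```

In this sketch a larger chunk size means more look-ahead inside each chunk and therefore higher latency, which is why varying it during training is one plausible way to cover both low- and high-latency inference with one model.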
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Advanced Data Compression Techniques
Methods: Attention Is All You Need · Linear Layer · Residual Connection · Softmax · Dropout · Position-Wise Feed-Forward Layer · Dense Connections · Byte Pair Encoding · Label Smoothing · Multi-Head Attention
