Dual-mode ASR: Unify and Improve Streaming ASR with Full-context Modeling
Jiahui Yu, Wei Han, Anmol Gulati, Chung-Cheng Chiu, Bo Li, Tara N. Sainath, Yonghui Wu, Ruoming Pang

TL;DR
This paper introduces Dual-mode ASR, a unified end-to-end speech recognition framework that improves the accuracy and reduces the latency of streaming ASR by training it jointly with a full-context model under shared weights. The framework applies to both convolution-based and transformer-based architectures.
Contribution
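The work presents a unified training framework in which a single model with shared weights serves both streaming and full-context ASR. Weight sharing and knowledge distillation from the full-context mode improve the streaming mode, and the resulting models reach state-of-the-art results on LibriSpeech and MultiDomain.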
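To make the weight-sharing idea concrete, the following is a minimal sketch of one layer that runs in both modes from a single set of parameters. It is an illustration under stated assumptions, not the authors' implementation: the function name `dual_mode_conv1d`, the toy kernel, and the choice to realize streaming mode by masking the future-looking kernel taps are all hypothetical choices made here for clarity.

```python
from typing import List


def dual_mode_conv1d(x: List[float], kernel: List[float], streaming: bool) -> List[float]:
    """Apply one shared, centered kernel in full-context or streaming mode.

    Assumption for this sketch: streaming mode is obtained by masking the
    taps that look at future frames, so the output at time t depends only
    on x[0..t]. Full-context mode uses every tap of the same kernel.
    """
    k = len(kernel)
    assert k % 2 == 1, "an odd width keeps the kernel centered on frame t"
    half = k // 2
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j in range(k):
            offset = j - half              # negative = past, positive = future
            if streaming and offset > 0:
                continue                   # mask future taps in streaming mode
            idx = t + offset
            if 0 <= idx < len(x):          # implicit zero padding at the edges
                acc += kernel[j] * x[idx]
        out.append(acc)
    return out


# Toy input and a single shared kernel (hypothetical values).
x = [1.0, 2.0, 3.0, 4.0]
kernel = [0.5, 1.0, 0.25]

full = dual_mode_conv1d(x, kernel, streaming=False)
stream = dual_mode_conv1d(x, kernel, streaming=True)
```

Both calls read the same `kernel`, so any update to it during training would affect both modes. In streaming mode, changing a later frame leaves all earlier outputs untouched, which is the property that lets the model emit results before the utterance ends.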
Findings
Streaming ASR improves significantly in both latency and accuracy.
The framework applies effectively to both convolution-based and transformer-based models.
The models set new state-of-the-art results on the LibriSpeech and MultiDomain datasets.
Abstract
Streaming automatic speech recognition (ASR) aims to emit each hypothesized word as quickly and accurately as possible, while full-context ASR waits for the completion of a full speech utterance before emitting completed hypotheses. In this work, we propose a unified framework, Dual-mode ASR, to train a single end-to-end ASR model with shared weights for both streaming and full-context speech recognition. We show that the latency and accuracy of streaming ASR significantly benefit from weight sharing and joint training of full-context ASR, especially with inplace knowledge distillation during the training. The Dual-mode ASR framework can be applied to recent state-of-the-art convolution-based and transformer-based ASR networks. We present extensive experiments with two state-of-the-art ASR networks, ContextNet and Conformer, on two datasets, a widely used public dataset LibriSpeech and…
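The joint training with inplace knowledge distillation described in the abstract can be sketched as a single loss over the two modes of one model. This is a hedged illustration only: per-frame cross-entropy stands in for whatever ASR loss the networks actually use, and the names `dual_mode_loss` and `distill_weight` are hypothetical. The full-context output acts as the teacher and the streaming output as the student within the same training step.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one frame's logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: List[float], target: int) -> float:
    """Negative log-likelihood of the target class (stand-in for the ASR loss)."""
    return -math.log(probs[target])


def kl_divergence(teacher: List[float], student: List[float]) -> float:
    """KL(teacher || student); the teacher is treated as a fixed target."""
    return sum(p * math.log(p / q) for p, q in zip(teacher, student) if p > 0.0)


def dual_mode_loss(full_logits: List[float], stream_logits: List[float],
                   target: int, distill_weight: float = 1.0) -> float:
    """Supervised loss for both modes plus a distillation term.

    Assumption: in a real framework the teacher probabilities would be
    wrapped in a stop-gradient so that only the streaming mode is pulled
    toward the full-context predictions.
    """
    full_p = softmax(full_logits)      # full-context mode (teacher)
    stream_p = softmax(stream_logits)  # streaming mode (student)
    supervised = cross_entropy(full_p, target) + cross_entropy(stream_p, target)
    distill = kl_divergence(full_p, stream_p)
    return supervised + distill_weight * distill


# Hypothetical logits for one frame with three output classes.
loss = dual_mode_loss([2.0, 0.5, -1.0], [1.0, 0.8, -0.5], target=0)
```

When the two modes agree exactly, the distillation term vanishes and the loss reduces to the sum of the two supervised terms. The larger the disagreement, the more the streaming mode is penalized for deviating from the full-context predictions.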
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
Methods: Knowledge Distillation
