Dual-mode ASR: Unify and Improve Streaming ASR with Full-context Modeling

Jiahui Yu; Wei Han; Anmol Gulati; Chung-Cheng Chiu; Bo Li; Tara N. Sainath; Yonghui Wu; Ruoming Pang

arXiv:2010.06030 · cs.CL · January 28, 2021 · 24 cites

TL;DR

This paper introduces Dual-mode ASR, a unified end-to-end speech recognition framework that trains a single shared-weight model for both streaming and full-context recognition. Joint training with the full-context mode improves both the accuracy and the latency of streaming ASR, and the framework applies to convolution-based and transformer-based networks.

Contribution

The work presents a unified training framework for streaming and full-context ASR. It improves streaming latency and accuracy through weight sharing, joint training, and inplace knowledge distillation, and it reports state-of-the-art results on the LibriSpeech and MultiDomain datasets.

Findings

01

Significant improvements in latency and accuracy for streaming ASR.

02

Effective application to both convolutional and transformer-based models.

03

New state-of-the-art results on the LibriSpeech and MultiDomain datasets.

Abstract

Streaming automatic speech recognition (ASR) aims to emit each hypothesized word as quickly and accurately as possible, while full-context ASR waits for the completion of a full speech utterance before emitting completed hypotheses. In this work, we propose a unified framework, Dual-mode ASR, to train a single end-to-end ASR model with shared weights for both streaming and full-context speech recognition. We show that the latency and accuracy of streaming ASR significantly benefit from weight sharing and joint training of full-context ASR, especially with inplace knowledge distillation during the training. The Dual-mode ASR framework can be applied to recent state-of-the-art convolution-based and transformer-based ASR networks. We present extensive experiments with two state-of-the-art ASR networks, ContextNet and Conformer, on two datasets, a widely used public dataset LibriSpeech and…
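The two ideas in the abstract, one set of weights serving both modes and a distillation signal between the modes, can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a single 1-D convolution whose future-looking taps are skipped in streaming mode, per-frame softmax outputs rather than a real ASR decoder, and the full-context mode acting as the teacher. The names `dual_mode_conv1d`, `frame_logits`, and `inplace_distillation_loss` are hypothetical.

```python
import math


def dual_mode_conv1d(signal, kernel, streaming):
    """Apply one shared, odd-length kernel in either mode.

    Full-context mode uses every tap (centered, zero-padded at the edges).
    Streaming mode skips taps that would read future samples, so the same
    weights behave as a causal filter. (Assumption: this masking scheme is
    only an illustration of weight sharing across modes.)
    """
    half = len(kernel) // 2
    out = []
    for t in range(len(signal)):
        total = 0.0
        for j, weight in enumerate(kernel):
            offset = j - half  # offset > 0 points at a future sample
            if streaming and offset > 0:
                continue
            idx = t + offset
            if 0 <= idx < len(signal):
                total += weight * signal[idx]
        out.append(total)
    return out


def frame_logits(signal, kernels, streaming):
    """One kernel per output class; returns a list of per-frame logit lists."""
    per_class = [dual_mode_conv1d(signal, k, streaming) for k in kernels]
    return [list(frame) for frame in zip(*per_class)]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) between two softmax distributions."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def inplace_distillation_loss(signal, kernels):
    """Run both modes with the same kernels in one pass.

    Assumption: the full-context output is the teacher and would be treated
    as a constant (no gradient) in a real training framework.
    """
    teacher = frame_logits(signal, kernels, streaming=False)
    student = frame_logits(signal, kernels, streaming=True)
    per_frame = [kl_divergence(t, s) for t, s in zip(teacher, student)]
    return sum(per_frame) / len(per_frame)


# Toy input: five "audio" samples and two output classes.
signal = [0.5, -1.0, 2.0, 0.0, 1.5]
kernels = [[0.2, 0.5, 0.3], [-0.4, 0.1, 0.6]]
loss = inplace_distillation_loss(signal, kernels)
```

In a real training loop this term would be added to the recognition losses computed for each mode. Here it only shows that both modes read the same weights and that the streaming output can be pulled toward the full-context output within a single forward pass.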

Videos

Dual-mode ASR: Unify and Improve Streaming ASR with Full-context Modeling · SlidesLive

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing

Methods: Knowledge Distillation