Unified Streaming and Non-streaming Two-pass End-to-end Model for Speech Recognition

Binbin Zhang; Di Wu; Zhuoyuan Yao; Xiong Wang; Fan Yu; Chao Yang; Liyong Guo; Yaguang Hu; Lei Xie; Xin Lei

arXiv:2012.05481·cs.SD·December 30, 2021·46 cites

Open Access 5 Repos

TL;DR

This paper introduces a unified two-pass end-to-end speech recognition model that supports both streaming and non-streaming inference, using modified conformer encoder layers and dynamic chunk-based attention, with inference latency controlled by the chunk size.

Contribution

The paper proposes a novel unified model that seamlessly integrates streaming and non-streaming speech recognition within a single framework using dynamic chunk-based attention.
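One way to picture chunk-based attention is as a mask over encoder frames. The sketch below is illustrative only and is not taken from the paper's code: the function name `chunk_attention_mask` is hypothetical, and it assumes each frame may attend to all earlier chunks plus its own chunk, with no access beyond the end of that chunk. Under that assumption, a chunk covering the whole utterance gives full (non-streaming) attention, while a small chunk gives limited right context.

```python
def chunk_attention_mask(num_frames, chunk_size):
    """Return a num_frames x num_frames boolean mask (hypothetical helper).

    mask[i][j] is True when frame i may attend to frame j. Assumption: a frame
    sees all earlier chunks and its own chunk, but nothing past its chunk end.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    mask = []
    for i in range(num_frames):
        # Index one past the last frame of the chunk that contains frame i.
        chunk_end = min((i // chunk_size + 1) * chunk_size, num_frames)
        mask.append([j < chunk_end for j in range(num_frames)])
    return mask


# One chunk spanning the utterance: every frame sees every frame.
full = chunk_attention_mask(6, chunk_size=6)
# Three chunks of two frames: right context stops at each chunk boundary.
stream = chunk_attention_mask(6, chunk_size=2)
```

Because the same function covers both cases, a single model trained with varying chunk sizes could plausibly be run in either mode by changing `chunk_size` alone, which is the sense in which the attention is "dynamic" here.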

Findings

01

Achieves a 5.60% relative CER reduction on AISHELL-1 in non-streaming ASR.

02

Attains 5.42% CER with 640ms latency in streaming ASR.

03

Enables flexible inference latency control by adjusting chunk size.
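The relation between chunk size and latency can be sketched with simple arithmetic. The numbers below are assumptions, not values stated on this page: a 10 ms frame shift and 4x encoder subsampling, so that each encoder frame spans 40 ms. Under those assumptions a chunk of 16 encoder frames corresponds to 640 ms, which is one combination consistent with the latency quoted in the findings.

```python
def chunk_latency_ms(chunk_size, frame_shift_ms=10, subsampling=4):
    """Audio duration covered by one chunk, in milliseconds.

    frame_shift_ms and subsampling are assumed defaults, not values
    confirmed by the source.
    """
    return chunk_size * frame_shift_ms * subsampling


latency_16 = chunk_latency_ms(16)  # 16 frames * 10 ms * 4
latency_4 = chunk_latency_ms(4)    # 4 frames * 10 ms * 4
```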

Abstract

In this paper, we present a novel two-pass approach to unify streaming and non-streaming end-to-end (E2E) speech recognition in a single model. Our model adopts the hybrid CTC/attention architecture, in which the conformer layers in the encoder are modified. We propose a dynamic chunk-based attention strategy to allow arbitrary right context length. At inference time, the CTC decoder generates n-best hypotheses in a streaming way. The inference latency could be easily controlled by only changing the chunk size. The CTC hypotheses are then rescored by the attention decoder to get the final result. This efficient rescoring process causes very little sentence-level latency. Our experiments on the open 170-hour AISHELL-1 dataset show that the proposed method can unify the streaming and non-streaming model simply and efficiently. On the AISHELL-1 test set, our unified model achieves 5.60%…
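The two-pass flow described in the abstract (CTC n-best first, attention rescoring second) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `rescore` function, the `ctc_weight` interpolation, and the toy scores are all assumptions made for the example, and the abstract does not say how the two scores are combined.

```python
def rescore(nbest, attention_score, ctc_weight=0.5):
    """Return the hypothesis text with the best combined score.

    nbest: list of (text, ctc_log_prob) pairs from the first (CTC) pass.
    attention_score: callable mapping text -> log-probability; it stands in
        for the attention decoder and is hypothetical.
    ctc_weight: assumed interpolation weight on the CTC score.
    """
    if not nbest:
        raise ValueError("nbest must not be empty")
    best = max(nbest, key=lambda h: attention_score(h[0]) + ctc_weight * h[1])
    return best[0]


# Made-up scores for illustration only.
toy_attention = {"ni hao": -1.0, "ni hou": -3.0}.get
nbest = [("ni hou", -0.9), ("ni hao", -1.1)]  # CTC slightly prefers "ni hou"
best = rescore(nbest, toy_attention)
```

Here the second pass overrides the first-pass ranking because the attention score for "ni hao" is much higher. Since only a handful of complete hypotheses are scored once at the end of the utterance, this step would add little sentence-level latency, in line with the abstract's claim.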

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing