Two Stage Contextual Word Filtering for Context bias in Unified Streaming and Non-streaming Transducer

Zhanheng Yang; Sining Sun; Xiong Wang; Yike Zhang; Long Ma; Lei Xie

arXiv:2301.06735·cs.SD·June 9, 2023

Open Access

TL;DR

This paper introduces a two-stage filtering method for contextual words in a unified streaming and non-streaming end-to-end speech recognition system, reducing CER by over 20% against the baseline while keeping RTF within 0.15 with more than 6,000 contextual words.

Contribution

It presents a novel approach to generate high-quality contextual lists using phone-level streaming output, enhancing recognition accuracy and inference speed.

Findings

01 Over 20% CER reduction compared to baseline

02 RTF stabilized within 0.15 with 6,000+ contextual words

03 Effective for both streaming and non-streaming ASR
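
For readers unfamiliar with the two metrics in these findings, the sketch below shows how character error rate (CER) and real-time factor (RTF) are conventionally computed. The function names are illustrative and not taken from the paper, and the page does not say whether the 20% figure is a relative or an absolute reduction; the `relative_reduction` helper assumes relative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (here, characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    return edit_distance(ref, hyp) / len(ref)


def rtf(processing_seconds, audio_seconds):
    """Real-time factor: decoding time divided by audio duration."""
    return processing_seconds / audio_seconds


def relative_reduction(baseline, improved):
    """Relative reduction of an error rate (assumed meaning of 'CER reduction')."""
    return (baseline - improved) / baseline
```

Under these definitions, an RTF of 0.15 means that 10 seconds of audio is decoded in 1.5 seconds, and a drop from 10% to 8% CER is a 20% relative reduction.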

Abstract

It is difficult for an E2E ASR system to recognize words such as entities appearing infrequently in the training data. A widely used method to mitigate this issue is feeding contextual information into the acoustic model. Previous works have proven that a compact and accurate contextual list can boost the performance significantly. In this paper, we propose an efficient approach to obtain a high quality contextual list for a unified streaming/non-streaming based E2E model. Specifically, we make use of the phone-level streaming output to first filter the predefined contextual word list and then fuse it into the non-causal encoder and decoder to generate the final recognition results. Our approach improves the accuracy of the contextual ASR system and speeds up the inference process. Experiments on two datasets demonstrate over 20% CER reduction compared to the baseline system. Meanwhile, the RTF…
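
The first-stage idea in the abstract, using phone-level streaming output to shrink a predefined contextual word list, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the abstract does not specify the matching rule, so this sketch swaps in approximate substring matching by edit distance, and the names `min_window_distance`, `filter_context_words`, `lexicon`, and `max_errors` as well as the phone symbols are hypothetical. The second stage (fusing the filtered list into the non-causal encoder and decoder) is not shown.

```python
def min_window_distance(word_phones, stream_phones):
    """Smallest edit distance between word_phones and any contiguous
    span of stream_phones (approximate substring matching)."""
    # Row 0 is all zeros so that a match may start at any stream position.
    prev = [0] * (len(stream_phones) + 1)
    for i, p in enumerate(word_phones, start=1):
        curr = [i]
        for j, s in enumerate(stream_phones, start=1):
            cost = 0 if p == s else 1
            curr.append(min(prev[j] + 1,          # phone missing from stream
                            curr[j - 1] + 1,      # extra phone in stream
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    # The match may end at any stream position.
    return min(prev)


def filter_context_words(lexicon, stream_phones, max_errors=1):
    """Keep contextual words whose phone sequence roughly occurs in the
    first-pass phone output. Hypothetical rule, not the paper's."""
    return [word for word, phones in lexicon.items()
            if min_window_distance(phones, stream_phones) <= max_errors]


# Hypothetical contextual list with made-up phone symbols.
lexicon = {
    "lei xie": ["l", "ei", "x", "ie"],
    "long ma": ["l", "ong", "m", "a"],
    "ni hao": ["n", "i", "h", "a"],
}
stream_phones = ["n", "i", "h", "ao", "l", "ei", "x", "ie"]
kept = filter_context_words(lexicon, stream_phones)
```

With these made-up inputs, `kept` retains "lei xie" (exact match) and "ni hao" (one phone off) and drops "long ma". Tolerating a small number of phone errors is a design choice of this sketch, meant to reflect that a streaming first pass can be imperfect; a smaller filtered list is what would let the later biasing step stay cheap as the full list grows.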

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings