Streaming End-to-End Multilingual Speech Recognition with Joint Language Identification

Chao Zhang; Bo Li; Tara Sainath; Trevor Strohman; Sepand Mavandadi; Shuo-yiin Chang; Parisa Haghani

arXiv:2209.06058 · eess.AS · September 14, 2022

TL;DR

This paper introduces a streaming multilingual speech recognition model that integrates a per-frame language identification (LID) predictor directly into the cascaded-encoder RNN-T architecture, achieving accurate streaming LID prediction with little extra test-time cost.

Contribution

It proposes a cascaded-encoder RNN-T model with an integrated per-frame language ID predictor for streaming multilingual speech recognition.

Findings

01 Achieves 96.2% language ID accuracy

02 Maintains same second-pass WER as oracle LID

03 Operates with low test-time cost

Abstract

Language identification is critical for many downstream tasks in automatic speech recognition (ASR), and is beneficial to integrate into multilingual end-to-end ASR as an additional task. In this paper, we propose to modify the structure of the cascaded-encoder-based recurrent neural network transducer (RNN-T) model by integrating a per-frame language identifier (LID) predictor. RNN-T with cascaded encoders can achieve streaming ASR with low latency using first-pass decoding with no right-context, and achieve lower word error rates (WERs) using second-pass decoding with longer right-context. By leveraging such differences in the right-contexts and a streaming implementation of statistics pooling, the proposed method can achieve accurate streaming LID prediction with little extra test-time cost. Experimental results on a voice search dataset with 9 language locales show that the…
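The abstract mentions a streaming implementation of statistics pooling but does not describe it. One common way to make statistics pooling streaming, sketched here as an assumption and not as the authors' implementation, is to keep running sums and sums of squares so that a mean and standard deviation over all frames seen so far can be emitted at every frame. The names `StreamingStatsPooling` and `pool_stream` are hypothetical, and the frames are plain lists of floats standing in for encoder outputs.

```python
import math


class StreamingStatsPooling:
    """Running mean and standard deviation over all frames seen so far.

    Illustrative sketch only; the paper's actual layer may differ.
    """

    def __init__(self, dim):
        self.count = 0
        self.total = [0.0] * dim     # running sum per feature dimension
        self.total_sq = [0.0] * dim  # running sum of squares per dimension

    def update(self, frame):
        """Consume one frame and return [mean..., std...] over frames so far."""
        self.count += 1
        for i, x in enumerate(frame):
            self.total[i] += x
            self.total_sq[i] += x * x
        mean = [s / self.count for s in self.total]
        # Population variance E[x^2] - E[x]^2, clamped at 0 against rounding.
        std = [
            math.sqrt(max(sq / self.count - m * m, 0.0))
            for sq, m in zip(self.total_sq, mean)
        ]
        return mean + std


def pool_stream(frames):
    """Return the pooled statistics emitted after each frame of a sequence."""
    pooler = StreamingStatsPooling(len(frames[0]))
    return [pooler.update(f) for f in frames]
```

Because each update touches only the running sums, the per-frame cost is constant in the utterance length, which is in line with the "little extra test-time cost" the abstract claims. A per-frame classifier (not shown) would consume the pooled vector to produce the LID prediction.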

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing