Cascade RNN-Transducer: Syllable Based Streaming On-device Mandarin Speech Recognition with a Syllable-to-Character Converter

Xiong Wang; Zhuoyuan Yao; Xian Shi; Lei Xie

arXiv:2011.08469·cs.SD·November 18, 2020

TL;DR

This paper introduces a cascade RNN-T model for Mandarin speech recognition that first converts audio to syllables and then converts syllables to characters. The second stage can draw on large amounts of text data, which improves recognition accuracy over a character-based RNN-T while keeping latency similar.

Contribution

It proposes a novel cascade RNN-T architecture that strengthens language modeling in Mandarin ASR by adding a syllable-to-character conversion stage, which enables better use of text data.

Findings

01 Outperforms character-based RNN-T on Mandarin test sets

02 Achieves higher recognition quality with similar latency

03 Effectively leverages large text repositories for language modeling

Abstract

End-to-end models are favored in automatic speech recognition (ASR) because of their simplified system structure and superior performance. Among these models, the recurrent neural network transducer (RNN-T) has achieved significant progress in streaming on-device speech recognition because of its high accuracy and low latency. RNN-T adopts a prediction network to enhance language information, but its language modeling ability is limited because it still needs paired speech-text data for training. Further strengthening the language modeling ability through extra text data, such as shallow fusion with an external language model, brings only a small performance gain. Given that Mandarin Chinese is a character-based language and each character is pronounced as a tonal syllable, this paper proposes a novel cascade RNN-T approach to improve the language modeling ability of RNN-T. Our…
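
To illustrate why the second stage of such a cascade needs language context, here is a minimal sketch of syllable-to-character conversion. The truncated abstract does not describe the authors' converter, so a simple bigram Viterbi search stands in for it; the candidate table, the bigram scores, and the function name `convert` are all hypothetical toy values chosen for illustration.

```python
import math

# Hypothetical toy lexicon: each tonal syllable maps to several homophone characters.
CANDIDATES = {
    "ta1": ["他", "她"],
    "shi4": ["是", "事", "市"],
    "ren2": ["人", "仁"],
}

# Hypothetical bigram log-probabilities, standing in for a text-trained language model.
BIGRAM_LOGP = {
    ("<s>", "他"): math.log(0.6),
    ("<s>", "她"): math.log(0.4),
    ("他", "是"): math.log(0.7),
    ("她", "是"): math.log(0.7),
    ("是", "人"): math.log(0.5),
}
UNSEEN_LOGP = math.log(1e-4)  # flat penalty for character pairs not in the table


def convert(syllables):
    """Pick the best character sequence for a syllable sequence via Viterbi search."""
    # best[c] = (log score, character path) of the best hypothesis ending in c
    best = {"<s>": (0.0, [])}
    for syl in syllables:
        nxt = {}
        for char in CANDIDATES[syl]:
            nxt[char] = max(
                (score + BIGRAM_LOGP.get((prev, char), UNSEEN_LOGP), path + [char])
                for prev, (score, path) in best.items()
            )
        best = nxt
    return "".join(max(best.values())[1])


print(convert(["ta1", "shi4", "ren2"]))
```

The syllable sequence alone is ambiguous (`shi4` has three candidates here), and only the text-derived scores resolve it. Because those scores need no audio, a converter of this kind could in principle be trained on text-only data, which is the property the cascade approach aims to exploit.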

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing