Attention based on-device streaming speech recognition with large speech corpus

Kwangyoun Kim; Kyungmin Lee; Dhananjaya Gowda; Junmo Park; Sungsoo Kim; Sichen Jin; Young-Yoon Lee; Jinsu Yeo; Daehyun Kim; Seokyeong Jung; Jungin Lee; Myoungji Han; Chanwoo Kim

arXiv:2001.00577·eess.AS·January 6, 2020

TL;DR

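This paper introduces an on-device streaming speech recognition system based on monotonic chunk-wise attention (MoChA) and trained on a large (> 10K hours) corpus. It reports a word recognition rate of around 90% on the general domain, model compression by more than 3.4 times, and domain adaptation through fusion with n-gram models.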

Contribution

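The paper presents an on-device ASR system that combines large-scale training (joint CTC and CE losses, MWER training, layer-wise pre-training, and data augmentation), model compression (iterative hyper low-rank approximation and 8-bit quantization), and on-demand domain adaptation through fusion with statistical n-gram models.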

Findings

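01

Achieved a word recognition rate of around 90% on the general domain.

02

Compressed the models by more than 3.4 times with minimal accuracy loss, then brought the final model size below 39 MB with 8-bit quantization.

03

Improved domain-specific WER by a relative 36% through fusion with statistical n-gram models.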
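
The n-gram fusion in finding 03 can be illustrated with a minimal sketch. The summary does not say how the two models are combined, so this assumes shallow fusion, a common choice in which the ASR log-probability and a weighted n-gram log-probability are added when ranking hypotheses. The function names, the `lm_weight` value, and the example probabilities are all hypothetical and are not taken from the paper.

```python
import math


def shallow_fusion_score(asr_logprob, lm_logprob, lm_weight=0.3):
    """Combine ASR and n-gram scores in the log domain (assumed shallow fusion)."""
    return asr_logprob + lm_weight * lm_logprob


def pick_best(hypotheses, lm_logprobs, lm_weight=0.3):
    """Return the hypothesis text with the highest fused score.

    hypotheses:  list of (text, asr_logprob) pairs
    lm_logprobs: dict mapping text -> n-gram log-probability
    """
    scored = [
        (text, shallow_fusion_score(asr_lp, lm_logprobs[text], lm_weight))
        for text, asr_lp in hypotheses
    ]
    return max(scored, key=lambda pair: pair[1])[0]


# Hypothetical example: the ASR model slightly prefers "call tom",
# while a domain-specific n-gram model strongly prefers "call mom".
hypotheses = [("call mom", math.log(0.40)), ("call tom", math.log(0.45))]
lm_logprobs = {"call mom": math.log(0.30), "call tom": math.log(0.05)}

best_without_lm = pick_best(hypotheses, lm_logprobs, lm_weight=0.0)
best_with_lm = pick_best(hypotheses, lm_logprobs, lm_weight=0.3)
```

With `lm_weight=0.0` the ranking depends on the ASR score alone. With a positive weight, the n-gram score can shift the choice toward the phrase that is more likely in the target domain, without retraining the ASR model.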

Abstract

In this paper, we present a new on-device automatic speech recognition (ASR) system based on monotonic chunk-wise attention (MoChA) models trained on a large (> 10K hours) corpus. We attained a word recognition rate of around 90% for the general domain, mainly by using joint training with connectionist temporal classification (CTC) and cross entropy (CE) losses, minimum word error rate (MWER) training, layer-wise pre-training, and data augmentation methods. In addition, we made our models more than 3.4 times smaller using an iterative hyper low-rank approximation (LRA) method while minimizing the degradation in recognition accuracy. The memory footprint was further reduced with 8-bit quantization, bringing the final model size below 39 MB. For on-demand adaptation, we fused the MoChA models with statistical n-gram models, and we could achieve a 36% relative improvement on…
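
The compression steps in the abstract can be made concrete with a small sketch. It assumes a plain rank-r factorization of a single weight matrix and a simple affine 8-bit quantizer. It is not the authors' iterative hyper LRA procedure, and the matrix sizes and function names are hypothetical.

```python
def low_rank_compression_ratio(rows, cols, rank):
    """Parameter-count ratio when a rows x cols matrix is replaced by
    two factors of shape (rows x rank) and (rank x cols)."""
    original = rows * cols
    factored = rank * (rows + cols)
    return original / factored


def quantize_uint8(weights):
    """Affine 8-bit quantization: map floats in [min, max] to integers 0..255."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 if all weights are equal
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_uint8(codes, scale, lo):
    """Recover approximate float weights from the 8-bit codes."""
    return [c * scale + lo for c in codes]


# Hypothetical layer: a 1024 x 1024 matrix factored at rank 128.
ratio = low_rank_compression_ratio(1024, 1024, 128)

# Hypothetical weights, quantized and then reconstructed.
weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
codes, scale, lo = quantize_uint8(weights)
restored = dequantize_uint8(codes, scale, lo)
```

For this hypothetical layer the factorization cuts the parameter count by a factor of 4. Storing each remaining weight in 8 bits rather than 32 reduces storage by roughly another factor of 4, at the cost of a rounding error of at most about half a quantization step per weight.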

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing