A comparable study of modeling units for end-to-end Mandarin speech recognition

Wei Zou; Dongwei Jiang; Shuaijiang Zhao; Xiangang Li

arXiv:1805.03832 · cs.CL · May 15, 2018 · 5 cites

TL;DR

This study compares different modeling units and end-to-end models for Mandarin speech recognition, demonstrating that Chinese characters are effective units and attention models outperform CTC models.

Contribution

It systematically evaluates the performance of phoneme, syllable, and Chinese character units in CTC and attention-based models for Mandarin speech recognition.

Findings

1. Chinese character units yield the best recognition performance in the attention-based model.

2. Attention-based models outperform CTC models in accuracy.

3. The Chinese character attention model achieves a CER of 5.68% on DidiCallcenter.
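The character error rate (CER) quoted above is conventionally computed as the edit distance between the reference and hypothesis character sequences, divided by the reference length. The following is a minimal sketch of that standard definition, not the authors' scoring script; the function names `edit_distance` and `cer` and the example sentences are illustrative assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, using a rolling DP row."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref to each hyp prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference symbol
                curr[j - 1] + 1,     # insertion of a hypothesis symbol
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical example: one substituted character in a six-character reference.
print(cer("今天天气很好", "今天天汽很好"))
```

Because the distance counts insertions as well as substitutions and deletions, a CER above 100% is possible when the hypothesis is much longer than the reference.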

Abstract

End-to-end speech recognition has become increasingly popular for Mandarin and has achieved impressive performance. Mandarin is a tonal language, which distinguishes it from English and requires special treatment of the acoustic modeling units. Several kinds of modeling units have been used for Mandarin, such as phonemes, syllables, and Chinese characters. In this work, we explore two major end-to-end models for Mandarin speech recognition: the connectionist temporal classification (CTC) model and the attention-based encoder-decoder model. We compare the performance of three modeling units of different scales: context-dependent phonemes (CDP), syllables with tone, and Chinese characters. We find that all types of modeling units achieve similar character error rates (CER) in the CTC model, and that the Chinese character attention model performs better than the syllable attention…
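For readers unfamiliar with CTC, the sketch below illustrates the standard CTC collapsing rule that maps a frame-level label sequence to an output sequence: merge consecutive repeats, then drop blanks. It is a generic illustration, not the decoder used in the paper; the `BLANK` token name, the `ctc_collapse` function, and the example frames are assumptions made for this example.

```python
from itertools import groupby

BLANK = "<blank>"  # assumed name for the CTC blank symbol


def ctc_collapse(frame_labels, blank=BLANK):
    """Apply the CTC collapsing rule to per-frame labels.

    Step 1: merge runs of identical consecutive labels.
    Step 2: remove blank symbols.
    """
    merged = (label for label, _ in groupby(frame_labels))
    return [label for label in merged if label != blank]


# Hypothetical frame-level output with Chinese characters as modeling units.
frames = ["你", "你", BLANK, "好", "好", BLANK, BLANK]
print(ctc_collapse(frames))
```

The blank is what allows a genuinely repeated unit to survive: `["a", BLANK, "a"]` collapses to two symbols, while `["a", "a"]` collapses to one. The same rule applies whether the units are phonemes, syllables, or characters; only the label inventory changes.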

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Natural Language Processing Techniques