A comparable study of modeling units for end-to-end Mandarin speech recognition
Wei Zou, Dongwei Jiang, Shuaijiang Zhao, Xiangang Li

TL;DR
This study compares different modeling units and end-to-end models for Mandarin speech recognition, demonstrating that Chinese characters are effective units and attention models outperform CTC models.
Contribution
It systematically evaluates the performance of phoneme, syllable, and Chinese character units in CTC and attention-based models for Mandarin speech recognition.
Findings
Chinese character units yield the best recognition performance.
Attention-based models outperform CTC models in accuracy.
The Chinese character attention model achieves a CER of 5.68% on DidiCallcenter.
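For readers unfamiliar with the metric, character error rate (CER) is conventionally the character-level edit distance between the recognized text and the reference transcript, divided by the reference length. The sketch below illustrates that standard definition only; it is not the authors' scoring code, and the function names and the example sentences are assumptions made for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty hypothesis takes i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference character
                curr[j - 1] + 1,     # insertion of a hypothesis character
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    if len(ref) == 0:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical example: one substituted character in a six-character reference.
reference = "今天天气很好"
hypothesis = "今天天汽很好"
print(f"{cer(reference, hypothesis):.2%}")  # prints 16.67%
```

Because Python strings iterate by Unicode code point, each Chinese character counts as one unit here, which matches the character-level scoring the metric assumes.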
Abstract
End-to-end speech recognition has become increasingly popular for Mandarin and has achieved strong performance. Mandarin is a tonal language, which distinguishes it from English and requires special treatment of the acoustic modeling units. Several kinds of modeling units have been used for Mandarin, such as phonemes, syllables, and Chinese characters. In this work, we explore two major end-to-end models for Mandarin speech recognition: the connectionist temporal classification (CTC) model and the attention-based encoder-decoder model. We compare the performance of three modeling units of different scales: context-dependent phonemes (CDP), syllables with tone, and Chinese characters. We find that all types of modeling units achieve similar character error rates (CER) in the CTC model, and that the Chinese character attention model performs better than the syllable attention…
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Natural Language Processing Techniques
