Decoupling recognition and transcription in Mandarin ASR
Jiahong Yuan, Xingyu Cai, Dongji Gao, Renjie Zheng, Liang Huang, Kenneth Church

TL;DR
This paper proposes a two-step approach for Mandarin ASR by decoupling recognition and transcription, achieving state-of-the-art accuracy on the Aishell-1 dataset.
Contribution
It introduces a novel factorization of Mandarin ASR into audio-to-Pinyin and Pinyin-to-Hanzi tasks, improving recognition accuracy.
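The factorization can be pictured as two functions composed in sequence. The sketch below is only an illustration of that shape, not the authors' system: `TOY_LEXICON`, `pinyin_to_hanzi`, `transcribe`, and `fake_recognizer` are hypothetical names, the recognizer is a stub that returns a fixed syllable sequence, and a greedy dictionary lookup over toneless Pinyin stands in for a learned Pinyin-to-Hanzi model.

```python
from typing import Callable, List

# Hypothetical toy lexicon (assumption): toneless Pinyin syllables -> Hanzi.
TOY_LEXICON = {
    ("ni", "hao"): "你好",
    ("zhong", "guo"): "中国",
    ("ni",): "你",
    ("hao",): "好",
}


def pinyin_to_hanzi(syllables: List[str]) -> str:
    """Stage 2 stand-in: greedy longest-match lookup in the toy lexicon."""
    max_len = max(len(key) for key in TOY_LEXICON)
    out = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate span first, then shrink.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            key = tuple(syllables[i:i + n])
            if key in TOY_LEXICON:
                out.append(TOY_LEXICON[key])
                i += n
                break
        else:
            out.append("?")  # syllable not covered by the toy lexicon
            i += 1
    return "".join(out)


def transcribe(audio: bytes, recognizer: Callable[[bytes], List[str]]) -> str:
    """Compose stage 1 (audio -> Pinyin) with stage 2 (Pinyin -> Hanzi)."""
    pinyin = recognizer(audio)
    return pinyin_to_hanzi(pinyin)


def fake_recognizer(audio: bytes) -> List[str]:
    """Stage 1 stub (assumption): a real system would run an acoustic model."""
    return ["ni", "hao", "zhong", "guo"]


result = transcribe(b"", fake_recognizer)
```

Because the two stages only communicate through the Pinyin sequence, either one can be swapped or retrained without touching the other, which is the practical appeal of decoupling.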
Findings
Achieved 3.9% CER (character error rate) on the Aishell-1 dataset
Outperforms previous end-to-end methods
Demonstrates effectiveness of decoupling recognition and transcription
Abstract
Much of the recent literature on automatic speech recognition (ASR) is taking an end-to-end approach. Unlike English where the writing system is closely related to sound, Chinese characters (Hanzi) represent meaning, not sound. We propose factoring audio -> Hanzi into two sub-tasks: (1) audio -> Pinyin and (2) Pinyin -> Hanzi, where Pinyin is a system of phonetic transcription of standard Chinese. Factoring the audio -> Hanzi task in this way achieves 3.9% CER (character error rate) on the Aishell-1 corpus, the best result reported on this dataset so far.
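For readers unfamiliar with the metric, CER is conventionally the character-level edit distance between the hypothesis and the reference, divided by the number of reference characters. The sketch below follows that standard definition; the function names are illustrative, and the exact scoring script behind the 3.9% figure is not given here, so details such as text normalization are assumptions.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum substitutions, deletions, and insertions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# One substituted character in a six-character reference.
example = cer("今天天气很好", "今天天汽很好")
```

Since insertions count as errors but do not lengthen the reference, CER can exceed 100% for very poor hypotheses.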
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Natural Language Processing Techniques
