Decoupling recognition and transcription in Mandarin ASR

Jiahong Yuan; Xingyu Cai; Dongji Gao; Renjie Zheng; Liang Huang; Kenneth Church

arXiv:2108.01129·cs.CL·August 4, 2021·1 cites

TL;DR

This paper proposes a two-step approach to Mandarin ASR that decouples recognition (audio to Pinyin) from transcription (Pinyin to Hanzi), reaching 3.9% CER on Aishell-1, the best result reported on that dataset at the time of publication.

Contribution

It introduces a novel factorization of Mandarin ASR into audio-to-Pinyin and Pinyin-to-Hanzi tasks, improving recognition accuracy.
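The factorization can be pictured as two functions composed in sequence. The following is a minimal toy sketch, not the authors' models: the `CANDIDATES` lexicon, the function names, and the stub recognizer are all hypothetical and exist only to illustrate why the second step needs context. A single toned Pinyin syllable such as `shi4` corresponds to several different Hanzi, so the Pinyin-to-Hanzi step is a disambiguation problem rather than a lookup.

```python
from typing import Callable, Dict, List

# Hypothetical toy lexicon: toned Pinyin syllable -> possible Hanzi.
# A real system learns this mapping from data; these entries are illustrative only.
CANDIDATES: Dict[str, List[str]] = {
    "shi4": ["是", "事", "市", "试"],  # one sound, several characters
    "ma1": ["妈"],
    "ma3": ["马"],
}


def pinyin_to_hanzi_candidates(syllables: List[str]) -> List[List[str]]:
    """Return the candidate Hanzi for each syllable ("?" if unknown)."""
    return [CANDIDATES.get(s, ["?"]) for s in syllables]


def two_step(audio: bytes,
             recognize: Callable[[bytes], List[str]],
             convert: Callable[[List[str]], List[List[str]]]) -> List[List[str]]:
    """Compose step 1 (audio -> Pinyin) with step 2 (Pinyin -> Hanzi)."""
    return convert(recognize(audio))


def stub_recognizer(audio: bytes) -> List[str]:
    """Stand-in for an acoustic model; always returns the same syllables."""
    return ["ma1", "shi4"]


if __name__ == "__main__":
    print(two_step(b"", stub_recognizer, pinyin_to_hanzi_candidates))
```

In this sketch the second step returns every candidate; an actual Pinyin-to-Hanzi model would use the surrounding syllables to pick one character per position.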

Findings

01 Achieved 3.9% CER on the Aishell-1 dataset

02 Outperforms previous end-to-end methods

03 Demonstrates the effectiveness of decoupling recognition and transcription

Abstract

Much of the recent literature on automatic speech recognition (ASR) is taking an end-to-end approach. Unlike English, where the writing system is closely related to sound, Chinese characters (Hanzi) represent meaning, not sound. We propose factoring audio -> Hanzi into two sub-tasks: (1) audio -> Pinyin and (2) Pinyin -> Hanzi, where Pinyin is a system of phonetic transcription of standard Chinese. Factoring the audio -> Hanzi task in this way achieves 3.9% CER (character error rate) on the Aishell-1 corpus, the best result reported on this dataset so far.
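For readers unfamiliar with the metric, character error rate is conventionally computed as the character-level edit distance between the reference and the hypothesis, divided by the reference length. The sketch below assumes that standard definition; the abstract does not describe the scoring script actually used, and the function names here are illustrative.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance counting substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One wrong character out of six in the reference.
    print(cer("今天天气很好", "今天天汽很好"))
```

Because Hanzi text has no word boundaries, scoring at the character level avoids depending on a word segmenter, which is why CER rather than word error rate is the usual metric for Mandarin.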

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Natural Language Processing Techniques