Multi-Level Modeling Units for End-to-End Mandarin Speech Recognition
Yuting Yang, Binbin Du, Yuke Li

TL;DR
This paper introduces a multi-level modeling approach for Mandarin speech recognition, combining syllable and character units to improve accuracy, demonstrated by promising results on the AISHELL-1 dataset.
Contribution
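It proposes a multi-level modeling framework for Mandarin ASR in which the encoder models syllables and the decoder models characters, with an auxiliary cross-entropy loss on intermediate decoder layers to ease the incremental conversion from syllable features to character features.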
Findings
Achieves CER of 4.1%/4.6% with Conformer/Transformer backbones
Demonstrates improved speech recognition accuracy on AISHELL-1
Validates effectiveness of multi-level units in Mandarin ASR
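For readers unfamiliar with the metric in the findings above, character error rate (CER) is conventionally computed as the edit distance between the reference and hypothesis character sequences, divided by the reference length. The sketch below illustrates that standard definition; it is not the paper's evaluation script, and the function names `edit_distance` and `cer` are chosen here for illustration only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences
    (counts substitutions, insertions, and deletions)."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference symbols into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    if len(ref) == 0:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# One substituted character in a six-character reference.
print(cer("今天天气很好", "今天天汽很好"))
```

Because Python strings iterate over Unicode code points, each Chinese character counts as one unit here without any extra tokenization.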
Abstract
The choice of modeling units is crucial for automatic speech recognition (ASR) tasks. In Mandarin, Chinese characters represent meaning but are not directly related to pronunciation, so using only written characters as modeling units is insufficient to capture speech features. In this paper, we present a novel method that uses multi-level modeling units and integrates multi-level information for Mandarin speech recognition. Specifically, the encoder block uses syllables as modeling units, and the decoder block works with character-level modeling units. To facilitate the incremental conversion from syllable features to character features, we design an auxiliary task that applies cross-entropy (CE) loss to intermediate decoder layers. During inference, the input feature sequences are converted into syllable sequences by the encoder block…
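The auxiliary task described in the abstract, CE loss applied to intermediate decoder layers, can be pictured with a small sketch. This is a minimal illustration under stated assumptions, not the authors' implementation. The abstract does not say how the intermediate losses are weighted or combined, so the following are all hypothetical: the names `multi_level_decoder_loss`, `aux_layers`, and `aux_weight`, the averaging over layers, the interpolation with the final-layer loss, and the use of character targets at every layer.

```python
import math


def cross_entropy(logits, target):
    """CE for one position: -log softmax(logits)[target], computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def sequence_ce(layer_logits, targets):
    """Mean CE over all positions of one decoder layer's output logits."""
    losses = [cross_entropy(l, t) for l, t in zip(layer_logits, targets)]
    return sum(losses) / len(losses)


def multi_level_decoder_loss(per_layer_logits, targets, aux_layers, aux_weight=0.3):
    """Combine the final-layer CE with CE on selected intermediate layers.

    per_layer_logits: one entry per decoder layer, each a list of
        per-position logit vectors projected to the character vocabulary.
    aux_layers: indices of intermediate layers that receive the auxiliary loss.
    aux_weight: hypothetical interpolation weight (not taken from the paper).
    """
    final_loss = sequence_ce(per_layer_logits[-1], targets)
    if not aux_layers:
        return final_loss
    aux_loss = sum(sequence_ce(per_layer_logits[i], targets)
                   for i in aux_layers) / len(aux_layers)
    return (1 - aux_weight) * final_loss + aux_weight * aux_loss


# Toy example: 2 decoder layers, 2 target positions, vocabulary of 3 characters.
layers = [
    [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],  # intermediate layer: uninformative
    [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0]],  # final layer: confident and correct
]
print(multi_level_decoder_loss(layers, targets=[0, 1], aux_layers=[0]))
```

In a setup like this, supervising intermediate layers gives earlier decoder layers a direct training signal toward the character targets, which is one way to read the "incremental conversion" the abstract mentions.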
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Phonetics and Phonology Research
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Dense Connections · Dropout · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Residual Connection
