Multi-Level Modeling Units for End-to-End Mandarin Speech Recognition

Yuting Yang; Binbin Du; Yuke Li

arXiv:2205.11998 · cs.CL · October 19, 2022

Open Access

TL;DR

This paper introduces a multi-level modeling approach for Mandarin speech recognition, combining syllable and character units to improve accuracy, demonstrated by promising results on the AISHELL-1 dataset.

Contribution

It proposes a multi-level modeling framework for Mandarin ASR that uses syllable units in the encoder and character units in the decoder, with an auxiliary task that supports incremental conversion from syllable features to character features.

Findings

01

Achieves CER of 4.1%/4.6% with Conformer/Transformer backbones

02

Demonstrates improved speech recognition accuracy on AISHELL-1

03

Validates effectiveness of multi-level units in Mandarin ASR

Abstract

The choice of modeling units is crucial for automatic speech recognition (ASR) tasks. In Mandarin scenarios, Chinese characters represent meaning but are not directly related to pronunciation. Thus, using only the written form of Chinese characters as modeling units is insufficient to capture speech features. In this paper, we present a novel method based on multi-level modeling units, which integrates multi-level information for Mandarin speech recognition. Specifically, the encoder block uses syllables as modeling units, and the decoder block uses character-level modeling units. To facilitate the incremental conversion from syllable features to character features, we design an auxiliary task that applies cross-entropy (CE) loss to intermediate decoder layers. During inference, the input feature sequences are converted into syllable sequences by the encoder block…
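The auxiliary task described in the abstract can be illustrated with a minimal sketch. It assumes that each decoder layer's output is projected to character logits, and that the auxiliary loss is the mean CE over a chosen set of intermediate layers, added to the final-layer CE with a fixed weight. The names `multi_level_loss`, `aux_layers`, and `aux_weight`, and the value 0.5, are hypothetical; the abstract does not specify which layers are supervised or how the losses are weighted, and the encoder's syllable-level loss is not shown here.

```python
import math


def cross_entropy(logits, targets):
    """Mean token-level cross-entropy for one sequence.

    logits: list of per-token score lists (one score per character class).
    targets: list of target character indices, one per token.
    """
    total = 0.0
    for scores, target in zip(logits, targets):
        # Numerically stable log-sum-exp over the class scores.
        peak = max(scores)
        log_z = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_z - scores[target]
    return total / len(targets)


def multi_level_loss(layer_logits, char_targets, aux_layers, aux_weight=0.5):
    """Final-layer CE plus a weighted mean CE over intermediate decoder layers.

    layer_logits: character logits per decoder layer, last entry = final layer.
    aux_layers: indices of intermediate layers to supervise (an assumption).
    aux_weight: weight of the auxiliary term (an assumption, not from the paper).
    """
    final_loss = cross_entropy(layer_logits[-1], char_targets)
    aux_losses = [cross_entropy(layer_logits[i], char_targets) for i in aux_layers]
    aux_loss = sum(aux_losses) / len(aux_losses) if aux_losses else 0.0
    return final_loss + aux_weight * aux_loss


# Toy usage: a 2-layer decoder, 2 target characters, vocabulary of 4.
layer_0 = [[0.2, 0.1, 0.0, -0.1], [0.0, 0.3, 0.1, 0.0]]  # intermediate layer
layer_1 = [[0.1, 2.0, 0.0, -0.5], [0.0, 0.1, 1.5, 0.2]]  # final layer
print(multi_level_loss([layer_0, layer_1], [1, 2], aux_layers=[0]))
```

Supervising intermediate layers with the same character targets is one plausible way to push the decoder to move gradually from syllable-level features toward character-level predictions; the paper itself should be consulted for the exact formulation.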

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Phonetics and Phonology Research

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Dense Connections · Dropout · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Residual Connection