Phonetic-assisted Multi-Target Units Modeling for Improving Conformer-Transducer ASR system

Li Li; Dongxing Xu; Haoran Wei; Yanhua Long

arXiv:2211.01571·eess.AS·July 10, 2023

TL;DR

This paper introduces a phonetic-assisted multi-target units modeling approach that enhances Conformer-Transducer ASR systems by integrating phonetic and text-induced units, leading to significant WER reductions across standard and accented speech datasets.

Contribution

It proposes a novel PMU framework combining PASM and BPE units with new training architectures to improve end-to-end speech recognition performance.

Findings

01. Significant WER reductions on LibriSpeech and accented ASR tasks.

02. PMU outperforms conventional BPE in target modeling.

03. Effective multi-task training frameworks improve recognition accuracy.

Abstract

Choosing effective target modeling units is important and has long been a concern in end-to-end automatic speech recognition (ASR). In this work, we propose a phonetic-assisted multi-target units (PMU) modeling approach to enhance the Conformer-Transducer ASR system in a progressive representation learning manner. Specifically, PMU first uses pronunciation-assisted subword modeling (PASM) and byte pair encoding (BPE) to produce phonetic-induced and text-induced target units separately. Then, three new frameworks are investigated to enhance the acoustic encoder: a basic PMU, a paraCTC, and a pcaCTC. They integrate the PASM and BPE units at different levels for CTC and transducer multi-task training. Experiments on both LibriSpeech and accented ASR tasks show that the proposed PMU significantly outperforms the conventional BPE; it reduces the WER of LibriSpeech…
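To make the text-induced side of the unit inventory concrete, the following is a minimal sketch of how BPE learns merge rules from character sequences and then segments a word into subword units. It is a toy illustration, not the authors' implementation: the function names (`learn_bpe`, `merge_pair`, `segment`), the tiny word list, and the tie-breaking rule are all assumptions made for this example, and the phonetic-induced PASM units are not modeled here.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merge rules from a list of words (toy version)."""
    # Each word starts as a tuple of single characters, weighted by its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        # Pick the most frequent pair; ties are broken lexicographically
        # (an assumption made here only to keep the result deterministic).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Apply the learned merges, in order, to split a word into subword units."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    toy_words = ["low", "low", "lower", "lowest"]  # hypothetical toy corpus
    merges = learn_bpe(toy_words, num_merges=2)
    print(merges)
    print(segment("lowest", merges))
```

In a multi-target setup like the one the abstract describes, units of this kind would serve as one label sequence for the acoustic model, with a second, pronunciation-aware segmentation supplying the other.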

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Byte Pair Encoding