Phonetic-assisted Multi-Target Units Modeling for Improving Conformer-Transducer ASR system
Li Li, Dongxing Xu, Haoran Wei, Yanhua Long

TL;DR
This paper introduces a phonetic-assisted multi-target units modeling approach that enhances Conformer-Transducer ASR systems by integrating phonetic and text-induced units, leading to significant WER reductions across standard and accented speech datasets.
Contribution
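The paper proposes a phonetic-assisted multi-target units (PMU) framework that combines pronunciation-assisted subword modeling (PASM) units with byte pair encoding (BPE) units, and investigates three training architectures (a basic PMU, paraCTC, and pcaCTC) to improve end-to-end speech recognition.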
Findings
Significant WER reductions on LibriSpeech and accented ASR tasks.
PMU outperforms conventional BPE in target modeling.
CTC and transducer multi-task training frameworks that integrate PASM and BPE units improve recognition accuracy.
Abstract
Choosing effective target modeling units is a long-standing concern in end-to-end automatic speech recognition (ASR). In this work, we propose a phonetic-assisted multi-target units (PMU) modeling approach that enhances the Conformer-Transducer ASR system through progressive representation learning. Specifically, PMU first uses pronunciation-assisted subword modeling (PASM) and byte pair encoding (BPE) to produce phonetic-induced and text-induced target units, respectively. Three new frameworks are then investigated to enhance the acoustic encoder: a basic PMU, a paraCTC, and a pcaCTC. They integrate the PASM and BPE units at different levels for CTC and transducer multi-task training. Experiments on both LibriSpeech and accented ASR tasks show that the proposed PMU significantly outperforms conventional BPE; it reduces the WER of LibriSpeech…
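To make the text-induced side of the abstract concrete, here is a minimal sketch of how generic BPE learns subword units from word frequencies. It is a toy illustration of the standard BPE merge procedure only; it is not the paper's implementation and does not cover PASM. The function names `learn_bpe` and `apply_bpe`, the toy corpus, and the tie-breaking rule are all assumptions made for this example.

```python
from collections import Counter


def learn_bpe(word_freqs, num_merges):
    """Learn a list of BPE merges from a {word: count} dict (toy version)."""
    # Start with every word split into single characters.
    vocab = {tuple(word): count for word, count in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        # Most frequent pair wins; ties go to the lexicographically
        # largest pair (an arbitrary choice, made only for determinism).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab = {}
        for symbols, count in vocab.items():
            key = tuple(_merge_pair(list(symbols), best))
            new_vocab[key] = new_vocab.get(key, 0) + count
        vocab = new_vocab
    return merges


def _merge_pair(symbols, pair):
    """Fuse each left-to-right occurrence of `pair` in a symbol list."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def apply_bpe(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = _merge_pair(symbols, pair)
    return symbols


# Hypothetical toy corpus, for illustration only.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(corpus, num_merges=3)
print(merges)                       # [('s', 't'), ('e', 'st'), ('o', 'w')]
print(apply_bpe("lowest", merges))  # ['l', 'ow', 'est']
```

Because the merges are driven purely by character co-occurrence counts in text, the resulting units carry no pronunciation information, which is the gap the phonetic-induced PASM units are meant to address.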
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Byte Pair Encoding
