Enhancing CTC-based speech recognition with diverse modeling units
Shiyi Han, Zhihong Lei, Mingbin Xu, Xingyu Na, Zhen Huang

TL;DR
This paper proposes a joint training method for end-to-end speech recognition models that integrates diverse modeling units like phonemes and graphemes, leading to improved accuracy and insights into model synergy.
Contribution
It introduces an efficient joint training approach combining different modeling units, enhancing ASR performance beyond system combination effects.
Findings
Joint training with diverse units improves accuracy
Synergistic use of phoneme and grapheme models enhances robustness
Provides insights into optimal heterogeneous unit integration
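To make "joint training with diverse units" concrete, here is a minimal pure-Python sketch of one possible objective: a weighted sum of two CTC losses, one over graphemes and one over phonemes. It is an illustration under assumptions, not the paper's stated formulation. The names `ctc_nll` and `joint_ctc_loss`, the interpolation `weight`, and the premise that both posterior sequences come from output heads on a shared encoder are all hypothetical.

```python
import math

NEG_INF = float("-inf")


def logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) for a handful of values."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ctc_nll(log_probs, target, blank=0):
    """CTC negative log-likelihood via the standard forward recursion.

    log_probs: list of T frames, each a list of per-unit log-probabilities.
    target:    list of unit ids without blanks.
    """
    # Interleave blanks: [blank, u1, blank, u2, ..., blank]
    ext = [blank]
    for unit in target:
        ext += [unit, blank]
    size = len(ext)

    alpha = [NEG_INF] * size
    alpha[0] = log_probs[0][blank]
    if size > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, len(log_probs)):
        new = [NEG_INF] * size
        for s in range(size):
            candidates = [alpha[s]]
            if s >= 1:
                candidates.append(alpha[s - 1])
            # Skipping a blank is allowed only between two different units.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                candidates.append(alpha[s - 2])
            new[s] = logsumexp(*candidates) + log_probs[t][ext[s]]
        alpha = new

    total = logsumexp(alpha[-1], alpha[-2]) if size > 1 else alpha[-1]
    return -total


def joint_ctc_loss(grapheme_log_probs, grapheme_target,
                   phoneme_log_probs, phoneme_target, weight=0.5):
    """Hypothetical joint objective: interpolate the two unit-specific CTC losses."""
    grapheme_loss = ctc_nll(grapheme_log_probs, grapheme_target)
    phoneme_loss = ctc_nll(phoneme_log_probs, phoneme_target)
    return weight * grapheme_loss + (1.0 - weight) * phoneme_loss
```

In a real system the two losses would be backpropagated through a neural encoder by an automatic differentiation framework; the sketch only shows how the two unit inventories could contribute to a single scalar loss.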
Abstract
In recent years, the evolution of end-to-end (E2E) automatic speech recognition (ASR) models has been remarkable, largely due to advances in deep learning architectures such as the Transformer. Building on E2E systems, researchers have achieved substantial accuracy improvements by rescoring an E2E model's N-best hypotheses with a phoneme-based model. This raises an interesting question: where do the improvements come from, beyond the system combination effect? We examine the underlying mechanisms driving these gains and propose an efficient joint training approach in which E2E models are trained jointly with diverse modeling units. This methodology not only aligns the strengths of phoneme-based and grapheme-based models but also reveals that using these diverse modeling units in a synergistic way can significantly enhance model accuracy. Our findings offer new insights into the optimal…
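The N-best rescoring setup mentioned in the abstract can be sketched as a log-linear interpolation of two scores per hypothesis. This is a generic illustration, not the configuration used in the paper. The function name `rescore_nbest`, the interpolation `weight`, and the toy scores below are all assumptions made for the example.

```python
def rescore_nbest(nbest, phoneme_score, weight=0.5):
    """Re-rank an N-best list with a second model's scores.

    nbest:         list of (hypothesis, e2e_log_score) pairs.
    phoneme_score: callable returning a log score for a hypothesis
                   (a stand-in for a phoneme-based model).
    weight:        hypothetical interpolation weight for the second model.
    """
    rescored = [
        (hyp, (1.0 - weight) * e2e + weight * phoneme_score(hyp))
        for hyp, e2e in nbest
    ]
    # Highest combined log score first.
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Toy data: the E2E model slightly prefers the wrong hypothesis.
toy_nbest = [("wreck a nice beach", -1.0), ("recognize speech", -1.2)]
toy_phoneme_scores = {"wreck a nice beach": -3.0, "recognize speech": -1.0}

best_hypothesis = rescore_nbest(toy_nbest, toy_phoneme_scores.get)[0][0]
```

With these made-up numbers the second model's score overturns the first-pass ranking. Gains of this kind are the "system combination effect" that the abstract sets out to separate from the benefit of joint training.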
Taxonomy
Topics: Speech Recognition and Synthesis · Advanced Data Compression Techniques
Methods: ALIGN
