Enhancing CTC-based speech recognition with diverse modeling units

Shiyi Han; Zhihong Lei; Mingbin Xu; Xingyu Na; Zhen Huang

arXiv:2406.03274 · eess.AS · June 12, 2024

Open Access

TL;DR

This paper proposes a joint training method for end-to-end speech recognition models that integrates diverse modeling units like phonemes and graphemes, leading to improved accuracy and insights into model synergy.

Contribution

It introduces an efficient joint training approach combining different modeling units, enhancing ASR performance beyond system combination effects.
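
As a rough illustration of what joint training with two unit inventories could look like, the sketch below computes a CTC loss for a grapheme target and a phoneme target and combines them. The weighted sum, the `weight` parameter, and the function names are assumptions made for illustration; this page does not state how the paper combines the objectives. The CTC loss itself is the standard forward algorithm, written in plain Python for a single utterance.

```python
import math

NEG_INF = float("-inf")


def log_add(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def ctc_neg_log_likelihood(log_probs, target, blank=0):
    """CTC loss for one utterance via the forward algorithm.

    log_probs: list of T frames, each a list of per-unit log-probabilities.
    target: list of unit ids without blanks.
    """
    # Interleave blanks: [blank, u1, blank, u2, ..., blank].
    ext = [blank]
    for unit in target:
        ext += [unit, blank]
    size = len(ext)

    alpha = [NEG_INF] * size
    alpha[0] = log_probs[0][blank]
    if size > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, len(log_probs)):
        new = [NEG_INF] * size
        for s in range(size):
            acc = alpha[s]  # stay on the same symbol
            if s >= 1:
                acc = log_add(acc, alpha[s - 1])  # advance by one
            # Skip over a blank only between two different non-blank units.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                acc = log_add(acc, alpha[s - 2])
            if acc != NEG_INF:
                new[s] = acc + log_probs[t][ext[s]]
        alpha = new

    total = alpha[-1]
    if size > 1:
        total = log_add(total, alpha[-2])
    return -total


def joint_ctc_loss(grapheme_lp, grapheme_tgt, phoneme_lp, phoneme_tgt, weight=0.5):
    """Hypothetical joint objective: weighted sum of two CTC losses.

    Both log-probability sequences are assumed to come from output heads
    on top of one shared encoder (an assumption, not the paper's stated design).
    """
    g = ctc_neg_log_likelihood(grapheme_lp, grapheme_tgt)
    p = ctc_neg_log_likelihood(phoneme_lp, phoneme_tgt)
    return (1.0 - weight) * g + weight * p


# Toy input: 2 frames, 2 units (blank and unit 1), uniform posteriors.
uniform = [[math.log(0.5), math.log(0.5)] for _ in range(2)]
single_loss = ctc_neg_log_likelihood(uniform, [1])
joint_loss = joint_ctc_loss(uniform, [1], uniform, [1], weight=0.5)
```

In a real system the two heads would emit different vocabularies and the gradient of the combined loss would flow into the shared encoder; the toy input here only checks the arithmetic.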

Findings

01 Joint training with diverse units improves accuracy

02 Synergistic use of phoneme and grapheme models enhances robustness

03 Provides insights into optimal heterogeneous unit integration

Abstract

In recent years, end-to-end (E2E) automatic speech recognition (ASR) models have evolved remarkably, largely due to advances in deep learning architectures such as the transformer. On top of E2E systems, researchers have achieved substantial accuracy improvements by rescoring an E2E model's N-best hypotheses with a phoneme-based model. This raises an interesting question: where do the improvements come from, other than the system combination effect? We examine the underlying mechanisms driving these gains and propose an efficient joint training approach, in which E2E models are trained jointly with diverse modeling units. This methodology not only aligns the strengths of both phoneme- and grapheme-based models but also reveals that using these diverse modeling units in a synergistic way can significantly enhance model accuracy. Our findings offer new insights into the optimal…
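
The N-best rescoring setup mentioned in the abstract can be pictured with a minimal sketch: each hypothesis from the E2E model gets a second score from a phoneme-based model, and the two log scores are interpolated before re-ranking. The function name `rescore_nbest`, the interpolation `weight`, and the toy scores are all hypothetical; the abstract does not describe the scoring formula.

```python
def rescore_nbest(nbest, phoneme_score, weight=0.3):
    """Re-rank hypotheses by interpolating two log scores.

    nbest: list of (hypothesis, e2e_log_score) pairs.
    phoneme_score: callable mapping a hypothesis to a log score
        from a phoneme-based model (assumed to exist).
    weight: interpolation weight on the phoneme score (an assumption).
    """
    rescored = [
        (hyp, (1.0 - weight) * e2e + weight * phoneme_score(hyp))
        for hyp, e2e in nbest
    ]
    # Highest combined log score first.
    return sorted(rescored, key=lambda item: item[1], reverse=True)


# Toy example with made-up scores.
nbest = [("recognize speech", -1.0), ("wreck a nice beach", -1.2)]
toy_phoneme_scores = {"recognize speech": -5.0, "wreck a nice beach": -1.0}
ranked = rescore_nbest(nbest, toy_phoneme_scores.get, weight=0.5)
```

With these made-up numbers the second hypothesis overtakes the first after rescoring, which is the kind of system combination effect the abstract sets out to separate from the gains of joint training.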

Taxonomy

Topics: Speech Recognition and Synthesis · Advanced Data Compression Techniques

Methods: ALIGN