LAE-ST-MoE: Boosted Language-Aware Encoder Using Speech Translation Auxiliary Task for E2E Code-switching ASR

Guodong Ma; Wenxuan Wang; Yuke Li; Yuting Yang; Binbin Du; Haoran Fu

arXiv:2309.16178 · cs.SD · October 10, 2023

Open Access

TL;DR

This paper introduces LAE-ST-MoE, a framework that adds speech translation as an auxiliary task to a language-aware encoder so that code-switching ASR can exploit contextual information between languages, which lowers the mix error on the ASRU 2019 Mandarin-English CS test set.

Contribution

The paper proposes the LAE-ST-MoE model, which incorporates speech translation (ST) tasks into the language-aware encoder (LAE) through a task-based mixture of experts with separate feed-forward networks for the ASR and ST tasks, improving code-switching ASR performance.

Findings

01. Achieves a 9.26% mix error reduction on the CS test set compared to the LAE-based CTC model, with the same decoding parameter.

02. Enables speech translation from CS speech to Mandarin or English text.

03. Demonstrates that integrating speech translation tasks into a code-switching ASR model is effective.
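
To make the "mix error" figure concrete, here is a minimal sketch of how a mixed error rate is commonly computed for Mandarin-English code-switching text: Mandarin is scored per character and English per word, using edit distance over the combined token sequence. The function names (`mixed_tokens`, `mix_error_rate`, `relative_reduction`) are hypothetical, the tokenization regex is a simplification, and the page does not say whether the 9.26% figure is a relative or an absolute reduction; the helper below only shows the relative form as one possible reading.

```python
import re

def mixed_tokens(text):
    """Split text into Mandarin characters and English words (simplified rule)."""
    return re.findall(r"[\u4e00-\u9fff]|[a-z']+", text.lower())

def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def mix_error_rate(ref_text, hyp_text):
    """Token errors divided by the number of reference tokens."""
    ref = mixed_tokens(ref_text)
    hyp = mixed_tokens(hyp_text)
    return edit_distance(ref, hyp) / len(ref)

def relative_reduction(baseline, improved):
    """Relative error reduction (assumed reading of the reported figure)."""
    return (baseline - improved) / baseline

# Illustrative inputs only; these are not values from the paper.
print(mix_error_rate("我想听 hello world", "我想听 hello word"))
print(relative_reduction(10.0, 9.0))
```

In this toy example the reference has five tokens (three characters and two words), and one substituted word counts as a single error.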

Abstract

Recently, to mitigate the confusion between different languages in code-switching (CS) automatic speech recognition (ASR), the conditionally factorized models, such as the language-aware encoder (LAE), explicitly disregard the contextual information between different languages. However, this information may be helpful for ASR modeling. To alleviate this issue, we propose the LAE-ST-MoE framework. It incorporates speech translation (ST) tasks into LAE and utilizes ST to learn the contextual information between different languages. It introduces a task-based mixture of expert modules, employing separate feed-forward networks for the ASR and ST tasks. Experimental results on the ASRU 2019 Mandarin-English CS challenge dataset demonstrate that, compared to the LAE-based CTC, the LAE-ST-MoE model achieves a 9.26% mix error reduction on the CS test with the same decoding parameter. Moreover,…
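
The task-based mixture of experts described in the abstract can be illustrated with a small sketch: a shared sequence of encoder frames is passed through a feed-forward network chosen by the task label rather than by a learned gate. This is an illustration under stated assumptions, not the authors' implementation. The class name `TaskMoE`, the layer sizes, the two-layer ReLU form of each expert, and the random initialization are all hypothetical, and where the module sits inside the LAE is not covered by the abstract.

```python
import random

def make_ffn(d_model, d_hidden, rng):
    """Create weights for a two-layer feed-forward expert (hypothetical sizes)."""
    w1 = [[rng.uniform(-0.1, 0.1) for _ in range(d_hidden)] for _ in range(d_model)]
    w2 = [[rng.uniform(-0.1, 0.1) for _ in range(d_model)] for _ in range(d_hidden)]
    return w1, w2

def matvec(x, w):
    """Multiply a row vector x (length n) by a matrix w (n x m)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def ffn_forward(x, ffn):
    """Apply linear -> ReLU -> linear to one frame."""
    w1, w2 = ffn
    hidden = [max(0.0, h) for h in matvec(x, w1)]
    return matvec(hidden, w2)

class TaskMoE:
    """Routes every frame to the expert that belongs to the given task."""

    def __init__(self, d_model=4, d_hidden=8, tasks=("asr", "st"), seed=0):
        rng = random.Random(seed)
        # One separate feed-forward network per task.
        self.experts = {task: make_ffn(d_model, d_hidden, rng) for task in tasks}

    def forward(self, frames, task):
        ffn = self.experts[task]  # hard routing by task label, no learned gate
        return [ffn_forward(x, ffn) for x in frames]

# Illustrative usage with made-up frame features.
moe = TaskMoE()
frames = [[1.0, 0.5, -0.5, 2.0], [0.2, -1.0, 0.3, 0.7]]
asr_out = moe.forward(frames, "asr")
st_out = moe.forward(frames, "st")
print(len(asr_out), len(asr_out[0]))
```

Because the ASR and ST experts hold separate weights, the same frames yield different representations per task while everything upstream of the module can stay shared.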

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Natural Language Processing Techniques