Enhancing Code-Switching ASR Leveraging Non-Peaky CTC Loss and Deep Language Posterior Injection

Tzu-Ting Yang; Hsin-Wei Wang; Yi-Cheng Wang; Berlin Chen

arXiv:2412.08651·eess.AS·December 13, 2024

TL;DR

This paper improves code-switching automatic speech recognition (ASR) by combining language identification (LID), a language boundary alignment loss, and deep language posterior interaction, so that end-to-end (E2E) ASR systems handle language switches more reliably.

Contribution

The paper proposes a method that combines LID embeddings, a boundary alignment loss, and deep language posterior interaction to improve code-switching ASR performance.

Findings

01. Outperforms the prior D-MoE method on the SEAME corpus

02. Enriches the encoder with detailed language information

03. Improves language handling in end-to-end ASR systems

Abstract

Code-switching, in which multilingual speakers alternate between languages during a conversation, still poses significant challenges to end-to-end (E2E) automatic speech recognition (ASR) systems because of both acoustic and semantic confusion. ASR systems struggle to handle rapid alternation between languages, which often leads to significant performance degradation. Our main contributions are at least threefold. First, we incorporate language identification (LID) information into several intermediate layers of the encoder, aiming to enrich output embeddings with more detailed language information. Second, through the novel application of a language boundary alignment loss, the subsequent ASR modules are able to use the knowledge of internal language posteriors more effectively. Third, we explore the feasibility of using language…
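The first contribution, feeding LID information back into intermediate encoder layers, can be illustrated with a small sketch. The abstract is truncated and does not specify the mechanism, so everything below is an assumption made for illustration: the additive residual form, the function name `inject_language_posterior`, the projection matrices `w_lid` and `w_back`, and the choice of three language classes are all hypothetical and are not taken from the paper.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(vec, weight):
    """Multiply a length-d_in vector by a d_in x d_out matrix (list of rows)."""
    d_out = len(weight[0])
    return [sum(v * w_row[j] for v, w_row in zip(vec, weight)) for j in range(d_out)]


def inject_language_posterior(hidden, w_lid, w_back):
    """Hypothetical injection step for one intermediate encoder layer.

    hidden: T x D frame-level hidden states
    w_lid:  D x L projection to language logits (assumed)
    w_back: L x D projection of posteriors back to hidden size (assumed)
    Returns the updated hidden states and the per-frame language posteriors.
    """
    new_hidden, posteriors = [], []
    for frame in hidden:
        post = softmax(matvec(frame, w_lid))   # per-frame language posterior
        back = matvec(post, w_back)            # map posterior to hidden size
        new_hidden.append([h + b for h, b in zip(frame, back)])  # residual add
        posteriors.append(post)
    return new_hidden, posteriors


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
T, D, L = 5, 8, 3  # frames, hidden size, language classes (illustrative values)
hidden = random_matrix(T, D, rng)
w_lid = random_matrix(D, L, rng)
w_back = random_matrix(L, D, rng)
new_hidden, posteriors = inject_language_posterior(hidden, w_lid, w_back)
```

In this sketch the hidden states keep their shape, so the step could in principle be repeated at several layers; whether the paper uses an additive form, a different fusion, or trained rather than random projections is not stated in the text above.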

Taxonomy

Topics: Speech Recognition and Synthesis