Boosting Code-Switching ASR with Mixture of Experts Enhanced Speech-Conditioned LLM

Fengrun Zhang; Wang Geng; Hukai Huang; Yahui Shan; Cheng Yi; He Qu

arXiv:2409.15905·cs.SD·November 1, 2024

TL;DR

This paper presents a speech-conditioned LLM with a Mixture of Experts (MoE) based connector and an Insertion and Deletion of Interruption Token (IDIT) mechanism for code-switching ASR, which the authors report outperforms existing models.

Contribution

It introduces an MoE-based connector and the IDIT mechanism for handling code-switching in speech recognition, along with a two-stage progressive training strategy.

Findings

01 Outperforms state-of-the-art models in code-switching ASR tasks

02 Effective integration of MoE architecture with LLM for multilingual speech recognition

03 Two-stage training strategy improves model adaptability and accuracy

Abstract

In this paper, we introduce a speech-conditioned Large Language Model (LLM) integrated with a Mixture of Experts (MoE) based connector to address the challenge of Code-Switching (CS) in Automatic Speech Recognition (ASR). Specifically, we propose an Insertion and Deletion of Interruption Token (IDIT) mechanism to better transfer the text generation ability of the LLM to the speech recognition task. We also present a connector with an MoE architecture that manages multiple languages efficiently. To further enhance the collaboration of multiple experts and leverage the understanding capabilities of the LLM, we propose a two-stage progressive training strategy: 1) The connector is unfrozen and trained with language-specialized experts to map speech representations to the text space. 2) The connector and LLM LoRA adaptor are trained with the proposed IDIT mechanism and all experts are activated to learn…
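
The two routing regimes named in the abstract (language-specialized experts first, then all experts activated) can be illustrated with a toy connector. This is a minimal sketch under stated assumptions, not the authors' implementation: the class `MoEConnector`, the hard language-label routing used for stage 1, the softmax router used for stage 2, the single linear layer per expert, and the language tags `"zh"` and `"en"` are all hypothetical choices made for illustration, since the abstract does not specify them.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a (rows x cols) list-of-lists matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rand_matrix(rows, cols, rng):
    """Small random weights; stands in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class MoEConnector:
    """Toy connector mapping one speech-encoder frame to an LLM-sized vector.

    Hypothetical design: one linear expert per language plus a linear router.
    """

    def __init__(self, in_dim, out_dim, languages, seed=0):
        rng = random.Random(seed)
        self.out_dim = out_dim
        self.languages = list(languages)
        self.experts = [rand_matrix(out_dim, in_dim, rng) for _ in self.languages]
        self.router = rand_matrix(len(self.languages), in_dim, rng)

    def gate(self, frame, lang=None):
        """Return one mixing weight per expert.

        lang given -> one-hot weights (assumed stage-1 style: only the
                      expert for that language is used).
        lang None  -> softmax router weights (assumed stage-2 style: all
                      experts contribute).
        """
        if lang is not None:
            return [1.0 if name == lang else 0.0 for name in self.languages]
        return softmax(matvec(self.router, frame))

    def forward(self, frame, lang=None):
        """Weighted sum of expert outputs for a single frame."""
        weights = self.gate(frame, lang)
        out = [0.0] * self.out_dim
        for weight, expert in zip(weights, self.experts):
            if weight == 0.0:
                continue  # skip experts that are switched off
            projected = matvec(expert, frame)
            out = [o + weight * p for o, p in zip(out, projected)]
        return out
```

Calling `forward(frame, lang="zh")` uses only the expert assigned to that tag, while `forward(frame)` blends every expert with router weights that sum to one, so a code-switched utterance is not forced through a single language path.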

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Natural Language Processing Techniques

Methods: Mixture of Experts