CAMEL: Cross-Attention Enhanced Mixture-of-Experts and Language Bias for Code-Switching Speech Recognition

He Wang; Xucheng Wan; Naijun Zheng; Kai Liu; Huan Zhou; Guojian Li; Lei Xie

arXiv:2412.12760 · cs.SD · January 10, 2025

Open Access

TL;DR

CAMEL introduces a novel cross-attention-based method to enhance language-specific speech representations and incorporate language bias, significantly improving code-switching speech recognition accuracy across multiple datasets.

Contribution

The paper proposes CAMEL, a cross-attention enhanced MoE and language bias approach, advancing beyond simple fusion methods for better code-switching ASR performance.

Findings

01. Achieves state-of-the-art results on SEAME, ASRU200, and ASRU700+LibriSpeech460 datasets.

02. Effectively models language-specific speech representations with cross-attention.

03. Incorporates language bias from the LD decoder to improve transcription accuracy.

Abstract

Code-switching automatic speech recognition (ASR) aims to accurately transcribe speech that contains two or more languages. To better capture language-specific speech representations and address language confusion in code-switching ASR, the mixture-of-experts (MoE) architecture and an additional language diarization (LD) decoder are commonly employed. However, most research still relies on simple operations such as weighted summation or concatenation to fuse language-specific speech representations, leaving significant room to improve how language bias information is integrated. In this paper, we introduce CAMEL, a cross-attention-based MoE and language bias approach for code-switching ASR. Specifically, after each MoE layer, we fuse language-specific speech representations with cross-attention, leveraging its strong contextual modeling abilities. Additionally, we…
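The contrast the abstract draws between weighted-summation fusion and cross-attention fusion can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the single-head attention, the residual connection, the final sum of the two enhanced streams, and every name below (`cross_attention_fusion`, `weighted_sum_fusion`, `h_a`, `h_b`, `params`) are assumptions chosen for illustration, and the weights are random rather than trained.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(query, context, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention.

    query:   (T, d) frames of one language-specific stream
    context: (T, d) frames of the other language-specific stream
    """
    q = query @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (T, T) frame-to-frame affinities
    return softmax(scores, axis=-1) @ v


def weighted_sum_fusion(h_a, h_b, gate):
    """Baseline fusion: per-frame convex combination, gate has shape (T, 1)."""
    return gate * h_a + (1.0 - gate) * h_b


def cross_attention_fusion(h_a, h_b, params):
    """Hypothetical fusion: each stream attends to the other, with a residual
    connection, and the two enhanced streams are then summed (an assumption)."""
    a_enh = h_a + cross_attention(h_a, h_b, *params["a_from_b"])
    b_enh = h_b + cross_attention(h_b, h_a, *params["b_from_a"])
    return a_enh + b_enh


rng = np.random.default_rng(0)
T, d = 6, 8  # illustrative sizes: 6 frames, 8-dimensional features
h_a = rng.standard_normal((T, d))  # stand-in for one language expert's output
h_b = rng.standard_normal((T, d))  # stand-in for the other expert's output

# Random projection matrices (w_q, w_k, w_v) for each attention direction.
params = {
    name: tuple(rng.standard_normal((d, d)) * 0.1 for _ in range(3))
    for name in ("a_from_b", "b_from_a")
}

baseline = weighted_sum_fusion(h_a, h_b, np.full((T, 1), 0.5))
fused = cross_attention_fusion(h_a, h_b, params)
print(baseline.shape, fused.shape)
```

In the weighted-sum baseline each output frame depends only on the two expert outputs at that same frame, whereas in the cross-attention variant each frame of one stream can draw on every frame of the other, which is the contextual modeling ability the abstract refers to.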

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems