Towards Fine-Grained Code-Switch Speech Translation with Semantic Space Alignment
Yan Gao, Yazheng Yang, Zhibin Lan, Yidong Chen, Min Zhang, Daimeng Wei, Derek F. Wong, Jinsong Su

TL;DR
This paper introduces a novel approach to code-switch speech translation by enhancing Large Language Models with a Mixture-of-Experts speech projector, improving semantic modeling and translation accuracy in multilingual scenarios.
Contribution
It proposes a Mixture-of-Experts speech projector with language-specific training and a multi-stage paradigm to improve code-switch speech translation performance.
Findings
Achieves average improvements of 0.86 BLEU and 0.93 COMET over SeamlessM4T.
Shows these gains across multiple datasets and test sets.
Enhanced semantic space alignment improves translation in code-switch scenarios.
Abstract
Code-switching (CS) speech translation (ST) aims to translate speech that alternates between multiple languages into target-language text, posing significant challenges due to the complexity of semantic modeling and the scarcity of CS data. Previous studies mainly rely on the models themselves to implicitly learn semantic representations and resort to costly manual annotations. To mitigate these limitations, we propose enhancing Large Language Models (LLMs) with a Mixture-of-Experts (MoE) speech projector composed of language expert groups, where each group specializes in the semantic space of a specific language for fine-grained speech feature modeling. A language-specific loss and an intra-group load balancing loss are jointly introduced to guide efficient token routing across and within expert groups. Furthermore, we introduce a multi-stage training paradigm that utilizes readily…
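The two-level routing described in the abstract (across language expert groups, then within a group) can be illustrated with a small sketch. The abstract gives no formulas, so everything concrete here is an assumption made for illustration: top-1 routing at both levels, a cross-entropy form for the language-specific loss, a Switch-Transformer-style form for the intra-group load balancing loss, and all function and variable names. It is not the authors' implementation.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def route(token, group_router, expert_routers):
    """Assumed top-1 routing: pick a language group, then an expert inside it."""
    group_probs = softmax([dot(w, token) for w in group_router])
    g = max(range(len(group_probs)), key=group_probs.__getitem__)
    expert_probs = softmax([dot(w, token) for w in expert_routers[g]])
    e = max(range(len(expert_probs)), key=expert_probs.__getitem__)
    return g, e, group_probs, expert_probs

def language_loss(group_probs_per_token, lang_labels):
    """Assumed form: cross-entropy between group-router output and language label."""
    nll = [-math.log(p[y] + 1e-9) for p, y in zip(group_probs_per_token, lang_labels)]
    return sum(nll) / len(nll)

def intra_group_balance_loss(expert_choices, expert_probs_per_token, num_experts):
    """Assumed form: num_experts * sum_i(fraction_i * mean_prob_i) within one group."""
    n = len(expert_choices)
    if n == 0:
        return 0.0
    frac = [sum(1 for e in expert_choices if e == i) / n for i in range(num_experts)]
    mean_prob = [sum(p[i] for p in expert_probs_per_token) / n for i in range(num_experts)]
    return num_experts * sum(f * p for f, p in zip(frac, mean_prob))

# Toy run with random weights; sizes and labels are hypothetical.
random.seed(0)
DIM, NUM_GROUPS, EXPERTS_PER_GROUP = 4, 2, 3

def rand_vec():
    return [random.gauss(0.0, 1.0) for _ in range(DIM)]

group_router = [rand_vec() for _ in range(NUM_GROUPS)]
expert_routers = [[rand_vec() for _ in range(EXPERTS_PER_GROUP)] for _ in range(NUM_GROUPS)]
tokens = [rand_vec() for _ in range(8)]
lang_labels = [0, 0, 0, 1, 1, 1, 0, 0]  # per-token language ids (hypothetical)

routed = [route(t, group_router, expert_routers) for t in tokens]
l_lang = language_loss([r[2] for r in routed], lang_labels)
l_bal = 0.0
for g in range(NUM_GROUPS):
    in_group = [r for r in routed if r[0] == g]  # tokens sent to group g
    l_bal += intra_group_balance_loss(
        [r[1] for r in in_group], [r[3] for r in in_group], EXPERTS_PER_GROUP
    )
print("language loss:", l_lang, "balance loss:", l_bal)
```

Under the assumed balance term, a group whose tokens are spread evenly over its experts scores 1.0, while a group that sends every token to a single expert with full confidence scores the number of experts, so minimizing the term pushes routing toward even use within each group.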
