Efficient Training for Cross-lingual Speech Language Models
Yan Zhou, Qingkai Fang, Yun Hong, Yang Feng

TL;DR
This paper presents CSLM, an efficient training method for cross-lingual speech LLMs built on discrete speech tokens. It aligns speech and text across multiple languages through continual pre-training and speech-text interleaved instruction fine-tuning, improving generation quality and reducing latency.
Contribution
It introduces a new alignment method for speech and text in multiple languages, enabling effective cross-lingual speech LLMs without extensive speech data.
Findings
CSLM demonstrates strong cross-modal alignment capabilities.
It improves generation quality and reduces latency.
It exhibits good scalability across languages.
Abstract
Currently, large language models (LLMs) predominantly focus on the text modality. To enable more natural human-AI interaction, speech LLMs are emerging, but building effective end-to-end speech LLMs remains challenging due to limited data and the difficulty in expanding to more languages. In this paper, we introduce Cross-lingual Speech Language Model (CSLM), an efficient training method for cross-lingual speech LLMs based on discrete speech tokens. We propose a novel alignment strategy that achieves cross-modal and cross-lingual alignment through continual pre-training. By conducting instruction fine-tuning following a speech-text interleaved chain-of-modality generation process, we enhance modal alignment at a finer granularity, thereby improving generation quality and reducing latency. CSLM aligns different modalities and languages simultaneously without the need for massive speech…
