Efficient Training for Cross-lingual Speech Language Models

Yan Zhou; Qingkai Fang; Yun Hong; Yang Feng

arXiv:2604.11096·cs.CL·April 14, 2026


TL;DR

This paper presents CSLM, an efficient cross-lingual speech language model that aligns speech and text modalities across multiple languages using a novel training strategy, improving generation quality and scalability.

Contribution

It introduces an alignment strategy that achieves cross-modal and cross-lingual alignment through continual pre-training on discrete speech tokens, enabling effective cross-lingual speech LLMs without extensive speech data.

Findings

01

CSLM demonstrates strong cross-modal alignment capabilities.

02

Instruction fine-tuning with speech-text interleaved chain-of-modality generation improves generation quality and reduces latency.

03

It exhibits good scalability across languages.

Abstract

Currently, large language models (LLMs) predominantly focus on the text modality. To enable more natural human-AI interaction, speech LLMs are emerging, but building effective end-to-end speech LLMs remains challenging due to limited data and the difficulty in expanding to more languages. In this paper, we introduce Cross-lingual Speech Language Model (CSLM), an efficient training method for cross-lingual speech LLMs based on discrete speech tokens. We propose a novel alignment strategy that achieves cross-modal and cross-lingual alignment through continual pre-training. By conducting instruction fine-tuning following a speech-text interleaved chain-of-modality generation process, we enhance modal alignment at a finer granularity, thereby improving generation quality and reducing latency. CSLM aligns different modalities and languages simultaneously without the need for massive speech…
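The speech-text interleaved generation mentioned in the abstract can be pictured with a small sketch. This is an illustration only, not the authors' implementation: the function name `interleave_chunks`, the fixed chunk sizes, and the placeholder token strings are all assumptions, since the abstract does not specify how text and speech tokens are scheduled. The idea shown is simply that speech tokens for an early text chunk appear in the sequence before the later text is produced, which is one way interleaving could reduce latency.

```python
from typing import List, Sequence


def interleave_chunks(
    text_tokens: Sequence[str],
    speech_tokens: Sequence[str],
    text_chunk: int = 2,    # assumed chunk size, not taken from the paper
    speech_chunk: int = 4,  # assumed chunk size, not taken from the paper
) -> List[str]:
    """Alternate fixed-size chunks of text and speech tokens in one sequence."""
    if text_chunk <= 0 or speech_chunk <= 0:
        raise ValueError("chunk sizes must be positive")
    out: List[str] = []
    t = s = 0
    # Keep emitting until both streams are exhausted; slices past the end
    # are empty, so the longer stream simply continues on its own.
    while t < len(text_tokens) or s < len(speech_tokens):
        out.extend(text_tokens[t:t + text_chunk])
        t += text_chunk
        out.extend(speech_tokens[s:s + speech_chunk])
        s += speech_chunk
    return out


# Hypothetical toy input: three text tokens and six discrete speech tokens.
text = ["hello", "world", "again"]
speech = [f"<s{i}>" for i in range(6)]
sequence = interleave_chunks(text, speech)
print(sequence)
```

In this toy sequence the first speech token sits at position 2, ahead of the final text token, so a vocoder could begin synthesizing audio before the full text response exists.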

Code & Models

Repositories

ictnlp/CSLM
github

Datasets

ICTNLP/BELLE-eval-S2S
dataset· 22 dl
