Cross-Lingual Interleaving for Speech Language Models
Adel Moumen, Guangzhi Sun, Philip C. Woodland

TL;DR
This paper introduces a cross-lingual interleaving method for spoken language models that improves multilingual understanding and cross-lingual capabilities without textual supervision, supported by new datasets and benchmarks.
Contribution
The authors propose a novel cross-lingual interleaving technique for speech models and release new multilingual datasets and benchmarks for evaluation.
Findings
Interleaving improves monolingual semantic accuracy.
Enhances cross-lingual continuation robustness.
Strengthens cross-lingual hidden-state alignment.
Abstract
Spoken Language Models (SLMs) aim to learn linguistic competence directly from speech using discrete units, widening access to Natural Language Processing (NLP) technologies for languages with limited written resources. However, progress has been largely English-centric due to scarce spoken evaluation benchmarks and training data, making cross-lingual learning difficult. We present a cross-lingual interleaving method that mixes speech tokens across languages without textual supervision. We also release an EN-FR training dataset, TinyStories (~42k hours), together with EN-FR spoken StoryCloze and TopicCloze benchmarks for cross-lingual semantic evaluation, both synthetically generated using GPT-4. On 360M and 1B SLMs under matched training-token budgets, interleaving improves monolingual semantic accuracy, enables robust cross-lingual continuation, and strengthens cross-lingual…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques
