Language-Aware Distillation for Multilingual Instruction-Following Speech LLMs with ASR-Only Supervision
Shreyas Gopal, Donghang Wu, Ashutosh Anshul, Yeo Yue Heng, Yizhou Peng, Haoyang Li, Hexin Liu, Eng Siong Chng

TL;DR
This paper introduces a language-aware distillation method for multilingual speech LLMs that improves instruction-following and question-answering performance using ASR-only supervision and a novel query bank approach.
Contribution
The paper proposes a language-aware distillation technique with a query bank and gating network to enhance multilingual Speech LLMs trained solely on ASR data.
Findings
Achieved a 14% improvement over matched multilingual distillation baselines on instruction following.
Synthesized Audio-MLQA, a multilingual spoken QA benchmark built on MLQA.
Improved Speech LLM performance by 32% on Audio-MLQA.
Abstract
Speech Large Language Models (LLMs) that understand and follow instructions in many languages are useful for real-world interaction, but are difficult to train with supervised fine-tuning, which requires large, task-specific speech corpora. Recent distillation-based approaches train performant English-only Speech LLMs from annotated ASR data alone by aligning text and speech through a lightweight projector, but these models underperform when scaled to multilingual settings because of language interference in the shared projector. We address this by introducing language-aware distillation, which uses a Q-Former projector with a query bank and a gating network that selects or mixes query tokens. Our approach shows gains of 14% over matched multilingual distillation baselines on instruction following. We further synthesize Audio-MLQA, a multilingual spoken QA benchmark built on MLQA with…
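As a rough illustration of the query bank and gating idea described in the abstract, the sketch below keeps one set of query tokens per language and lets a small gate either pick one set or blend them. It is not the authors' implementation. The mean pooling of speech features, the single linear gate, the softmax weighting, and all names (`mean_pool`, `gate_weights`, `select_queries`) are assumptions made for the example. The Q-Former cross-attention that would consume the resulting queries is omitted.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pool(frames):
    """Average a list of feature frames (each a list of floats) over time.

    Assumption: the gate sees one pooled vector per utterance.
    """
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def gate_weights(pooled, gate_w, gate_b):
    """Hypothetical gating network: one linear layer followed by softmax.

    gate_w has one row per language bank; the result is one weight per bank.
    """
    logits = [
        sum(w * x for w, x in zip(row, pooled)) + b
        for row, b in zip(gate_w, gate_b)
    ]
    return softmax(logits)


def select_queries(query_bank, weights, hard=False):
    """Select (hard) or mix (soft) query tokens from the bank.

    query_bank[k][q][d]: bank k, query token q, feature dimension d.
    """
    if hard:
        # Hard selection: take the bank with the largest gate weight.
        best = max(range(len(weights)), key=weights.__getitem__)
        return query_bank[best]
    num_queries = len(query_bank[0])
    dim = len(query_bank[0][0])
    # Soft mixing: weighted sum of the banks, token by token.
    return [
        [
            sum(w * bank[q][d] for w, bank in zip(weights, query_bank))
            for d in range(dim)
        ]
        for q in range(num_queries)
    ]


# Toy usage: two language banks, one query token each, 2-dim features.
frames = [[1.0, 0.0], [0.8, 0.2]]      # stand-in for speech encoder output
bank = [[[1.0, 1.0]], [[3.0, 5.0]]]    # stand-in for learned query tokens
weights = gate_weights(mean_pool(frames), [[5.0, 0.0], [0.0, 5.0]], [0.0, 0.0])
queries = select_queries(bank, weights)  # would be passed on to the Q-Former
```

In this sketch, soft mixing lets an utterance draw on several language banks at once, while `hard=True` commits to a single bank. Which behavior the paper uses in which setting is not stated in the abstract.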
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
