Language-Aware Distillation for Multilingual Instruction-Following Speech LLMs with ASR-Only Supervision

Shreyas Gopal; Donghang Wu; Ashutosh Anshul; Yeo Yue Heng; Yizhou Peng; Haoyang Li; Hexin Liu; Eng Siong Chng

arXiv:2603.07025 · cs.CL · March 10, 2026

Open Access

TL;DR

This paper introduces a language-aware distillation method for multilingual speech LLMs that improves instruction-following and question-answering performance using ASR-only supervision and a novel query bank approach.

Contribution

The paper proposes a language-aware distillation technique with a query bank and gating network to enhance multilingual Speech LLMs trained solely on ASR data.

Findings

01 Achieved a 14% gain over matched multilingual distillation baselines on instruction following.

02 Synthesized Audio-MLQA, a multilingual spoken QA benchmark built on MLQA.

03 Improved Speech LLM performance by 32% on Audio-MLQA.

Abstract

Speech Large Language Models (LLMs) that understand and follow instructions in many languages are useful for real-world interaction, but they are difficult to train with supervised fine-tuning, which requires large, task-specific speech corpora. Recent distillation-based approaches train performant English-only Speech LLMs from annotated ASR data alone by aligning text and speech through a lightweight projector, but these models under-perform when scaled to multilingual settings because of language interference in the shared projector. We address this by introducing language-aware distillation using a query bank and a gating network that selects or mixes query tokens using a Q-Former projector. Our approach shows gains of 14% over matched multilingual distillation baselines on instruction following. We further synthesize Audio-MLQA, a multilingual spoken QA benchmark built on MLQA with…
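The abstract describes a query bank and a gating network that selects or mixes query tokens for a Q-Former projector, but this page gives no architectural detail. The following is a minimal sketch of one way such gating could work, not the authors' implementation. It assumes one set of query tokens per language, a softmax gate over per-language scores, and a hard (select) or soft (mix) mode. The names `make_query_bank` and `gate_queries`, the shapes, and the hand-written gate scores are all hypothetical; in a real model the scores would presumably come from a learned gating network over speech features.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def make_query_bank(num_languages, num_queries, dim, seed=0):
    """Hypothetical query bank: one [num_queries x dim] set of tokens per language.

    Random values stand in for parameters that would be learned in practice.
    """
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_queries)]
        for _ in range(num_languages)
    ]


def gate_queries(bank, gate_logits, hard=False):
    """Select (hard=True) or mix (hard=False) per-language query tokens.

    gate_logits holds one score per language; this is an assumed interface,
    standing in for the output of a gating network.
    Returns the resulting query tokens and the gate weights that were used.
    """
    weights = softmax(gate_logits)
    if hard:
        # Selection: put all weight on the highest-scoring language.
        best = max(range(len(weights)), key=weights.__getitem__)
        weights = [1.0 if i == best else 0.0 for i in range(len(weights))]
    num_queries = len(bank[0])
    dim = len(bank[0][0])
    # Mixing: weighted sum of the per-language query tokens, position by position.
    mixed = [
        [
            sum(weights[lang] * bank[lang][q][d] for lang in range(len(bank)))
            for d in range(dim)
        ]
        for q in range(num_queries)
    ]
    return mixed, weights


# Usage: three languages, four query tokens of width eight.
bank = make_query_bank(num_languages=3, num_queries=4, dim=8)
soft_queries, soft_weights = gate_queries(bank, [2.0, 0.1, -1.0])
hard_queries, hard_weights = gate_queries(bank, [2.0, 0.1, -1.0], hard=True)
```

In this sketch the gated tokens would then serve as the queries of the Q-Former projector in place of a single shared set, which is one plausible way to reduce the language interference the abstract attributes to a shared projector.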

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis