TL;DR
This paper explores multimodal in-context learning (MICL) with speech LLMs to improve ASR for low-resource and unseen languages. It demonstrates effective cross-lingual transfer and analyzes attention patterns to interpret how MICL works.
Contribution
It introduces the use of multimodal in-context learning with speech LLMs for unseen languages, showing improvements over traditional prompt-based ASR and analyzing underlying attention patterns.
Findings
MICL is effective for unseen languages using speech and text modalities.
Cross-lingual transfer enhances MICL efficiency without target-language training.
MICL improves ASR performance and outperforms corpus-trained models in low-resource settings.
Abstract
Automatic speech recognition (ASR) still covers only a small fraction of the world's languages, mainly due to supervised data scarcity. In-context learning (ICL) with large language models (LLMs) addresses this problem, but prior work largely focuses on high-resource languages covered during training and text-only settings. This paper investigates whether speech LLMs can learn unseen languages with multimodal ICL (MICL), and how this learning can be used to improve ASR. We conduct experiments with two speech LLMs, Phi-4 and Qwen3-Omni, on three diverse endangered languages. Firstly, we find that MICL is effective for unseen languages, leveraging both speech and text modalities. We further show that cross-lingual transfer learning improves MICL efficiency on target languages without training on them. Moreover, we analyze attention patterns to interpret MICL mechanisms, and we observe…
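To make the MICL setup concrete, here is a minimal sketch of how a multimodal in-context prompt could be assembled: a few (audio, transcript) demonstrations from the target language are interleaved, and the audio to be transcribed is appended last. The segment format, the `Demo` class, the `build_micl_prompt` function, the instruction string, and the file names are all hypothetical and chosen for illustration. The excerpt above does not describe the prompt template actually used with Phi-4 or Qwen3-Omni.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Demo:
    """One in-context demonstration (hypothetical structure)."""
    audio_path: str   # path to a target-language recording
    transcript: str   # its reference transcription


def build_micl_prompt(
    demos: List[Demo],
    query_audio: str,
    instruction: str = "Transcribe the audio.",  # assumed wording
) -> List[Dict[str, str]]:
    """Interleave (audio, transcript) pairs, then append the query audio.

    Returns a flat list of typed segments. A real speech LLM would need
    these converted into its own chat or processor format.
    """
    segments = [{"type": "text", "value": instruction}]
    for demo in demos:
        segments.append({"type": "audio", "value": demo.audio_path})
        segments.append({"type": "text", "value": demo.transcript})
    # The query has audio only; the model is expected to produce the text.
    segments.append({"type": "audio", "value": query_audio})
    return segments


if __name__ == "__main__":
    demos = [Demo("demo1.wav", "first transcript"),
             Demo("demo2.wav", "second transcript")]
    for segment in build_micl_prompt(demos, "query.wav"):
        print(segment["type"], segment["value"])
```

Dropping the transcript segments, or replacing the audio segments with text only, would give the single-modality variants that a speech-plus-text setup can be compared against.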
