Speech LLMs in Low-Resource Scenarios: Data Volume Requirements and the Impact of Pretraining on High-Resource Languages

Seraphina Fong; Marco Matassoni; Alessio Brutti

arXiv:2508.05149·eess.AS·August 8, 2025

TL;DR

This paper studies Speech LLMs for low-resource automatic speech recognition: how much training data they need to match Whisper-only performance, and how projectors pretrained on high-resource languages reduce the impact of limited data.

Contribution

The paper applies the SLAM-ASR framework, in which a lightweight trainable projector connects a speech encoder to an LLM, and demonstrates that projectors pretrained on high-resource languages mitigate data scarcity in low-resource speech recognition.

Findings

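01 Pretraining the projector on high-resource languages reduces the training data needed for low-resource ASR, especially with small training sets.

02 Multilingual projectors pretrained on high-resource languages improve performance in low-resource settings.

03 The results offer insights for optimizing Speech LLMs for low-resource languages and multilingual settings.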

Abstract

Large language models (LLMs) have demonstrated potential in handling spoken inputs for high-resource languages, reaching state-of-the-art performance in various tasks. However, their applicability is still less explored in low-resource settings. This work investigates the use of Speech LLMs for low-resource Automatic Speech Recognition using the SLAM-ASR framework, where a trainable lightweight projector connects a speech encoder and an LLM. Firstly, we assess training data volume requirements to match Whisper-only performance, re-emphasizing the challenges of limited data. Secondly, we show that leveraging mono- or multilingual projectors pretrained on high-resource languages reduces the impact of data scarcity, especially with small training sets. Using multilingual LLMs (EuroLLM, Salamandra) with whisper-large-v3-turbo, we evaluate performance on several public benchmarks, providing…
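
To make the projector idea concrete, the following is a minimal sketch in plain Python of how a lightweight projector might map speech-encoder frames into an LLM's embedding space. It is not the paper's implementation. The frame-stacking factor `k`, the hidden size, the two-layer ReLU design, the random initialization, and the choice to place speech embeddings before the prompt embeddings are all assumptions made for illustration; the names `Projector`, `stack_frames`, and `build_llm_input` are hypothetical.

```python
import random


def stack_frames(frames, k):
    """Concatenate every k consecutive frames into one longer vector.

    Trailing frames that do not fill a complete group are dropped.
    """
    usable = len(frames) - len(frames) % k
    return [sum(frames[i:i + k], []) for i in range(0, usable, k)]


def linear(x, weight, bias):
    """Apply y = W x + b, with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def init_linear(in_dim, out_dim, rng):
    """Small random weights and zero bias (assumed initialization)."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
              for _ in range(out_dim)]
    return weight, [0.0] * out_dim


class Projector:
    """Hypothetical projector: stack k frames, then Linear -> ReLU -> Linear."""

    def __init__(self, enc_dim, llm_dim, k=5, hidden=16, seed=0):
        rng = random.Random(seed)
        self.k = k
        self.w1, self.b1 = init_linear(enc_dim * k, hidden, rng)
        self.w2, self.b2 = init_linear(hidden, llm_dim, rng)

    def __call__(self, frames):
        projected = []
        for x in stack_frames(frames, self.k):
            h = [max(0.0, v) for v in linear(x, self.w1, self.b1)]
            projected.append(linear(h, self.w2, self.b2))
        return projected


def build_llm_input(speech_embeds, prompt_embeds):
    """Join projected speech and prompt embeddings (ordering is an assumption)."""
    return speech_embeds + prompt_embeds


# Toy stand-ins: 23 encoder frames of dimension 4, an LLM embedding size of 8.
rng = random.Random(1)
encoder_frames = [[rng.random() for _ in range(4)] for _ in range(23)]
projector = Projector(enc_dim=4, llm_dim=8, k=5)
speech_embeds = projector(encoder_frames)
prompt_embeds = [[0.0] * 8 for _ in range(3)]
llm_input = build_llm_input(speech_embeds, prompt_embeds)
```

In this toy setup, 23 encoder frames are reduced to 4 speech embeddings, each matching the LLM embedding size, which are then joined with 3 prompt embeddings. A projector pretrained on a high-resource language would, under this sketch, correspond to starting from previously learned weights rather than random ones before fine-tuning on the low-resource data.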
