Multi-Task Instruction Tuning via Data Scheduling for Low-Resource Arabic AudioLLMs
Hunzalah Hassan Bhatti, Firoj Alam, Shammur Absar Chowdhury

TL;DR
This paper investigates multi-task instruction tuning for Arabic audio LLMs in low-resource settings, introducing a new speech summarization dataset and comparing training strategies to optimize performance across diverse speech tasks.
Contribution
It introduces AraMega-SSum, a novel Arabic speech summarization dataset, and evaluates four training strategies, proposing a two-stage TPC->ADS approach for balanced multi-task learning.
Findings
ADS accelerates early convergence and enhances paralinguistic tasks
Two-stage TPC->ADS offers the best overall task balance
Trade-off observed between training efficiency and robustness
Abstract
Audio large language models (LLMs) enable unified speech understanding and generation, but adapting them to linguistically complex and dialect-rich settings such as Arabic-English remains challenging. We present a controlled study of multi-task instruction tuning for an Arabic-centric audio LLM across generative tasks including ASR and speech and text summarization, and discriminative tasks including dialect and emotion recognition, in a resource-constrained setting. To support end-to-end Arabic speech summarization, we introduce AraMega-SSum, a first speech summarization resource for training and benchmarking Arabic-centric Audio-LLMs. We compare four training strategies: (i) Uniform Task Mixing, (ii) Task-Progressive Curriculum (TPC), (iii) Aligner-Based Diverse Sampling (ADS) for training-time batch construction, and (iv) a two-stage TPC->ADS strategy. Our results show a clear…
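To make the difference between the first two strategies concrete, the sketch below contrasts a uniform task mixer with a staged curriculum that widens the set of active tasks over time. It is an illustration under stated assumptions, not the authors' implementation: the function names, the stage ordering, and the equal per-task sampling weights are all hypothetical, and the abstract does not specify how TPC orders tasks or how ADS uses the aligner, so ADS is not sketched here.

```python
import random
from typing import Dict, List, Sequence, Tuple

# A scheduled item is (task name, example id). Both are placeholders.
Item = Tuple[str, str]


def _draw(rng: random.Random, tasks: Sequence[str],
          pools: Dict[str, List[str]]) -> Item:
    """Pick a task uniformly, then an example uniformly from that task's pool."""
    task = rng.choice(list(tasks))
    return task, rng.choice(pools[task])


def uniform_task_mixing(pools: Dict[str, List[str]], num_batches: int,
                        batch_size: int, seed: int = 0) -> List[List[Item]]:
    """Every batch samples from all tasks with equal probability (assumed weights)."""
    rng = random.Random(seed)
    tasks = sorted(pools)
    return [[_draw(rng, tasks, pools) for _ in range(batch_size)]
            for _ in range(num_batches)]


def task_progressive_curriculum(pools: Dict[str, List[str]],
                                stages: Sequence[Sequence[str]],
                                batches_per_stage: int, batch_size: int,
                                seed: int = 0) -> List[List[Item]]:
    """Hypothetical curriculum: each stage adds its tasks to the active set,
    and batches in that stage sample only from the tasks introduced so far."""
    rng = random.Random(seed)
    active: List[str] = []
    batches: List[List[Item]] = []
    for new_tasks in stages:
        active.extend(new_tasks)
        for _ in range(batches_per_stage):
            batches.append([_draw(rng, active, pools)
                            for _ in range(batch_size)])
    return batches
```

Under these assumptions, a two-stage schedule would run a curriculum like the one above first and then switch the batch constructor for the remaining steps; how that switch point is chosen is not stated in the abstract.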
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
