Multi-Task Instruction Tuning via Data Scheduling for Low-Resource Arabic AudioLLMs

Hunzalah Hassan Bhatti; Firoj Alam; Shammur Absar Chowdhury

arXiv:2601.12494 · cs.SD · March 24, 2026

TL;DR

This paper investigates multi-task instruction tuning for Arabic audio LLMs in low-resource settings, introducing a new speech summarization dataset and comparing training strategies to optimize performance across diverse speech tasks.

Contribution

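The paper introduces AraMega-SSum, a new Arabic speech summarization dataset, and evaluates four training strategies. It proposes a two-stage approach, Task-Progressive Curriculum (TPC) followed by Aligner-Based Diverse Sampling (ADS), written TPC->ADS, for balanced multi-task learning.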

Findings

01  ADS accelerates early convergence and improves performance on paralinguistic tasks

02  Two-stage TPC->ADS offers the best overall balance across tasks

03  A trade-off is observed between training efficiency and robustness

Abstract

Audio large language models (LLMs) enable unified speech understanding and generation, but adapting them to linguistically complex and dialect-rich settings such as Arabic-English remains challenging. We present a controlled study of multi-task instruction tuning for an Arabic-centric audio LLM across generative tasks, including ASR and speech and text summarization, and discriminative tasks, including dialect and emotion recognition, in a resource-constrained setting. To support end-to-end Arabic speech summarization, we introduce AraMega-SSum, a first speech summarization resource for training and benchmarking Arabic-centric Audio-LLMs. We compare four training strategies: (i) Uniform Task Mixing, (ii) Task-Progressive Curriculum (TPC), (iii) Aligner-Based Diverse Sampling (ADS) for training-time batch construction, and (iv) a two-stage TPC->ADS strategy. Our results show a clear…
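
The scheduling idea behind these strategies can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how TPC orders tasks or how ADS builds batches, so the sketch assumes a curriculum that unlocks one additional task per phase and leaves the second-stage sampler as a pluggable function. All names here (`uniform_mix_batch`, `curriculum_tasks`, `two_stage_batch`, `switch_step`, `steps_per_phase`) are hypothetical.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Pools = Dict[str, List[str]]      # task name -> example ids (hypothetical layout)
Batch = List[Tuple[str, str]]     # list of (task, example id) pairs


def uniform_mix_batch(pools: Pools, batch_size: int, rng: random.Random) -> Batch:
    """Uniform task mixing: draw a task uniformly, then one of its examples."""
    tasks = sorted(pools)
    batch: Batch = []
    for _ in range(batch_size):
        task = rng.choice(tasks)
        batch.append((task, rng.choice(pools[task])))
    return batch


def curriculum_tasks(task_order: Sequence[str], step: int, steps_per_phase: int) -> List[str]:
    """Assumed curriculum: one more task becomes active every `steps_per_phase` steps."""
    n_unlocked = min(len(task_order), step // steps_per_phase + 1)
    return list(task_order[:n_unlocked])


def two_stage_batch(
    pools: Pools,
    task_order: Sequence[str],
    step: int,
    switch_step: int,
    steps_per_phase: int,
    batch_size: int,
    rng: random.Random,
    stage2_sampler: Callable[[Pools, int, random.Random], Batch],
) -> Batch:
    """Two-stage schedule: curriculum before `switch_step`, another sampler after."""
    if step < switch_step:
        active = curriculum_tasks(task_order, step, steps_per_phase)
        return uniform_mix_batch({t: pools[t] for t in active}, batch_size, rng)
    # Stand-in hook: a diversity-aware sampler such as ADS would be plugged in here.
    return stage2_sampler(pools, batch_size, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    pools = {"asr": ["a1", "a2"], "summarization": ["s1"],
             "dialect": ["d1"], "emotion": ["e1"]}
    order = ["asr", "summarization", "dialect", "emotion"]
    for step in (0, 15, 40):
        print(step, two_stage_batch(pools, order, step, 40, 10, 4, rng, uniform_mix_batch))
```

In the usage at the bottom, `uniform_mix_batch` is passed as the second-stage sampler only as a placeholder; an actual ADS sampler would replace it.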

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing