Heterogeneity-Aware Dataset Scheduling for Efficient Audio Large Language Model Training

Yanru Wu; Jianning Wang; Chongxin Gan; Yang Li

arXiv:2605.19101 · cs.SD · May 20, 2026

TL;DR

This paper introduces Grouped Sequential Training (GST), a dataset scheduling method for training Audio Large Language Models (ALLMs) on heterogeneous datasets. It converges 30-40% faster than standard parallel training while matching or exceeding the performance of uniform mixture training.

Contribution

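GST is an affinity-aware, progressive dataset scheduling approach: it organizes datasets into groups using gradient-based affinity metrics and introduces those groups progressively, which manages heterogeneity during ALLM training and improves both efficiency and results.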

Findings

01. GST achieves 30-40% faster convergence than standard parallel training.

02. GST maintains or surpasses the performance of uniform mixture training.

03. Gradient-based affinity metrics enable scalable dataset relationship estimation.
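
To make the idea of a gradient-based affinity metric concrete, here is a minimal sketch. This page does not give the paper's exact metric, so the sketch assumes one common choice: cosine similarity between per-dataset gradient vectors, where a negative value indicates conflicting gradients. The dataset names, the toy gradient values, and the helper names `cosine` and `affinity_matrix` are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def affinity_matrix(grads):
    """Pairwise cosine similarity between per-dataset gradient vectors.

    `grads` maps a dataset name to a flattened gradient (list of floats).
    Assumption: cosine similarity stands in for the paper's affinity metric.
    """
    names = sorted(grads)
    return {a: {b: cosine(grads[a], grads[b]) for b in names} for a in names}


# Hypothetical flattened gradients, one per dataset (toy values).
toy_grads = {
    "speech_qa": [1.0, 0.0, 0.5],
    "speaker_qa": [0.9, 0.1, 0.4],
    "music_qa": [-0.2, 1.0, 0.0],
}

affinity = affinity_matrix(toy_grads)
```

In this toy setup the two speech-related datasets have nearly aligned gradients, while the music dataset's gradient points in a partly opposing direction, which is the kind of conflict the abstract attributes to heterogeneity.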

Abstract

Training general-purpose Audio Large Language Models (ALLMs) across diverse datasets is essential for holistic audio understanding, yet it faces significant challenges due to dataset heterogeneity, which often leads to conflicting gradients and slow convergence. Despite its impact, how to explicitly manage this heterogeneity during training remains underexplored, with current practices relying primarily on uniform mixture. In this work, we analyze multi-dataset AudioQA training from a convergence perspective and propose Grouped Sequential Training (GST). GST strategically organizes datasets into affinity-aware groups and introduces them via a progressive scheduling protocol, effectively balancing the stability of parallel training with the efficiency of sequential optimization. To ensure scalability, we develop gradient-based affinity metrics that capture inter-dataset relationships…
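
The grouping and progressive scheduling described in the abstract can be sketched roughly as follows. The abstract is truncated and does not specify the grouping algorithm or the schedule, so everything here is an assumption: a greedy threshold rule stands in for the grouping step, and "progressive" is read as adding one group per stage while keeping earlier groups in the training mixture. The function names, the threshold, the dataset names, and the affinity values are hypothetical.

```python
def group_by_affinity(affinity, threshold=0.5):
    """Greedy grouping: a dataset joins the first group whose members all
    have affinity >= threshold with it; otherwise it starts a new group.

    Assumption: this greedy rule is illustrative, not the paper's method.
    """
    groups = []
    for name in sorted(affinity):
        for group in groups:
            if all(affinity[name][member] >= threshold for member in group):
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


def progressive_schedule(groups):
    """Return the active dataset list for each stage.

    Assumption: each stage introduces one new group and keeps all
    previously introduced datasets in the mixture.
    """
    active = []
    stages = []
    for group in groups:
        active = active + group
        stages.append(list(active))
    return stages


# Hypothetical pairwise affinities between three datasets.
toy_affinity = {
    "speech_qa": {"speech_qa": 1.0, "speaker_qa": 0.9, "music_qa": -0.1},
    "speaker_qa": {"speech_qa": 0.9, "speaker_qa": 1.0, "music_qa": 0.0},
    "music_qa": {"speech_qa": -0.1, "speaker_qa": 0.0, "music_qa": 1.0},
}

groups = group_by_affinity(toy_affinity)
stages = progressive_schedule(groups)
for index, datasets in enumerate(stages, start=1):
    print(f"stage {index}: {datasets}")
```

Under these assumptions, high-affinity datasets train together from the start, and the lower-affinity dataset enters only at a later stage, which is one way to combine the stability of parallel training with the efficiency of sequential optimization.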
