Heterogeneity-Aware Dataset Scheduling for Efficient Audio Large Language Model Training
Yanru Wu, Jianning Wang, Chongxin Gan, Yang Li

TL;DR
This paper introduces Grouped Sequential Training (GST), a dataset scheduling method that speeds up convergence while maintaining or improving performance when training Audio Large Language Models (ALLMs) on heterogeneous datasets.
Contribution
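GST is an affinity-aware, progressive dataset scheduling approach that explicitly manages dataset heterogeneity during ALLM training: it organizes datasets into groups by affinity and introduces those groups progressively, rather than mixing all datasets uniformly from the start.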
Findings
GST achieves 30-40% faster convergence than standard parallel training.
GST maintains or surpasses the performance of uniform mixture training.
Gradient-based affinity metrics enable scalable dataset relationship estimation.
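To make the idea of a gradient-based affinity concrete, here is a minimal sketch that scores two datasets by the cosine similarity of their gradients. This summary does not say which metric the paper actually uses, so cosine similarity is an assumption, and the function names, dataset names, and toy gradient vectors are all hypothetical.

```python
import math

def cosine_affinity(g_a, g_b):
    """Cosine similarity between two flattened gradient vectors.

    Assumption: affinity is measured as gradient cosine similarity.
    Positive values suggest aligned updates, negative values suggest conflict.
    """
    dot = sum(x * y for x, y in zip(g_a, g_b))
    norm_a = math.sqrt(sum(x * x for x in g_a))
    norm_b = math.sqrt(sum(y * y for y in g_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # no gradient signal, treat as neutral
    return dot / (norm_a * norm_b)

def affinity_matrix(grads):
    """Pairwise affinities for a dict mapping dataset name -> gradient vector."""
    names = list(grads)
    return {a: {b: cosine_affinity(grads[a], grads[b]) for b in names} for a in names}

# Hypothetical toy gradients; in practice these would be (possibly averaged)
# per-dataset gradients of the training loss with respect to shared parameters.
toy_grads = {
    "speech_qa": [1.0, 0.5, 0.0],
    "speech_asr": [0.9, 0.6, 0.1],
    "music_qa": [-0.8, 0.1, 0.6],
}
A = affinity_matrix(toy_grads)
```

In this toy example the two speech datasets have a high positive affinity, while `speech_qa` and `music_qa` have a negative one, which is the kind of conflict a scheduler would want to keep apart early in training.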
Abstract
Training general-purpose Audio Large Language Models (ALLMs) across diverse datasets is essential for holistic audio understanding, yet it faces significant challenges due to dataset heterogeneity, which often leads to conflicting gradients and slow convergence. Despite its impact, how to explicitly manage this heterogeneity during training remains underexplored, with current practices relying primarily on uniform mixture training. In this work, we analyze multi-dataset AudioQA training from a convergence perspective and propose Grouped Sequential Training (GST). GST strategically organizes datasets into affinity-aware groups and introduces them via a progressive scheduling protocol, effectively balancing the stability of parallel training with the efficiency of sequential optimization. To ensure scalability, we develop gradient-based affinity metrics that capture inter-dataset relationships…
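The grouping and progressive scheduling described in the abstract can be sketched roughly as follows. The abstract does not give the actual grouping rule or schedule, so everything here is an assumption for illustration: the greedy threshold grouping, the cumulative reading of "progressive" (each stage adds one group to the datasets already in training), the `threshold` value, the function names, and the toy affinity table.

```python
def group_by_affinity(names, affinity, threshold=0.5):
    """Greedy grouping (hypothetical rule, not taken from the paper).

    A dataset joins the first existing group in which its affinity to every
    member is at least `threshold`; otherwise it starts a new group.
    """
    groups = []
    for name in names:
        for group in groups:
            if all(affinity[name][member] >= threshold for member in group):
                group.append(name)
                break
        else:
            groups.append([name])
    return groups

def progressive_schedule(groups):
    """Cumulative schedule (assumed): stage k trains on groups 1..k jointly."""
    active = []
    stages = []
    for group in groups:
        active = active + group
        stages.append(list(active))
    return stages

# Hypothetical pairwise affinities, e.g. from a gradient-based metric.
toy_affinity = {
    "speech_qa": {"speech_qa": 1.0, "speech_asr": 0.9, "music_qa": -0.6},
    "speech_asr": {"speech_qa": 0.9, "speech_asr": 1.0, "music_qa": -0.4},
    "music_qa": {"speech_qa": -0.6, "speech_asr": -0.4, "music_qa": 1.0},
}
groups = group_by_affinity(list(toy_affinity), toy_affinity)
stages = progressive_schedule(groups)
```

With these toy values, the two speech datasets form one group and `music_qa` forms another; the first stage trains on the speech group alone and the second stage adds `music_qa`. Training within a stage resembles parallel (mixture) training, while the stage-by-stage introduction resembles sequential training, which is the trade-off the abstract describes.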
