IDEAL: Data Equilibrium Adaptation for Multi-Capability Language Model Alignment
Chenlin Ming, Chendi Qu, Mengzhang Cai, Qizhi Pei, Zhuoshi Pan, Yu Li, Xiaoming Duan, Lijun Wu, Conghui He

TL;DR
This paper introduces IDEAL, a gradient-based framework that dynamically balances multi-domain training data volumes to improve large language model capabilities across diverse tasks.
Contribution
IDEAL is a novel adaptive method that optimizes data distribution from multiple domains to enhance multi-capability LLM alignment and performance.
Findings
Achieves approximately 7% improvement in multi-task evaluation scores.
Outperforms uniform data allocation strategies.
Ensures balanced dataset composition for robust generalization.
Abstract
Large Language Models (LLMs) have achieved impressive performance through Supervised Fine-tuning (SFT) on diverse instructional datasets. When training on multiple capabilities simultaneously, the mixture training dataset, governed by volumes of data from different domains, is a critical factor that directly impacts the final model's performance. Unlike many studies that focus on enhancing the quality of training datasets through data selection methods, few works explore the intricate relationship between the compositional quantity of mixture training datasets and the emergent capabilities of LLMs. Given the availability of a high-quality multi-domain training dataset, understanding the impact of data from each domain on the model's overall capabilities is crucial for preparing SFT data and training a well-balanced model that performs effectively across diverse domains. In this work, we…
Peer Reviews
Decision: ICLR 2026 Poster
1. A classic and intriguing research problem, accompanied by an innovative solution.
2. This work provides some theoretical proofs under the premise of a well-formalized description.
1. Dynamic data curation has already been explored in some existing works [1], and it is recommended to introduce the related studies in the discussion.
2. Although the method is optimized for efficiency, the computational overhead is still relatively large, as shown in Table 7. This somewhat undermines the applicability of the method.
3. According to the experimental results in Table 1, at epoch=3, the performance of IDEAL is actually worse than its performance at epoch=1. This, to some exten
1. The paper addresses the critical and practical challenge of determining optimal data mixtures for multi-capability SFT, moving beyond simple heuristics.
2. The proposed IDEAL framework is grounded in optimization theory (bi-level optimization), providing a principled, gradient-based approach to adapt data proportions based on their impact on a reference set performance.
3. The work incorporates practical considerations for LLMs by using techniques like K-FAC to approximate the Hessian, maki
1. The computational overhead remains extremely high, potentially limiting practical utility. Despite approximations, the iterative nature of IDEAL, requiring a full SFT cycle per iteration plus complex gradient calculations involving Hessian approximations, makes it very resource-intensive. The appendix reveals the total time is an order of magnitude higher than standard SFT.
2. The method's effectiveness heavily relies on the accuracy of Hessian approximations (like K-FAC), which can introduc
1. Important Problem Selection. The paper addresses a genuinely important challenge in LLM training - how to balance multiple capabilities during supervised fine-tuning. This is a practical problem that many practitioners face, and finding principled solutions has real value.
2. Systematic Approach. Rather than relying on heuristics or manual tuning, IDEAL provides a systematic, gradient-based framework for optimizing data distributions. The mathematical formulation, while not novel, is reasonab
1. Limited Technical Novelty. The core contribution relies on well-established techniques (influence functions, K-FAC approximation) applied to data mixing. The formulation in Eq. (1) introducing β parameters for data repetition is straightforward, and the bi-level optimization problem (Eq. 2) follows standard approaches. The use of influence functions for data weighting has been extensively explored in prior work, making the technical contribution incremental.
2. Experiment results and setup is
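The reviews above refer to influence functions and a bi-level formulation without showing the computation. As a minimal sketch, and not the authors' implementation, the toy example below computes the gradient of a reference-set loss with respect to per-domain weights using the implicit function theorem. Everything in it is an assumption made for illustration: the ridge-regression inner problem, the exact Hessian solve (used here in place of the K-FAC approximation the reviews mention), the function names `fit`, `ref_loss`, and `hypergradient`, the step size, and the synthetic data.

```python
import numpy as np

def fit(domains, beta, lam=1e-2):
    """Inner problem (hypothetical): ridge regression on beta-weighted domain losses.

    Minimizes sum_k beta_k * L_k(theta) + (lam / 2) * ||theta||^2,
    where L_k is the mean squared error of domain k divided by 2.
    Returns the minimizer and the Hessian of the inner objective.
    """
    d = domains[0][0].shape[1]
    H = lam * np.eye(d)
    g = np.zeros(d)
    for b, (X, y) in zip(beta, domains):
        H += b * X.T @ X / len(y)
        g += b * X.T @ y / len(y)
    return np.linalg.solve(H, g), H

def ref_loss(theta, X_ref, y_ref):
    """Outer objective: halved mean squared error on the reference set."""
    return 0.5 * np.mean((X_ref @ theta - y_ref) ** 2)

def hypergradient(domains, beta, X_ref, y_ref, lam=1e-2):
    """Influence-style gradient of the reference loss w.r.t. each beta_k.

    From the inner optimality condition: d theta*/d beta_k = -H^{-1} grad L_k(theta*),
    so dL_ref/d beta_k = -grad L_ref(theta*)^T H^{-1} grad L_k(theta*).
    """
    theta, H = fit(domains, beta, lam)
    g_ref = X_ref.T @ (X_ref @ theta - y_ref) / len(y_ref)
    v = np.linalg.solve(H, g_ref)  # H^{-1} grad L_ref (H is symmetric)
    return np.array([-v @ (X.T @ (X @ theta - y) / len(y)) for X, y in domains])

# Synthetic data (assumption): two domains with different ground-truth weights.
rng = np.random.default_rng(0)
d = 5
w_a, w_b = rng.normal(size=d), rng.normal(size=d)

def make(w, n):
    X = rng.normal(size=(n, d))
    return X, X @ w + 0.1 * rng.normal(size=n)

domains = [make(w_a, 200), make(w_b, 200)]
X_ref, y_ref = make(w_a, 100)  # the reference set resembles domain 0

# Projected gradient descent on the domain weights, kept non-negative.
beta = np.array([1.0, 1.0])
for _ in range(100):
    beta = np.clip(beta - 0.1 * hypergradient(domains, beta, X_ref, y_ref), 0.0, None)

print("adapted domain weights:", beta)
```

In this toy setting the Hessian is small enough to invert exactly. At LLM scale that is not feasible, which is why the reviews discuss Hessian approximations and their cost.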
Taxonomy
Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Machine Learning and Data Classification
Methods: Focus · Shrink and Fine-Tune
