TL;DR
HBO is a hierarchical optimization method for fine-tuning large language models that dynamically balances data both across and within datasets to improve training effectiveness and accuracy.
Contribution
The paper presents a novel bilevel optimization approach with global and local actors for adaptive data balancing during LLM fine-tuning.
Findings
HBO outperforms existing baselines across multiple tasks.
Both global and local actors effectively adjust data usage.
Significant accuracy improvements are achieved with HBO.
Abstract
Fine-tuning large language models (LLMs) on a mixture of diverse datasets poses challenges due to data imbalance and heterogeneity. Existing methods often address these issues across datasets (globally) but overlook the imbalance and heterogeneity within individual datasets (locally), which limits their effectiveness. We introduce Hierarchical Balancing Optimization (HBO), a novel method that enables LLMs to autonomously adjust data allocation during fine-tuning both across datasets (globally) and within each individual dataset (locally). HBO employs a bilevel optimization strategy with two types of actors: a Global Actor, which balances data sampling across different subsets of the training mixture, and several Local Actors, which optimize data usage within each subset based on difficulty levels. These actors are guided by reward functions derived from the LLM's training state, which…
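The two-level sampling described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Actor` class, the `hierarchical_sample` helper, the learning rate, and the scalar `reward` argument are all hypothetical, and the reward is left as a placeholder because the abstract only says it is derived from the LLM's training state. The update rule is a plain REINFORCE step on a softmax policy, which one of the reviews below names as the training algorithm.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class Actor:
    """Hypothetical categorical sampling policy with one learnable logit per choice."""

    def __init__(self, n_choices, lr=0.1):
        self.logits = [0.0] * n_choices  # uniform sampling at the start
        self.lr = lr  # assumed learning rate, not taken from the paper

    def probs(self):
        return softmax(self.logits)

    def sample(self, rng):
        return rng.choices(range(len(self.logits)), weights=self.probs(), k=1)[0]

    def reinforce(self, action, reward):
        # REINFORCE step: d/d(logit_i) log p(action) = 1[i == action] - p_i.
        p = self.probs()
        for i in range(len(self.logits)):
            grad = (1.0 if i == action else 0.0) - p[i]
            self.logits[i] += self.lr * reward * grad


def hierarchical_sample(global_actor, local_actors, rng):
    """Pick a dataset with the global actor, then a difficulty bucket with its local actor."""
    dataset = global_actor.sample(rng)
    bucket = local_actors[dataset].sample(rng)
    return dataset, bucket
```

In a training loop, one would draw `(dataset, bucket)` with `hierarchical_sample`, train the model on a batch from that bucket, compute a reward from the model's training state, and then call `reinforce` on both the global actor and the selected local actor. How the rewards are defined and how often the actors are updated are details this sketch does not attempt to reproduce.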
Peer Reviews
Decision: ICLR 2026 Poster
1) HBO effectively addresses both global and local data imbalances, providing a more comprehensive solution to the challenges of fine-tuning LLMs on diverse datasets. 2) The bilevel optimization framework with Global and Local Actors allows for fine-grained control over data sampling, leading to improved model performance across various tasks. 3) Extensive experiments demonstrate HBO's strong applicability across multiple LLM backbones and tasks, consistently outperforming existing baselines a
1) My main concern is that the proposed method adds computation on top of MoS. The reinforcement learning framework, as well as some of the rewards, is similar to the MoS method. This work adds more actors and the grad norm reward; more insight into this field could be added. 2) This paper primarily compares three sampling balancing methods: MoS, MultiUAT, and MultiDDS. However, many of the results are similar to uniform sampling. The next steps for this field could be discussed.
1. The paper tackles a significant and nuanced challenge in LLM fine-tuning by explicitly addressing hierarchical data imbalance and heterogeneity (both global, across datasets, and local, within datasets), which is often overlooked by simpler methods. 2. The proposed HBO mechanism, utilizing a bilevel optimization framework with distinct global and local actors guided by rewards derived from the model's own training state, is a novel and sophisticated approach to achieve autonomous, dynamic da
1. The framework introduces substantial complexity compared to standard fine-tuning or simpler dynamic sampling. Implementing and tuning the bilevel optimization setup, managing multiple actors (one global, potentially many local), and ensuring stable training with the Reinforce algorithm likely requires significant expertise and effort. 2. The reported computational overhead, while quantified (~15%), is non-negligible and could be a barrier to practical adoption. This additional runtime cost
1. The paper targets an important problem, i.e., data imbalance and heterogeneity in LLM fine-tuning, which is relevant to current multi-task and multilingual training paradigms. 2. The hierarchical bilevel optimization formulation is conceptually interesting and provides a unified framework for global and local data balancing. 3. The paper is well-written and easy to follow.
I have the following concerns. *If the authors could properly address them during the rebuttal phase, I am willing to raise my score.* 1. The technical novelty is somewhat limited. While the hierarchical structure and bilevel setup are well-motivated, they mainly combine known techniques such as policy gradients and dynamic sampling into a straightforward framework, without introducing fundamentally new optimization principles. 2. This paper lacks strong theoretical or analytical justification f
