LDC-MTL: Balancing Multi-Task Learning through Scalable Loss Discrepancy Control
Peiyao Xiao, Chaosheng Dong, Shaofeng Zou, Kaiyi Ji

TL;DR
LDC-MTL introduces a scalable bilevel optimization method for multi-task learning that effectively balances tasks with minimal computational overhead, outperforming existing gradient manipulation techniques.
Contribution
It proposes a novel bilevel optimization framework for MTL that ensures balanced task learning with only constant time and memory complexity per iteration.
Findings
LDC-MTL achieves superior accuracy on multi-task datasets.
The method converges to a Pareto stationary point under mild conditions.
It significantly reduces computational overhead compared to existing methods.
Abstract
Multi-task learning (MTL) has been widely adopted for its ability to simultaneously learn multiple tasks. While existing gradient manipulation methods often yield more balanced solutions than simple scalarization-based approaches, they typically incur a significant computational overhead of $\mathcal{O}(K)$ in both time and memory, where $K$ is the number of tasks. In this paper, we propose LDC-MTL, a simple and scalable loss discrepancy control approach for MTL, formulated from a bilevel optimization perspective. Our method incorporates two key components: (i) a bilevel formulation for fine-grained loss discrepancy control, and (ii) a scalable first-order bilevel algorithm that requires only $\mathcal{O}(1)$ time and memory. Theoretically, we prove that LDC-MTL guarantees convergence not only to a stationary point of the bilevel problem with loss discrepancy control but also to an…
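The abstract does not spell out how loss discrepancy is measured, so the following is only a minimal sketch of one plausible measure: the sum of absolute gaps between adjacent weighted task losses. The function name `loss_discrepancy`, the adjacent-pair form, and the example numbers are assumptions made for illustration and are not taken from the paper or its code.

```python
# Hypothetical illustration of a loss-discrepancy measure; the adjacent-pair
# form and all names here are assumptions, not the paper's definition.

def loss_discrepancy(losses, weights):
    """Sum of absolute gaps between adjacent weighted task losses."""
    weighted = [w * l for w, l in zip(weights, losses)]
    return sum(abs(a - b) for a, b in zip(weighted, weighted[1:]))

# Three tasks whose raw losses live on very different scales.
losses = [4.0, 1.0, 0.5]

uniform = [1.0, 1.0, 1.0]    # plain linear scalarization
balanced = [0.25, 1.0, 2.0]  # weights that equalize the weighted losses

print(loss_discrepancy(losses, uniform))   # 3.5
print(loss_discrepancy(losses, balanced))  # 0.0
```

Under this toy measure, uniform weights leave a large gap between tasks, while weights that rescale each loss to a common level drive the discrepancy to zero. Controlling such a quantity through task weights is the general idea that the abstract's "fine-grained loss discrepancy control" points to.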
Peer Reviews
Decision: Submitted to ICLR 2026
The paper's main strengths lie in its clear bilevel formulation for controlling task-wise loss discrepancies, which provides a principled alternative to existing loss balancing methods. The proposed approach is both scalable and lightweight, featuring a single-loop update that avoids the typical $\mathcal{O}(K)$ gradient storage overhead found in many multi-task algorithms. Theoretical analysis further reinforces the work by providing convergence guarantees that ensure Pareto stationarity under
1. The claimed $\mathcal{O}(1)$ efficiency critically depends on the empirical observation that $\|\nabla_W g(W^t, z^t_N)\|$ remains small. While this is illustrated for the Cityscapes dataset ($K = 2$), providing similar empirical evidence for other datasets, particularly CelebA ($K = 40$), would strengthen the generality of this claim. 2. The experimental analyses on loss discrepancy and gradient conflict (Table 3, Figures 6 and 7) are conducted only against linear scalarization. For a fairer asse
1. The proposed method avoids solving the complex bi-level structure in BLO while still achieving good performance in MTL. 2. The paper structure is clear and easy to follow. 3. A theoretical proof supports the effectiveness of the proposed method.
1. Lacking $\mathcal{O}(1)$ baselines such as "Smooth Tchebycheff Scalarization for Multi-Objective Optimization, ICML 2024." 2. Using bi-level optimization to solve the MTL problem is not new. The "related work" part does not mention those works, such as "Multi-objective meta-learning, AIJ" and "A first-order multi-gradient algorithm for multi-objective bi-level optimization, ECAI 2024". 3. The proposed approach is quite like a penalty-based bi-level optimization approach, such as "On Penalty
1. The bilevel formulation provides a clear and principled avenue for directly controlling loss discrepancies while aiming for balanced performance across tasks (Section 4). 2. The authors conducted extensive experiments to verify the effectiveness of the proposed LDC-MTL. 3. The authors provide a thorough theoretical analysis, showing that the algorithm achieves $\epsilon$-Pareto stationarity under standard Lipschitz and PL conditions.
1. The authors claim that LDC-MTL achieves $\mathcal{O}(1)$ memory and time overhead per iteration. However, this property has already been stated in prior work such as FAMO. In fact, all loss-based MTL methods (e.g., UW, DWA, and more recently GO4Align) inherently possess this characteristic, as they aggregate task losses directly rather than computing gradients for each task separately—thus requiring only a single backward pass per iteration. Therefore, this advantage cannot be regarded as a d
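The reviewer's point about loss-based methods can be illustrated with a toy example. The sketch below assumes hypothetical quadratic task losses $l_i(x) = \frac{1}{2}\|x - c_i\|^2$ with hand-written gradients; the names `task_grad`, `per_task_gradients`, and `scalarized_gradient` are illustrative and not drawn from any of the cited methods. It contrasts storing one gradient per task ($K$ vectors, as gradient manipulation methods need) with accumulating the gradient of a weighted loss sum into a single vector.

```python
# Toy contrast between per-task gradients and a scalarized gradient.
# The quadratic losses and all names are assumptions for illustration only.

def task_grad(x, center):
    """Gradient of 0.5 * ||x - center||^2 with respect to x."""
    return [xj - cj for xj, cj in zip(x, center)]

def per_task_gradients(x, centers):
    """Gradient-manipulation style: keep all K task gradients (K x d storage)."""
    return [task_grad(x, c) for c in centers]

def scalarized_gradient(x, centers, weights):
    """Loss-based style: gradient of sum_i w_i * l_i(x), held in one d-vector."""
    g = [0.0] * len(x)
    for w, c in zip(weights, centers):
        for j, gj in enumerate(task_grad(x, c)):
            g[j] += w * gj
    return g

x = [1.0, 2.0]
centers = [[0.0, 0.0], [1.0, 1.0], [2.0, 4.0]]  # K = 3 tasks
weights = [0.5, 0.25, 0.25]

all_grads = per_task_gradients(x, centers)
combined = scalarized_gradient(x, centers, weights)

print(len(all_grads))  # 3
print(combined)        # [0.25, 0.75]
```

In this toy the loop still visits every task, because the gradients are computed analytically. In an autodiff framework the weighted sum of losses would instead be differentiated with one backward pass through the shared network, which is why the reviewer treats the constant per-iteration overhead as a property of loss-based methods in general.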
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning · Reinforcement Learning in Robotics
