MGDA Converges under Generalized Smoothness, Provably
Qi Zhang, Peiyao Xiao, Shaofeng Zou, Kaiyi Ji

TL;DR
This paper extends the convergence analysis of the MGDA algorithm to generalized smooth loss functions common in neural networks, providing theoretical guarantees and efficient variants for multi-objective optimization.
Contribution
It introduces convergence guarantees for MGDA under generalized smoothness conditions and proposes an efficient MGDA-FA variant with similar performance.
Findings
MGDA converges to an $\epsilon$-accurate Pareto stationary point under generalized smoothness.
The stochastic MGDA requires $\mathcal{O}(\epsilon^{-4})$ samples for convergence.
MGDA-FA achieves the same guarantees with constant time and space complexity.
Abstract
Multi-objective optimization (MOO) is receiving more attention in various fields such as multi-task learning. Recent works provide some effective algorithms with theoretical analysis, but they are limited by the standard $L$-smooth or bounded-gradient assumptions, which typically do not hold for neural networks such as long short-term memory (LSTM) models and Transformers. In this paper, we study a more general and realistic class of generalized $\ell$-smooth loss functions, where $\ell$ is a general non-decreasing function of the gradient norm. We revisit and analyze the fundamental multiple gradient descent algorithm (MGDA) and its stochastic version with double sampling for solving generalized $\ell$-smooth MOO problems; both approximate the conflict-avoidant (CA) direction that maximizes the minimum improvement among objectives. We provide a comprehensive convergence analysis of…
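As a rough illustration of the kind of update the abstract describes, the sketch below alternates one projected-gradient step on the objective weights $w$ with one descent step on $x$ along the weighted gradient. It is a minimal sketch under stated assumptions, not the paper's algorithm: it assumes $\mathcal{W}$ is the probability simplex (the usual choice for MGDA), uses exact gradients, and picks the step sizes `alpha` and `beta` arbitrarily. The names `project_simplex`, `mgda_step`, and `toy_grads` are hypothetical.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    u = np.sort(v)[::-1]                      # entries in decreasing order
    css = np.cumsum(u) - 1.0
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]  # last index kept positive
    tau = css[rho] / (rho + 1)
    return np.maximum(v - tau, 0.0)

def mgda_step(x, w, grads, alpha, beta):
    """One single-loop update (assumed form): step on w, then step on x.

    grads(x) returns a (K, d) array whose rows are the K objective gradients.
    """
    G = grads(x)
    # Gradient of 0.5 * ||G^T w||^2 with respect to w is G G^T w.
    w = project_simplex(w - beta * (G @ (G.T @ w)))
    d = G.T @ w                               # approximate common descent direction
    return x - alpha * d, w, d

# Toy problem (an assumption for illustration): f_i(x) = 0.5 * ||x - c_i||^2.
centers = np.array([[1.0, 0.0], [-1.0, 0.0]])

def toy_grads(x):
    return x[None, :] - centers

x = np.array([0.3, 2.0])
w = np.full(2, 0.5)
for _ in range(500):
    x, w, d = mgda_step(x, w, toy_grads, alpha=0.1, beta=0.1)
```

On this toy problem the Pareto set is the segment between the two centers, so the weighted gradient `d` should shrink toward zero as `x` approaches that segment.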
Peer Reviews
Decision: ICLR 2025 Poster
- **Presentation**: The paper is easy to follow, even for readers without expertise in MOO. The MOO problem under generalized smoothness, the MGDA algorithm, and the warm-start procedure are well motivated. Further, the proof sketch provides a good overview of the key ideas, at least for the average CA distance.
- **Novel algorithms**: A warm-start procedure appears in (Xiao et al., 2023); however, it requires running two loops, one for the warm start of $w$ and the other for $x$. In contrast, the proposed w
- **Theory**:
  - **Definition of $\mathcal{W}$:** The authors have not defined the set $\mathcal{W}$; however, in the proof (Line 809, Eq. 11) they use $\max_{w\in \mathcal{W}}\|w\| \leq 1$. It seems to be the unit sphere in $K$ dimensions.
  - **Choice of step size does not work in Theorem 2:** The parameters $\alpha,\beta,\rho = \mathcal{O}(\epsilon^2)$ do not work out for Theorem 2, as each of $\alpha$, $\beta$, and $\rho$ depends on the other two. Consider $\rho \leq \frac{1}{\sqrt{\alpha T}}
- The analysis of MGDA under generalized smoothness is new. The analysis further relaxes assumptions made in prior work, such as bounded function values or bounded gradients.
- The authors use a general notion of generalized smoothness, i.e., one stated in terms of a non-decreasing function $\ell$ instead of the original form with $\ell(a) = L_0 + L_1 a$.
- The authors further study a variant of MGDA that approximates the gradient in each iteration to save memory and time.
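For reference, the two smoothness notions contrasted in this review are commonly written as follows for a twice-differentiable $f$. This is the form usually seen in the generalized-smoothness literature and is given here as an assumption; the paper's own definition may differ in detail.

$$
\|\nabla^2 f(x)\| \le L_0 + L_1 \|\nabla f(x)\| \quad \text{(the } (L_0, L_1)\text{ form)}, \qquad
\|\nabla^2 f(x)\| \le \ell\big(\|\nabla f(x)\|\big) \quad \text{(the generalized } \ell\text{ form)}.
$$

Taking $\ell$ to be a constant $L$ recovers standard $L$-smoothness, and $\ell(a) = L_0 + L_1 a$ recovers the first form.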
- The notation of this paper is a bit unclear. In multiple places, the authors use notation before defining it, so the write-up should be refined. For example, the authors should define $\mathcal{W}$ in Definition 3.
- The preliminaries on generalized smoothness in Section 2.1 are redundant, as they are not proposed by this work and are not the main contribution here.
- All the complexity results of this paper are only stated in terms of $\epsilon$. A more clear statements wit
1. The theoretical analysis is comprehensive and robust.
2. The paper is clearly written and well-organized, making the concepts easy to follow.
1. Assumption 2, which posits that $\phi(a) = \frac{a^2}{2\ell(2a)}$ is monotonically increasing, restricts $\ell(a) \leq \mathcal{O}(a^2)$. This places a strict limitation on the class of generalized $\ell$-smooth functions considered.
2. The paper's novelty is somewhat limited. From an algorithmic standpoint, MGDA was introduced several years ago, and the fast approximation presented here constitutes a relatively minor modification.
3. The novelty of the analytical techniques is also limited,
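A short worked check shows where the growth restriction in the first weakness comes from. The power-law family $\ell(a) = c\,a^p$ with $c > 0$ is an illustrative assumption, not taken from the paper.

$$
\ell(a) = c\,a^p \;\Rightarrow\; \phi(a) = \frac{a^2}{2c\,(2a)^p} = \frac{a^{2-p}}{2^{p+1} c},
$$

which is strictly increasing on $a > 0$ exactly when $p < 2$. For the linear case, assuming $L_0, L_1 \ge 0$ and not both zero,

$$
\ell(a) = L_0 + L_1 a \;\Rightarrow\; \phi(a) = \frac{a^2}{2L_0 + 4L_1 a}, \qquad
\phi'(a) = \frac{4L_0 a + 4L_1 a^2}{(2L_0 + 4L_1 a)^2} > 0 \quad \text{for } a > 0,
$$

so the $(L_0, L_1)$ form satisfies the monotonicity condition.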
Taxonomy
Topics: Advanced Multi-Objective Optimization Algorithms
