A stochastic gradient method for trilevel optimization
Tommaso Giovannelli, Griffin Dean Kent, Luis Nunes Vicente

TL;DR
This paper introduces the first stochastic gradient descent method for unconstrained trilevel optimization, addressing inexactness and noise in gradient computations, with theoretical convergence guarantees and promising numerical results.
Contribution
It presents a novel stochastic gradient method for trilevel problems with comprehensive convergence analysis under various inexactness conditions.
Findings
Convergence theory covers all forms of inexactness in gradient computations.
Numerical experiments show effectiveness on synthetic and hyperparameter tuning problems.
First stochastic method specifically designed for unconstrained trilevel optimization.
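For orientation, one common hierarchical way to write an unconstrained trilevel problem is sketched below. This is an assumed reading, not the paper's own Eq. (TLO): it takes the middle level to optimize only $y$ and the lower level only $z$, assumes the lower-level solution $z(x, y)$ is unique (for example, under strong convexity in $z$), and assumes each $f_i$ is an expectation over a random sample $\xi$. The last line is the standard implicit-function-theorem Jacobian used in bilevel implicit differentiation, with $\nabla^2_{zy} f_3$ denoting the matrix of mixed partial derivatives $\partial^2 f_3 / \partial z_i \partial y_j$.

```latex
\begin{aligned}
\min_{x}\quad & f_1\bigl(x,\, y(x),\, z(x, y(x))\bigr) \\
\text{where}\quad & y(x) \in \arg\min_{y}\; f_2\bigl(x,\, y,\, z(x, y)\bigr), \\
& z(x, y) = \arg\min_{z}\; f_3(x, y, z), \\
& f_i(x, y, z) = \mathbb{E}_{\xi}\bigl[f_i(x, y, z; \xi)\bigr], \quad i = 1, 2, 3, \\[4pt]
\nabla_z f_3\bigl(x, y, z(x, y)\bigr) = 0
\;\Longrightarrow\;
& \frac{\partial z}{\partial y}(x, y)
  = -\bigl[\nabla^2_{zz} f_3\bigr]^{-1} \nabla^2_{zy} f_3 .
\end{aligned}
```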
Abstract
With the success that the field of bilevel optimization has seen in recent years, similar methodologies have started being applied to solving more difficult applications that arise in trilevel optimization. At the helm of these applications are new machine learning formulations that have been proposed in the trilevel context and, as a result, efficient and theoretically sound stochastic methods are required. In this work, we propose the first-ever stochastic gradient descent method for solving unconstrained trilevel optimization problems and provide a convergence theory that covers all forms of inexactness of the trilevel adjoint gradient, such as the inexact solutions of the middle-level and lower-level problems, inexact computation of the trilevel adjoint formula, and noisy estimates of the gradients, Hessians, Jacobians, and tensors of third-order derivatives involved. We also…
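The sources of inexactness listed in the abstract can be illustrated on a toy problem. The sketch below is not the paper's algorithm: it is a minimal nested stochastic gradient loop on a hypothetical scalar quadratic trilevel problem, chosen so that every derivative can be written by hand. The function name `trilevel_sgd_toy`, the objectives, the step sizes, and the noise model are all assumptions made for illustration. The lower and middle levels are solved only approximately (a few warm-started steps each), and every gradient is perturbed by zero-mean Gaussian noise.

```python
import random

def trilevel_sgd_toy(outer_iters=2000, mid_iters=5, low_iters=5,
                     noise=0.01, seed=0):
    """Nested noisy gradient descent on a hypothetical scalar trilevel toy.

    Assumed objectives (illustrative only, not from the paper):
      f3(x, y, z) = 0.5 * (z - (x + y))**2               -> z(x, y) = x + y
      f2(x, y, z) = 0.5 * (y - 2*x)**2 + 0.5 * z**2      -> y(x) = x / 2
      f1(x, y, z) = 0.5 * (x - 1)**2 + 0.5*y**2 + 0.5*z**2
    The exact upper-level minimizer is x = 2/7.
    """
    rng = random.Random(seed)
    x, y, z = 0.0, 0.0, 0.0

    # Sensitivities from the implicit function theorem, constant for this toy:
    dz_dy = 1.0            # -(d2f3/dz2)^-1 * d2f3/dzdy = -(1)^-1 * (-1)
    dz_dx_partial = 1.0    # -(d2f3/dz2)^-1 * d2f3/dzdx = -(1)^-1 * (-1)
    # Reduced middle objective: 0.5*(y-2x)^2 + 0.5*(x+y)^2, so
    # its yy-derivative is 2 and its yx-derivative is -1.
    dy_dx = 0.5
    dz_dx_total = dz_dx_partial + dz_dy * dy_dx  # chain rule through y(x)

    for i in range(outer_iters):
        alpha = 0.2 / (1.0 + 0.01 * i)  # decaying upper-level step size
        for _ in range(mid_iters):
            for _ in range(low_iters):
                # Noisy lower-level gradient step (inexact lower-level solve).
                g_z = (z - (x + y)) + noise * rng.gauss(0.0, 1.0)
                z -= 0.5 * g_z
            # Noisy middle-level adjoint gradient: d/dy of f2(x, y, z(x, y)).
            g_y = (y - 2.0 * x) + dz_dy * z + noise * rng.gauss(0.0, 1.0)
            y -= 0.25 * g_y
        # Noisy upper-level adjoint gradient evaluated at the inexact (y, z).
        g_x = (x - 1.0) + dy_dx * y + dz_dx_total * z
        g_x += noise * rng.gauss(0.0, 1.0)
        x -= alpha * g_x
    return x, y, z

if __name__ == "__main__":
    x, y, z = trilevel_sgd_toy()
    print(f"x={x:.4f} (exact 2/7), y={y:.4f} (exact 1/7), z={z:.4f} (exact 3/7)")
```

In this toy the second-order terms are constants, so no Hessian, Jacobian, or third-order tensor estimates are needed; in a general trilevel problem those quantities would themselves have to be estimated from samples, which is the setting the abstract describes.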
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The paper addresses an underexplored area by extending stochastic gradient methods to trilevel optimization. 2. The convergence analysis is detailed. The experimental section includes both synthetic and application-based evaluations.
1. The theoretical results rely on strong convexity of both the middle- and lower-level problems, which greatly limits applicability to real-world scenarios. In particular, Assumption 3.3 is justified by the authors through the case where the LL problem is a QP; however, in that case, the entire trilevel structure can be reformulated as a linearly constrained bilevel problem, for which many well-established algorithms already exist. Many real-world multi-level problems (e.g., in adversarial trai
- The paper formalizes the trilevel adjoint gradient through a clean extension of bilevel implicit differentiation. - It provides a comprehensive theoretical treatment with explicit assumptions (smoothness, strong convexity, unbiased gradient noise, bounded Hessians). - Experiments include synthetic and real data settings, showing numerical feasibility.
- The formal definition of trilevel optimization (Eq. TLO) is incorrect or at least misleading: it allows $y, z \in \arg\min f_2(x, y, z), \ z\in \arg\min f_3(x, y, z)$, which implies that the middle level and the lower level both optimize $z$. This formulation conflicts with the intended hierarchical semantics; only later, in lines 110-113, does it become clear that the middle level should optimize only $y$. - TSG is essentially a mechanical extension of stochastic bilevel frameworks (e.g., StocBiO and
As a theory-driven article, it clearly articulates the problem formulation, hypotheses, algorithm derivation, and convergence rate.
I believe the article has several major weaknesses: 1. The novelty and technical contributions of the paper are clearly insufficient, and I don't think it meets the conference standards. The results presented are a natural extension of bilevel optimization, and I do not see any significant challenges in this extension. 2. The problem formulation in the paper relies on several strong assumptions. For example, Assumption 3.3 is a very strong assumption, and since $\bar f$ is an implicit objectiv
- **S1**: The paper tackles trilevel optimization, which has been little explored in the community.
- **W1**: The introduction of the problem could be improved for clarity. It would be helpful to explicitly state that the $f_i$ are expectations, as I guess by reading the paper. Without this clarification, the motivation for using a stochastic solver is unclear. Additionally, sample derivatives, like $\nabla_z f_3(x, y^{i,j}, z^{i,j,k}; \xi^{i,j,k})$ (line 146), are not defined. Stating immediately from the beginning that $f_3(x, y, z) = \mathbb{E}_\xi[f_3(x, y, z;\xi)]$ would be sufficient to
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Tensor decomposition and applications · Sparse and Compressive Sensing Techniques
