Implicit Bilevel Optimization: Differentiating through Bilevel Optimization Programming

Francesco Alesiani

arXiv:2302.14473·cs.LG·March 1, 2023


TL;DR

This paper introduces BiGrad, a novel method for differentiating through bilevel optimization problems, enabling end-to-end learning in models that incorporate bilevel programming, applicable to both continuous and combinatorial cases.

Contribution

It extends single-level optimization approaches to bilevel programming, providing a general, efficient framework for differentiating through complex bilevel problems in machine learning.

Findings

1. BiGrad effectively extends single-level methods to bilevel programming.

2. The approach reduces computational complexity for combinatorial problems.

3. Experiments demonstrate successful integration of bilevel optimization in learning models.


Tables

Table 1: Optimal Control Average Cost. The bilevel approach improves (lower cost) over the two-step approach because it is able to better capture the interaction between noise and control dynamics.

	LQR	OptNet	Bilevel
Adversarial (10 steps)	2.736	0.2722	0.2379
Adversarial (30 steps)	-	0.2511	0.2181

Table 2: Performance on the adversarial attack with discrete features, with $Q=10$. DCNN is the single-level discrete CNN, Bi-DCNN is the bilevel discrete CNN, CNN is the vanilla CNN, while CNN* is the CNN where we add the bilevel discrete layer after vanilla training.

$L_{\infty} \leq \alpha$	DCNN	Bi-DCNN	CNN	CNN*
0	62.9 $\pm$ 0.3	64.0 $\pm$ 0.4	63.4 $\pm$ 0.7	63.6 $\pm$ 0.5
5	42.6 $\pm$ 1.0	44.5 $\pm$ 0.2	43.8 $\pm$ 1.2	44.3 $\pm$ 1.0
10	23.5 $\pm$ 1.5	25.3 $\pm$ 0.8	24.3 $\pm$ 1.0	24.2 $\pm$ 1.0
15	14.4 $\pm$ 1.4	15.6 $\pm$ 0.7	14.6 $\pm$ 0.7	14.3 $\pm$ 0.4
20	9.1 $\pm$ 1.2	10.0 $\pm$ 0.6	9.2 $\pm$ 0.4	8.9 $\pm$ 0.2
25	6.1 $\pm$ 1.0	6.8 $\pm$ 0.5	6.0 $\pm$ 0.2	5.9 $\pm$ 0.2
30	3.9 $\pm$ 0.7	4.4 $\pm$ 0.5	3.9 $\pm$ 0.2	3.9 $\pm$ 0.1

Table 3: Performance on the Dynamic Programming Problem with Interdiction. SL uses ResNet18. All columns report accuracy.

gradient type	train [12x12 maps]	validation [12x12 maps]	train [18x18 maps]	validation [18x18 maps]	train [24x24 maps]	validation [24x24 maps]
BiGrad (BB)	95.8 $\pm$ 0.2	94.5 $\pm$ 0.2	97.1 $\pm$ 0.0	96.4 $\pm$ 0.2	98.0 $\pm$ 0.0	97.8 $\pm$ 0.0
BiGrad (PT)	91.7 $\pm$ 0.1	91.6 $\pm$ 0.1	94.3 $\pm$ 0.0	94.2 $\pm$ 0.1	95.7 $\pm$ 0.0	95.6 $\pm$ 0.1
BB-1	95.9 $\pm$ 0.2	91.7 $\pm$ 0.1	96.7 $\pm$ 0.2	94.5 $\pm$ 0.1	97.1 $\pm$ 0.1	96.3 $\pm$ 0.2
PT-1	88.3 $\pm$ 0.2	87.5 $\pm$ 0.2	90.9 $\pm$ 0.4	90.6 $\pm$ 0.5	92.8 $\pm$ 0.1	92.8 $\pm$ 0.2
SL	100.0 $\pm$ 0.0	26.2 $\pm$ 2.4	99.9 $\pm$ 0.1	20.2 $\pm$ 0.5	99.1 $\pm$ 0.2	14.0 $\pm$ 1.0

Table 4: Performance in terms of the accuracy of the TSP use case with interdiction. SL has higher accuracy during training but fails at test time. BB and PT are BiGrad variants.

gradient type	k	train accuracy	validation accuracy	k	train accuracy	validation accuracy	k	train accuracy	validation accuracy
BB	8	89.2 $\pm$ 0.1	89.4 $\pm$ 0.2	10	91.9 $\pm$ 0.1	92.0 $\pm$ 0.1	12	93.5 $\pm$ 0.1	93.5 $\pm$ 0.2
PT	8	89.3 $\pm$ 0.0	89.4 $\pm$ 0.1	10	92.0 $\pm$ 0.0	91.9 $\pm$ 0.1	12	93.7 $\pm$ 0.1	93.7 $\pm$ 0.1
BB-1	8	84.0 $\pm$ 0.4	83.9 $\pm$ 0.4	10	87.4 $\pm$ 0.3	87.5 $\pm$ 0.4	12	89.3 $\pm$ 0.1	89.3 $\pm$ 0.1
PT-1	8	84.1 $\pm$ 0.4	84.1 $\pm$ 0.3	10	87.3 $\pm$ 0.3	87.0 $\pm$ 0.3	12	89.3 $\pm$ 0.0	89.5 $\pm$ 0.2
SL	8	94.2 $\pm$ 5.0	10.7 $\pm$ 3.9	10	92.7 $\pm$ 5.4	9.4 $\pm$ 0.4	12	91.4 $\pm$ 2.3	9.3 $\pm$ 1.2

Equations

Bilevel problem with discrete variables (Eq.2):

$$\min_{x\in X}\ \langle z,x\rangle_{A}+\langle y,x\rangle_{B},\qquad y\in\arg\min_{y\in Y}\ \langle w,y\rangle_{C}+\langle x,y\rangle_{D}$$

Loss over the output variables:

$$L^{2}(x,y)=L^{2}(x)+L^{2}(y)$$

Controlled dynamical system:

$$x_{t+1}=Ax_{t}+B\phi(x_{t})+w_{t},\quad\forall t$$

Min-max problem of the discretization layer:

$$\min_{x\in Q}\max_{y\in B}\ \langle z+x,y\rangle$$

Min-max problem with interdiction:

$$\min_{y\in Y}\max_{x\in X}\ \langle z+x\odot w,y\rangle$$

Changes of variables and barrier transformation:

$$x\to x_{0}+A^{\perp}x,\qquad f\to tf-\sum_{i=1}^{k_{x}}\ln(-f_{i})$$

$$x(u)=x_{0}+A^{\perp}u$$

Constraints of the outer and inner problems:

$$\text{s.t.}\quad Ax+z+R(y)(x-r)=b,\quad s\in K$$

$$\text{s.t.}\quad By+u+P(x)(y-p)=f,\quad u\in K$$

Stationarity conditions and the resulting linear system in the multipliers $\lambda,\gamma$:

$$\nabla_{x}L+\nabla_{x}F\lambda+\nabla_{x}G\gamma=0,\qquad \nabla_{y}L+\nabla_{y}F\lambda+\nabla_{y}G\gamma=0$$

$$\begin{bmatrix}\nabla_{x}F&\nabla_{x}G\\ \nabla_{y}F&\nabla_{y}G\end{bmatrix}\begin{bmatrix}\lambda\\ \gamma\end{bmatrix}=-\begin{bmatrix}\nabla_{x}L\\ \nabla_{y}L\end{bmatrix}$$

Symbols that appear on their own: $F(x,y,z)$, $u_{t}(\epsilon)$, $\mathrm{d}_{z}L$, $\mathrm{d}_{w}L$, $\mathrm{d}L$, $\mathrm{d}F$, $\mathrm{d}G$, $\nabla_{z}L$, $\nabla_{w}L$, $\nabla_{z,w}L$, $\nabla_{z}x(z,y)$, $\nabla_{w}y(w,z)$, $\nabla_{x}y(x,w)$.

Operators whose arguments were lost in extraction: $\min_{x\in X}$, $\min_{\phi}$, $\min_{u_{t}}$, $\max_{\epsilon}$, $\min_{x}$, $y\in$.


Taxonomy

Topics: Risk and Portfolio Optimization · Bayesian Modeling and Causal Inference · Stochastic Gradient Optimization Techniques

Full text

Implicit Bilevel Optimization: Differentiating through Bilevel Optimization Programming

Francesco Alesiani

Abstract

Bilevel Optimization Programming is used to model complex and conflicting interactions between agents, for example in Robust AI or Privacy-preserving AI. Integrating bilevel mathematical programming within deep learning is thus an essential objective for the Machine Learning community. Previously proposed approaches only consider single-level programming. In this paper, we extend existing single-level optimization programming approaches and thus propose Differentiating through Bilevel Optimization Programming (BiGrad) for end-to-end learning of models that use Bilevel Programming as a layer. BiGrad has wide applicability and can be used in modern machine learning frameworks. BiGrad is applicable to both continuous and combinatorial Bilevel optimization problems. We describe a class of gradient estimators for the combinatorial case which reduces the requirements in terms of computation complexity; for the case of the continuous variable, the gradient computation takes advantage of the push-back approach (i.e. vector-jacobian product) for an efficient implementation. Experiments show that the BiGrad successfully extends existing single-level approaches to Bilevel Programming.

1 Introduction

Neural networks provide unprecedented improvements in perception tasks; however, deep neural networks do not natively protect against adversarial attacks, nor do they preserve the privacy of the training dataset. In recent years, various approaches have been proposed to overcome this limitation (Shafique et al. 2020), for example by integrating adversarial training (Xiao et al. 2020). Some of these approaches require solving optimization problems during training. Recent approaches thus propose differentiable layers that incorporate either quadratic (Amos and Kolter 2017), convex (Agrawal et al. 2019a), cone (Agrawal et al. 2019b), equilibrium (Bai, Kolter, and Koltun 2019), SAT (Wang et al. 2019) or combinatorial (Pogančić et al. 2019; Mandi and Guns 2020; Berthet et al. 2020) programs. Using an optimization program as a layer of a differentiable system requires computing the gradients through that layer. With discrete variables, the gradient is zero almost everywhere, while with complex (black-box) solvers, the gradient may not be accessible.

Proposed gradient estimates either relax the combinatorial problem (Mandi and Guns 2020), perturb the input variables (Berthet et al. 2020; Domke 2010) or linearly approximate the loss function (Pogančić et al. 2019). These approaches, though, do not allow one to directly express models with conflicting objectives, for example in structural learning (Elsken, Metzen, and Hutter 2019) or adversarial systems (Goodfellow et al. 2014). We thus consider the use of bilevel optimization programming as a layer. A Bilevel Optimization Program (Kleinert et al. 2021; Dempe 2018), also known as a generalization of Stackelberg Games, is the extension of a single-level optimization program, where the solution of one optimization problem (i.e. the outer problem) depends on the solution of another optimization problem (i.e. the inner problem). This class of problems can model interactions between two actors, where the action of the first depends on the knowledge of the counter-action of the second. Bilevel Programming finds application in various domains, such as electricity networks, economics, environmental policy, chemical plants, defense, and planning (Dempe 2018). At the end of this section, we introduce example applications of Bilevel Optimization Programming.

In general, Bilevel programs are NP-hard (Dempe 2018) and require specialized solvers, and it is not clear how to extend single-level approaches since the standard chain rule is not directly applicable. By modeling the bilevel optimization problem as an implicit layer (Bai, Kolter, and Koltun 2019), we consider the more general case where 1) the solution of the bilevel problem is computed by a bilevel solver, thus leveraging powerful solvers developed over several decades (Kleinert et al. 2021); and 2) the computation of the gradient is more efficient since we do not have to propagate the gradient through the solver.

We thus propose Differentiating through Bilevel Optimization Programming (BiGrad):

• BiGrad (section 3) comprises a forward pass, where existing solvers (e.g. (Yang, Ji, and Liang 2021)) can be used, and a backward pass, where BiGrad estimates the gradient for both continuous (subsection 2.1, subsection 3.1) and combinatorial (subsection 2.2, subsection 3.2) problems based on sensitivity analysis;

• we show how the proposed gradient estimators relate to their single-level analogues and that the proposed approach is beneficial in both continuous (subsection 5.1) and combinatorial optimization (subsection 5.2, subsection 5.3, subsection 5.4) learning tasks.

Adversarial attack in Machine Learning

Bilevel programming is used to represent the interaction between a machine learning model ( $y$ ) and a potential attacker ( $x$ ) (Goldblum, Fowl, and Goldstein 2019) and to increase the resilience to intentional or unintended adversarial attacks.

Min-max problems

Min-max problems are used to model robust optimization problems (Ben-Tal, El Ghaoui, and Nemirovski 2009), where a second variable represents the environment and is constrained to an uncertain set that captures the unknown variability of the environment.

Closed-loop control of physical systems

Bilevel Programming is able to model the interaction of a dynamical system ( $x$ ) and its control sub-system ( $y$ ), as, for example, of an industrial plant or a physical process. The control sub-system changes based on the state of the underlying dynamical system, which itself solves a physics constraint optimization problem (de Avila Belbute-Peres et al. 2018).

Interdiction problems

Two-actor discrete Interdiction problems (Fischetti et al. 2019) arise when one actor ( $x$ ) tries to interdict the actions of another actor ( $y$ ) under budget constraints. These problems can be found in marketing, protecting critical infrastructure, preventing drug smuggling, and hindering nuclear weapon proliferation.

2 Differentiable Bilevel Optimization Layer

We model the Bilevel Optimization Program as an Implicit Layer (Bai, Kolter, and Koltun 2019), i.e. as the solution of an implicit equation $H(x,y,z)=0$ . We thus compute the gradient using the implicit function theorem, where $z$ is given and represents the parameters of our system that we want to estimate, and $x,y$ are output variables (Fig.1). We also assume we have access to a bilevel solver $(x,y)=\text{Solve}_{H}(z)$ , e.g. (Yang, Ji, and Liang 2021). The Bilevel Optimization Program is then used as a layer of a differentiable system, whose input is $d$ and whose output is given by $u=h_{\psi}\circ\text{Solve}_{H}\circ h_{\theta}(d)=h_{\psi,\theta}(d)$ , where $\circ$ is the function composition operator. We want to learn the parameters $\psi,\theta$ of the function $h_{\psi,\theta}(d)$ that minimize the loss function $L(h_{\psi,\theta}(d),u)$ , using the training data $D^{\text{tr}}=\{(d_{i},u_{i})\}_{i=1}^{N^{\text{tr}}}$ . In order to perform end-to-end training, we need to back-propagate the gradient $\mathrm{d}_{z}L$ through the Bilevel Optimization Program Layer, which cannot be accomplished using the chain rule alone.

2.1 Continuous Bilevel Programming

We now present the definition of the continuous Bilevel Optimization problem, which comprises two non-linear functions $f,g$ , as

$$\min_{x\in X}\ f(x,y,z),\qquad y\in\arg\min_{y\in Y}\ g(x,y,z)\tag{1}$$

where the left problem is called the *outer optimization problem* and solves for the variable $x\in X$ , with $X=\mathbb{R}^{n}$ . The right problem is called the *inner optimization problem* and solves for the variable $y\in Y$ , with $Y=\mathbb{R}^{m}$ . The variable $z\in\mathbb{R}^{p}$ is the input variable and is a parameter of the bilevel problem. Min-max is a special case of the Bilevel optimization problem, $\min_{y\in Y}\max_{x\in X}g(x,y,z)$ , where the two objective functions are equal and opposite in sign. In Sec.A.1, we describe how the model of Eq.1 can be extended to the case of linear and nonlinear constraints.
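To make the outer/inner structure concrete, the following minimal sketch solves a hypothetical scalar bilevel problem by nested gradient descent. The objectives $f=\frac{1}{2}x^{2}+\frac{1}{2}(y-1)^{2}$ and $g=\frac{1}{2}(y-x-z)^{2}$, the step sizes and the function names are assumptions chosen for illustration only; they are not taken from the paper's experiments. For this choice the inner solution is $y^{*}(x)=x+z$, so the outer solution is $x^{*}=(1-z)/2$, which the code should recover numerically.

```python
def inner_solution(x, z, steps=60, lr=0.5):
    """Approximate y*(x, z) = argmin_y g(x, y, z) for g = 0.5 * (y - x - z)**2."""
    y = 0.0
    for _ in range(steps):
        y -= lr * (y - x - z)  # gradient of g with respect to y
    return y


def outer_objective(x, z):
    """Outer objective f evaluated at the inner best response y*(x, z)."""
    y = inner_solution(x, z)
    return 0.5 * x**2 + 0.5 * (y - 1.0) ** 2


def solve_bilevel(z, steps=50, lr=0.25, h=1e-5):
    """Gradient descent on the outer variable x, using a central finite
    difference of the reduced objective x -> f(x, y*(x, z), z)."""
    x = 0.0
    for _ in range(steps):
        grad = (outer_objective(x + h, z) - outer_objective(x - h, z)) / (2 * h)
        x -= lr * grad
    return x, inner_solution(x, z)


z = 0.2
x_star, y_star = solve_bilevel(z)
print(x_star, y_star)  # close to (1 - z) / 2 and (1 + z) / 2
```

Nested descent is used here only because it is easy to read; a dedicated bilevel solver would normally take the place of `solve_bilevel`.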

2.2 Combinatorial Bilevel Programming

When the variables are discrete, we restrict the objective functions to be multi-linear (Greub 1967). Various important combinatorial problems are linear in discrete variables (e.g. VRP, TSP, SAT; footnote: Vehicle Routing Problem, Boolean satisfiability problem). One example form is the following

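$$\min_{x\in X}\ \langle z,x\rangle_{A}+\langle y,x\rangle_{B},\qquad y\in\arg\min_{y\in Y}\ \langle w,y\rangle_{C}+\langle x,y\rangle_{D}\tag{2}$$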

The variables $x,y$ have domains in $x\in X,y\in Y$ , where $X,Y$ are convex polytopes that are constructed from a set of distinct points $\mathcal{X}\subset\mathbb{R}^{n},\mathcal{Y}\subset\mathbb{R}^{m},$ as their convex hull. The outer and inner problems are Integer Linear Programs (ILPs). The multi-linear operator is represented by the inner product $\langle x,y\rangle_{A}=x^{T}Ay$ . We only consider the case where we have separate parameters for the outer and inner problems, $z\in\mathbb{R}^{p}$ and $w\in\mathbb{R}^{q}$ .

3 BiGrad: Gradient estimation

BiGrad provides gradient estimations for both continuous and discrete problems. We can identify the following common basic steps (Alg.1):

1. In the forward pass, solve the combinatorial or continuous Bilevel Optimization problem as defined in Eq.1 (or Eq.2) using an existing solver ( $\text{Solve}_{H}(z)$ ), e.g. (Yang, Ji, and Liang 2021);

2. During the backward pass, compute the gradient $\mathrm{d}_{z}L$ (and $\mathrm{d}_{w}L$ ) using the suggested gradients (Sec.3.1 and Sec.3.2), starting from the gradients on the output variables $\nabla_{x}L$ and $\nabla_{y}L$ .

3.1 Continuous Optimization gradient estimation

To evaluate the gradient of the loss function $L$ with respect to the variable $z$ , we need to propagate the gradients of the two output variables $x,y$ through the two optimization problems. We can use the implicit function theorem to locally approximate the function $z\to(x,y)$ . We thus have the following main results (proofs are in the Supplementary Material).

Theorem 1.

*Considering the bilevel problem of Eq.1, we can build the following set of equations that represent the equivalent problem around a given solution $x^{*},y^{*},z^{*}$ :*

[TABLE]

*where*

[TABLE]

where we used the short notation $f=f(x,y,z),g=g(x,y,z),F=F(x,y,z),G=G(x,y,z)$ .

Theorem 2.

Consider the problem defined in Eq.1; then the total gradient of the loss function $L(x,y,z)$ with respect to the parameter $z$ is computed from the partial gradients $\nabla_{x}L,\nabla_{y}L,\nabla_{z}L$ as

[TABLE]

The implicit layer is thus defined by the two conditions $F(x,y,z)=0$ and $G(x,y,z)=0$ . We notice that Eq.5 can be solved without explicitly computing the Jacobian matrices and inverting the system: by adopting the Vector-Jacobian product approach, we can proceed from left to right to evaluate $\mathrm{d}_{z}L$ . In the following section, we describe how affine equality constraints and nonlinear inequalities can be used when modeling $f,g$ . We also notice that the solution of Eq.5 does not require solving the original problem, but only applying matrix-vector products, i.e. linear algebra, and evaluating gradients that can be computed using automatic differentiation. The extension of Theorem 2 to cone programming is presented in Sec.A.2.
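The left-to-right evaluation described above can be illustrated with a generic implicit-function-theorem computation. The sketch below is not the paper's Eq.5; it assumes a hypothetical layer whose stacked conditions $H=(F,G)$ are linear, $H(s,z)=Ms+Nz+c=0$ with $s=(x,y)$, and a loss with no direct dependence on $z$. All names (`M`, `N`, `c`, `target`) are illustrative. The backward pass solves one linear system with $M^{T}$ (the adjoint, or vector-Jacobian, form) instead of forming $\mathrm{d}s/\mathrm{d}z=-M^{-1}N$, and the result is checked against central finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, p = 3, 2, 4  # sizes of x, y and z (arbitrary for this example)

# Hypothetical linear implicit conditions H(s, z) = M s + N z + c = 0, s = (x, y).
M = rng.normal(size=(n + m, n + m)) + 5.0 * np.eye(n + m)  # dH/ds, well conditioned
N = rng.normal(size=(n + m, p))                            # dH/dz
c = rng.normal(size=n + m)
target = rng.normal(size=n + m)


def solve(z):
    """Forward pass: the solution s(z) of H(s, z) = 0."""
    return np.linalg.solve(M, -(N @ z + c))


def loss(z):
    """Loss on the layer output only (its partial gradient in z is zero)."""
    s = solve(z)
    return 0.5 * np.sum((s - target) ** 2)


def implicit_grad(z):
    """Backward pass: d_z L = -(dH/dz)^T v, where (dH/ds)^T v = grad_s L."""
    s = solve(z)
    grad_s = s - target                # stacked (grad_x L, grad_y L)
    v = np.linalg.solve(M.T, grad_s)   # one adjoint solve, no explicit inverse
    return -N.T @ v


z0 = rng.normal(size=p)
g_implicit = implicit_grad(z0)

# Central finite differences as an independent check.
eps = 1e-6
g_fd = np.array(
    [(loss(z0 + eps * e) - loss(z0 - eps * e)) / (2 * eps) for e in np.eye(p)]
)
print(np.max(np.abs(g_implicit - g_fd)))
```

For nonlinear $F,G$ the matrices `M` and `N` would be replaced by Jacobians evaluated at the solution, typically applied as vector-Jacobian products through automatic differentiation rather than stored explicitly.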

3.2 Combinatorial Optimization gradient estimation

When we consider discrete variables, the gradient is zero almost everywhere. We thus need to resort to estimating gradients. For the bilevel problem with discrete variables of Eq.2, when the solution of the bilevel problem exists and its solution is given (Kleinert et al. 2021), Thm.3 gives the gradients of the loss function with respect to the input parameters.

Theorem 3.

Given the Eq.2 problem, the partial variation of a cost function $L(x,y,z,w)$ on the input parameters has the following form:

[TABLE]

The $\nabla_{x}y,\nabla_{y}x$ terms capture the interaction between outer and inner problems. We could estimate the gradients in Thm.3 using the perturbation approach suggested in (Berthet et al. 2020), which estimates the gradient as the expected value of the gradient of the problem after perturbing the input variable, but, similar to REINFORCE (Williams 1992), this introduces large variance. While it is possible to reduce variance in some cases (Grathwohl et al. 2017) with the use of additional trainable functions, we consider alternative approaches as described in the following.

Differentiation of black box combinatorial solvers

(Pogančić et al. 2019) propose a way to propagate the gradient through a single-level combinatorial solver, where $\nabla_{z}L\approx\frac{1}{\tau}[x(z+\tau\nabla_{x}L)-x(z)]$ when $x(z)=\arg\max_{x\in X}\langle x,z\rangle$ . We thus propose to compute the variation on the input variables from the two separate problems of the Bilevel Problem:

[TABLE]

or alternatively, if we have only access to the Bilevel solver and not to the separate ILP solvers, we can express

[TABLE]

where $x(z,y)$ and $y(w,x)$ represent the solutions of the two problems separately, $s(v)=(z,w)\to(x,y)$ is the complete solution to the Bilevel Problem, $\tau\to 0$ is a hyper-parameter and $E=\begin{bmatrix}A&0\\ 0&C\end{bmatrix}$ . This form is more convenient than Eq.6 since it does not require computing the cross terms, thus ignoring the interaction of the two levels.
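The single-level estimator quoted above, $\nabla_{z}L\approx\frac{1}{\tau}[x(z+\tau\nabla_{x}L)-x(z)]$ with $x(z)=\arg\max_{x\in X}\langle x,z\rangle$ , can be sketched on a toy problem. Here $X$ is assumed to be the set of one-hot vectors and the loss is assumed to be linear, $L(x)=\langle c,x\rangle$ ; the solver, the cost vector `c` and the value of `tau` are hypothetical choices for illustration and not part of the paper's setup.

```python
import numpy as np


def solve_argmax_onehot(z):
    """Toy black-box solver: argmax of <x, z> over one-hot vectors x."""
    x = np.zeros_like(z, dtype=float)
    x[np.argmax(z)] = 1.0
    return x


def blackbox_grad(z, grad_x, tau, solver=solve_argmax_onehot):
    """Estimate grad_z L as (x(z + tau * grad_x L) - x(z)) / tau."""
    return (solver(z + tau * grad_x) - solver(z)) / tau


z = np.array([1.0, 2.0, 0.5])
c = np.array([3.0, 0.0, 0.0])  # assumed loss L(x) = <c, x>, so grad_x L = c

grad_z = blackbox_grad(z, c, tau=1.0)
print(grad_z)
```

The estimate costs one extra solver call per backward pass. A descent step on `z` lowers the score of the costly item that the perturbed call selects, so the solver is moved away from it. When `tau` is too small for the perturbation to change the solution, the estimate is exactly zero, which is why `tau` is treated as a hyper-parameter.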

Straight-Through gradient

In estimating the input variables $z,w$ of our model, we may not be interested in the interaction between the two variables $x,y$ . Let us consider, for example, the squared $\ell_{2}$ loss function defined over the output variables

$$L^{2}(x,y)=L^{2}(x)+L^{2}(y)$$

where $L^{2}(x)=\frac{1}{2}\|x-x^{*}\|^{2}_{2}$ and $x^{*}$ is the true value. The loss is non-zero only when the two vectors disagree; with integer variables, it counts the squared difference, and in the case of binary variables, it counts the number of differences. If we compute $\nabla_{x}L^{2}(x)=(x-x^{*})$ in the binary case, we have that $\nabla_{x_{i}}L^{2}(x)=+1$ if $x^{*}_{i}=0\land x_{i}=1$ , $\nabla_{x_{i}}L^{2}(x)=-1$ if $x^{*}_{i}=1\land x_{i}=0$ , and $\nabla_{x_{i}}L^{2}(x)=0$ otherwise. This information can be directly used to update the $z_{i}$ variable in the linear term $\langle z,x\rangle$ ; thus we can estimate the gradients of the input variables as $\nabla_{z_{i}}L^{2}=-\lambda\nabla_{x_{i}}L^{2}$ and $\nabla_{w_{i}}L^{2}=-\lambda\nabla_{y_{i}}L^{2}$ , with some weight $\lambda>0$ . The intuition is that the weight $z_{i}$ associated with the variable $x_{i}$ is increased when the value of the variable $x_{i}$ needs to be reduced. In the general multilinear case, we have additional multiplicative terms. Following this intuition (see Sec.A.3), we thus use as an estimate of the gradient of the variables

[TABLE]

This is equivalent in Eq.2 where $\nabla_{z}x=\nabla_{w}y=-I$ and $\nabla_{y}x=0$ , thus $\nabla_{x}y=0$ . This update is also equivalent to Eq.7, without the solution computation. The advantage of this form is that it does not require solving for an additional solution in the backward pass. For the single-level problem, the gradient has the same form as the Straight-Through gradient proposed by (Bengio, Léonard, and Courville 2013), with surrogate gradient $\nabla_{z}x=-I$ .
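The straight-through update $\nabla_{z_{i}}L^{2}=-\lambda\nabla_{x_{i}}L^{2}$ can be tried on a small single-level example. The sketch assumes a hypothetical solver that minimizes $\langle z,x\rangle$ over binary vectors with exactly $k$ ones, consistent with the minimization form of Eq.2; the solver, the step size `0.6` and the target `x_true` are illustrative assumptions.

```python
import numpy as np


def solve_k_cheapest(z, k):
    """Toy solver: argmin of <z, x> over binary x with exactly k ones."""
    x = np.zeros_like(z, dtype=float)
    x[np.argsort(z)[:k]] = 1.0
    return x


def straight_through_grad(x, x_true, lam=1.0):
    """grad_z L2 = -lam * grad_x L2, with grad_x L2 = x - x_true."""
    return -lam * (x - x_true)


z = np.array([1.0, 2.0, 3.0, 4.0])
x_true = np.array([1.0, 0.0, 1.0, 0.0])

x_before = solve_k_cheapest(z, k=2)     # selects items 0 and 1
grad_z = straight_through_grad(x_before, x_true)
z_new = z - 0.6 * grad_z                # one gradient-descent step on z
x_after = solve_k_cheapest(z_new, k=2)  # now selects items 0 and 2
print(x_before, x_after)
```

The step raises the cost of the wrongly selected item and lowers the cost of the missed one, so the solver output moves to the target without any extra solver call in the backward pass.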

4 Related Work

Bilevel Programming in machine learning

Various papers model machine learning problems as Bilevel problems, for example in Hyper-parameter Optimization (MacKay et al. 2019; Franceschi et al. 2018), Meta-Feature Learning (Li and Malik 2016), Meta-Initialization Learning (Rajeswaran et al. 2019), Neural Architecture Search (Liu, Simonyan, and Yang 2018), Adversarial Learning (Li et al. 2019) and Multi-Task Learning (Alesiani et al. 2020). In these works, the main focus is to compute the solution to the bilevel optimization problems. In (MacKay et al. 2019; Lorraine and Duvenaud 2018), the best response function is modeled as a neural network and the solution is found using iterative minimization, without attempting to estimate the complete gradient. Many bilevel approaches rely on the use of the implicit function to compute the hyper-gradient (Sec. 3.5 of (Colson, Marcotte, and Savard 2007)) but do not use bilevel as a layer.

Quadratic, Cone and Convex single-level Programming

Various works have addressed the problem of differentiating through quadratic, convex, or cone programming (Amos 2019; Amos and Kolter 2017; Agrawal et al. 2019b, a). In these approaches, the optimization layer is modeled as an implicit layer and, for the cone/convex case, the normalized residual map is used to propagate the gradients. Contrary to our approach, these works only address single-level problems and do not consider combinatorial optimization.

Implicit layer Networks

While classical deep neural networks perform a single pass through the network at inference time, a new class of systems performs inference by solving an optimization problem. Examples of this are Deep Equilibrium Network (DEQ) (Bai, Kolter, and Koltun 2019) and Neural ODE (NODE) (Chen et al. 2018). Similar to our approach, the gradient is computed based on a sensitivity analysis of the current solution. These methods only consider continuous optimization.

Combinatorial Optimization (CO)

Various papers estimate gradients of single-level combinatorial problems using relaxation. For example, (Wilder, Dilkina, and Tambe 2019; Elmachtoub and Grigas 2017; Ferber et al. 2020; Mandi and Guns 2020) use $\ell_{1},\ell_{2}$ or log barrier terms to relax the Integer Linear Programming (ILP) problem. Once relaxed, the problem is solved using standard methods for continuous variable optimization. Other papers suggest an alternative approach. For example, in (Pogančić et al. 2019) the loss function is approximated with a linear function, which leads to an estimate of the gradient of the input variable similar to the implicit differentiation by perturbation form (Domke 2010). (Berthet et al. 2020) also uses perturbation, together with a change of variables, to estimate the gradient in an ILP problem. SatNet (Wang et al. 2019) solves MAXSAT problems by solving a continuous semidefinite program (SDP) relaxation of the original problem. These works only consider single-level problems.

Discrete latent variables

Discrete random variables provide an effective way to model multi-modal distributions over discrete values, which can be used in various machine learning problems. Gradients of discrete distributions are not mathematically defined; thus, in order to use gradient-based methods, gradient estimators have been proposed. One class of methods is based on the Gumbel-Softmax estimator (Maddison, Mnih, and Teh 2016). Gradients of the exponential family of distributions over discrete variables are estimated using the perturb-and-MAP method in (Niepert, Minervini, and Franceschi 2021).

Predict then optimize

Predict then Optimize (two-stage) (Elmachtoub and Grigas 2017; Ferber et al. 2020) or solving linear programs and submodular maximization from (Wilder, Dilkina, and Tambe 2019) solve optimization problems when the cost variable or the minimization function is directly observable. On the contrary, in our approach we only have access to a loss function on the output of the bilevel problem, thus allowing us to use it as a layer.

Neural Combinatorial Optimization (NCO)

NCO employs deep neural networks to derive efficient CO heuristics. NCO includes supervised learning (Joshi, Laurent, and Bresson 2019) and reinforcement learning (Kool, Van Hoof, and Welling 2019).

5 Experiments

We evaluate BiGrad with continuous and combinatorial problems to show that it improves over single-level approaches. In the first experiment, we compare the use of BiGrad versus the use of the implicit layer proposed in (Amos and Kolter 2017) for the design of Optimal Control with adversarial noise. In the second part, after experimenting with an adversarial attack, we explore the performance of BiGrad with two combinatorial problems with Interdiction, where we adapted the experimental setup proposed in (Pogančić et al. 2019). In these latter experiments, we compare the formulation in Eq.8 (denoted by BiGrad(BB)) and the formulation of Eq.9 (denoted by BiGrad(PT)). In addition, we compare with the single-level BB-1 gradient estimation from (Pogančić et al. 2019) and the single-level straight-through gradient estimation (PT-1) (Bengio, Léonard, and Courville 2013; Paulus, Maddison, and Krause 2021), with the surrogate gradient $\nabla_{z}x=-I$ . We also compare against Supervised learning (SL), which ignores the underlying structure of the problem and directly predicts the solution of the bilevel problem.

5.1 Optimal Control with adversarial disturbance

We consider the design of robust stochastic control for a Dynamical System (Agrawal et al. 2019b). The problem is to find a feedback function $u=\phi(x)$ that minimizes

[TABLE]

where $x_{t}\in\mathbb{R}^{n}$ is the state of the system, while $w_{t}$ is an i.i.d. random disturbance and $x_{0}$ is the given initial state.

To solve this problem we use Approximate Dynamic Programming (ADP) (Wang and Boyd 2010) that solves a proxy quadratic problem

[TABLE]

We can use the optimization layer as shown in Fig.2(a) and update the problem variables (e.g. $P,Q,q$ ) using gradient descent. We use the linear quadratic regulator (LQR) solution as the initial solution (Kalman 1964). The optimization module is replicated for each time step $t$ , similarly to a Recurrent Neural Network (RNN).

We can build a resilient version of the controller under the hypothesis that an adversary is able to inject noise of limited energy that may depend arbitrarily on the control $u$ , by solving the following bilevel optimization problem

[TABLE]

where $Q(u,x)=u^{T}Pu+x_{t}^{T}Qu+q^{T}u$ and we want to learn the parameters $z=(P,Q,q)$ , with $y=u_{t}$ and $x=\epsilon$ in the notation of Eq.1.

We evaluate the performance to verify the viability of the proposed approach and compare with LQR and OptNet (Amos and Kolter 2017), where the outer problem is substituted with the best response function that computes the adversarial noise based on the computed output; in this case, the adversarial noise is a scaled version of $Qu$ of Eq.11. Tab.1 and Fig.2(b) present the performance using BiGrad, LQR and the adversarial version of OptNet. BiGrad improves over two-step OptNet (Tab.1) because it is able to better model the interaction between noise and control dynamics.

5.2 Adversarial ML with discrete latent variables

Machine learning models are heavily affected by the injection of intentional noise (Madry et al. 2017; Goodfellow, Shlens, and Szegedy 2014). An adversarial attack typically requires access to the machine learning model; in this way, the attack model can be used during training to include its effect. Instead of training an end-to-end system as in (Goldblum, Fowl, and Goldstein 2019), where the attacker is aware of the model, we consider the case where the attacker can inject noise at the feature level, as opposed to the input level (as in (Goldblum, Fowl, and Goldstein 2019)). This allows us to model the interaction as a bilevel problem. Thus, to demonstrate the use of a bilevel layer, we design a system that is composed of a feature extraction layer, followed by a discretization layer that operates on the space $\{0,1\}^{m}$ , where $m$ is the hidden feature size, followed by a classification layer. The network used in the experiments is composed of two convolutional layers with max-pooling and two linear layers, all with ReLU activation functions, while the classification layer is linear. We consider a more limited attacker that is not aware of the loss function of the model and does not have access to the full model, but rather only to the input of the discrete layer, and is able to switch $Q$ discrete variables. The interaction of the discrete layer with the attacker is described by the following bilevel problem:

$$\min_{x\in Q}\max_{y\in B}\ \langle z+x,y\rangle\tag{13}$$

where $Q$ represents the set of all possible attacks, $B$ is the budget of the discretization layer and $y$ is the output of the layer. For the simulation, we compute the solution by sorting the features by value and considering only the first $B$ values, while the attacker obscures (i.e. sets to zero) the first $Q$ positions. The output $y$ thus has ones on the $Q$ to $B$ non-zero positions, and zeros elsewhere. We train three models on the CIFAR-10 dataset for $50$ epochs. For comparison we consider: 1) the vanilla CNN network (i.e. without the discrete features); 2) the network with the single-level problem (i.e. the single-level problem without attacker); and 3) the network with the bilevel problem (i.e. the min-max discretization problem defined in Eq.13). We then test the robustness of the networks to adversarial attack using the PGD attack (Madry et al. 2017), similar to (Goldblum, Fowl, and Goldstein 2019). Similar results apply for the FGSM attack (Fast Gradient Sign Method) (Goodfellow, Shlens, and Szegedy 2014). We also tested the network trained as a vanilla network, where we added the min-max layer after training. From the results (Tab.2), we notice: 1) the min-max network shows improved resilience to adversarial attack with respect to the vanilla network, and also with respect to the max (single-level) network; 2) the min-max layer applied to the vanilla trained network is beneficial against adversarial attack; 3) the min-max network does not significantly change performance in the presence of an adversarial attack at the discrete layer (i.e. between $Q=0$ and $Q=10$ ). This example shows how bilevel layers can be successfully integrated into a Machine Learning system as differentiable layers.
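The simulation rule described above (sort the features, keep the first $B$, let the attacker zero out the first $Q$ of them) can be written in a few lines. This is a minimal sketch of that rule as stated in the text, not the authors' implementation; the function name, the descending sort order and the example values are assumptions.

```python
import numpy as np


def minmax_discretize(features, B, Q):
    """Binary output with ones on the positions ranked Q..B-1 by feature value.

    The top-B features are selected by the layer; the attacker is assumed to
    obscure the Q largest of them, so only the remaining B - Q stay active.
    """
    order = np.argsort(-features)  # indices sorted by decreasing value
    y = np.zeros_like(features, dtype=float)
    y[order[Q:B]] = 1.0
    return y


features = np.array([0.9, 0.1, 0.5, 0.7, 0.3])
y = minmax_discretize(features, B=3, Q=1)
print(y)
```

With `Q=0` the function reduces to the single-level (attacker-free) discretization that keeps the top `B` features.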

5.3 Dynamic Programming: Shortest path with Interdiction

We consider the problem of the Shortest Path with Interdiction, where the set of possible valid paths (see Fig.3(a)) is $Y$ and the set of all possible interdictions is $X$. The mathematical problem can be written as

[TABLE]

where $\odot$ is the element-wise product. This problem is multi-linear in the discrete variables $x,y,z$ .

The $z,w$ variables are the output of the neural network whose inputs are the Warcraft II tile images. The aim is to train the parameters of the weight network such that we can solve the shortest path problem based only on the input image. For the experiments, we followed and adapted the scenario of (Pogančić et al. 2019) and used the Warcraft II tile maps of (Guyomarch 2017). We implemented the interdiction game using a two-stage min-max-min algorithm (Kämmerling and Kurtz 2020). Fig.3(b) shows the effect of interdiction on the final solution. Tab.3 shows the performance of the proposed approaches, where we allow for $B=3$ interdictions and use tile sizes of $12\times 12$, $18\times 18$, $24\times 24$. The loss function is the Hamming and $\ell_{1}$ loss evaluated on both the shortest path $y$ and the intervention $x$. The gradient estimated using Eq.8 (BB) provides more accurate results, at double the computation cost of PT. The single-level BB-1 approach outperforms PT, with similar computational complexity, while single-level PT-1 is inferior to PT. As expected, SL outperforms the other methods during training, but completely fails during validation. BiGrad improves over single-level approaches because it includes the interaction of the two problems.

5.4 Combinatorial Optimization: Traveling Salesman Problem (TSP) with Interdiction

The Traveling Salesman Problem (TSP) with interdiction consists of finding the shortest route $y\in Y$ that touches all cities, where some connections $x\in X$ can be removed. The mathematical problem to solve is given by

[TABLE]

where $z,w$ are cost matrices for the salesman and interceptor. Similar to the dynamic programming experiment, we implemented the interdiction game using a two-stage min-max-min algorithm (Kämmerling and Kurtz 2020). Fig.4 shows the effect of a single interdiction. The aim is to learn the weight matrices, trained with the interdicted solutions on a subset of the cities. Tab.4 describes the performance in terms of accuracy on both the shortest tour and the intervention. We use the Hamming and $\ell_{1}$ loss functions. We only allow for $B=1$ intervention, but consider $k=8,10$ and $12$ cities from a total of $100$ cities. Single-level and two-level approaches perform similarly in training and validation. Since the number of interdictions is limited to one, the performance of the single-level approach is not catastrophic, while the supervised learning approach completely fails on the validation set. BiGrad thus improves over the single-level and SL approaches. Since BiGrad (PT) performs similarly to BiGrad (BB), PT is preferable in this scenario, as it requires fewer computational resources.

6 Conclusions

BiGrad generalizes existing single-level gradient estimation approaches and is able to incorporate Bilevel Programming as a learnable layer in modern machine learning frameworks, which allows modeling conflicting objectives, as in adversarial attacks. The proposed novel gradient estimators are also efficient, and the proposed framework is widely applicable to both continuous and discrete problems. BiGrad adds a marginal or comparable cost relative to the complexity of computing the solution of the Bilevel Programming problems. We show how BiGrad is able to learn complex logic when the cost functions are multi-linear.

Ethical Statement and Limitations

The present work does not have ethical implications, but shares with all other machine learning approaches the potential to be used in a large multitude of applications; we expect our contribution to be used for the benefit and progress of our society. Our approach models bilevel problems with both discrete and continuous variables, but we have not explored the mixed integer programming approach, with mixed variables. We rely on existing solvers to compute the current solution, and we leave to future work the exploration of ways to accelerate the solution of bilevel problems.

Appendix A Supplementary Material

A.1 Extension for linear equalities and non-linear inequalities

Linear Equality constraints

To extend the model of Eq.1 to include linear equality constraints of the form $Ax=b$ and $By=c$ on the outer and inner problem variables, we use the following change of variables

$$x=x_{0}+A^{\perp}u,\qquad y=y_{0}+B^{\perp}v$$

where the columns of $A^{\perp},B^{\perp}$ are bases of the null spaces of $A$ and $B$, i.e. $AA^{\perp}=0,BB^{\perp}=0$, and $x_{0},y_{0}$ are particular solutions of the equations, i.e. $Ax_{0}=b,By_{0}=c$.
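As a small numerical illustration of this change of variables, the sketch below checks that every $x(u)=x_{0}+A^{\perp}u$ satisfies $Ax=b$ for a hand-picked constraint. The matrix `A`, the particular solution `x0` and the null-space basis `A_perp` are hypothetical values chosen for the example; in practice a basis would be computed numerically (for instance from a singular value decomposition).

```python
def matvec(matrix, vector):
    """Dense matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


# Hypothetical constraint A x = b with A = [1 1 1] and b = 3.
A = [[1.0, 1.0, 1.0]]
b = [3.0]
x0 = [3.0, 0.0, 0.0]                      # one particular solution, A x0 = b
A_perp = [[1.0, 1.0],                     # columns span the null space of A,
          [-1.0, 0.0],                    # so A A_perp = 0
          [0.0, -1.0]]


def x_of_u(u):
    """Unconstrained parametrization x(u) = x0 + A_perp u."""
    return [x0_i + d_i for x0_i, d_i in zip(x0, matvec(A_perp, u))]


x = x_of_u([0.5, -2.0])
residual = [r - b_i for r, b_i in zip(matvec(A, x), b)]
```

The optimization can then run over the free variable `u` without any explicit equality constraint.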

Non-linear Inequality constraints

Similarly, to extend the model of Eq.1 to non-linear inequality constraints, we use the barrier method approach (Boyd and Vandenberghe 2004), where a logarithmic penalty discourages the variable from violating the constraints. Specifically, let us consider the case where $f_{i},g_{i}$ are inequality constraint functions, i.e. $f_{i}<0,g_{i}<0$, for the outer and inner problems. We then define new functions

[TABLE]

where $t$ is a variable parameter that depends on the violation of the constraints: the closer the solution is to violating the constraints, the larger the value of $t$.

A.2 Bilevel Cone programming

We show here how Theorem 2 can be applied to bilevel cone programming, extending single-level cone programming results (Agrawal et al. 2019b), where we can use efficient solvers for cone programs to compute a solution of the bilevel problem (Ouattara and Aswani 2018):

[TABLE]

In this bilevel cone program, the inner and outer problems are both cone programs, where $R(y),P(x)$ represent linear transformations, $C,r,D,p$ are new parameters of the problem, and $\mathcal{K}$ is the conic domain of the variables. Under the hypothesis that a local minimum of Eq.18 exists, we can use an interior point method to find such a point. To compute the bilevel gradient, we then use the residual maps (Busseti, Moursi, and Boyd 2019) of the outer and inner problems. Indeed, we can apply Theorem 2, where $F=N_{1}(x,Q,y)$ and $G=N_{2}(y,Q,x)$ are the normalized residual maps, defined in (Busseti, Moursi, and Boyd 2019; Agrawal et al. 2019a), of the outer and inner problems.

A.3 Proofs

Proof of Linear Equality constraints.

Here we show that

$$x(u)=x_{0}+A^{\perp}u$$

includes all solutions of $Ax=b$. First, we have $AA^{\perp}=0$ and $Ax_{0}=b$ by definition. This implies that $Ax(u)=A(x_{0}+A^{\perp}u)=Ax_{0}=b$, thus $Ax(u)=b$ for all $u$. Conversely, for any solution $x^{\prime}$ of $Ax^{\prime}=b$, the difference $x^{\prime}-x_{0}$ belongs to the null space of $A$, indeed $A(x^{\prime}-x_{0})=Ax^{\prime}-Ax_{0}=b-b=0$. The null space of $A$ has dimension $n-\rho(A)$, where $\rho(A)$ is the rank of $A$. If $\rho(A)=n$, where $A\in\mathbb{R}^{m\times n},m\geq n$, then there is only one solution $x=x_{0}=A^{\dagger}b$, with $A^{\dagger}$ the pseudo-inverse of $A$. If $\rho(A)<n$, then $A^{\perp}$, with $\rho(A^{\perp})=n-\rho(A)$, is a basis of all vectors s.t. $Ax(u)=b$, since $n-\rho(A)$ is the dimension of the null space of $A$. In fact $A^{\perp}$ is a basis for the null space of $A$. The same applies for $y(v)=y_{0}+B^{\perp}v$ and $By(v)=c$. ∎

Proof of Theorem 1.

The second equation is derived by imposing the optimality condition on the inner problem. Since we have neither inequality nor equality constraints, the optimal solution must set the gradient w.r.t. $y$ to zero, thus $G=\nabla_{y}g=0$. The first equation is likewise the optimality condition of the $x$ variable w.r.t. the total derivative or hyper-gradient, thus we have that $0=\mathrm{d}_{x}f=\nabla_{x}f+\nabla_{y}f\nabla_{x}y$. In order to compute the variation of $y$, i.e. $\nabla_{x}y$, we apply the implicit function theorem to the inner problem, i.e. $\nabla_{x}G+\nabla_{y}G\nabla_{x}y=0$, thus obtaining $\nabla_{x}y=-\nabla^{-1}_{y}G\nabla_{x}G$. ∎
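The hyper-gradient formula in the proof can be checked on a scalar toy problem. The sketch below assumes the hypothetical objectives $f(x,y)=\tfrac{1}{2}(y-1)^{2}+\tfrac{1}{2}x^{2}$ and $g(x,y)=(y-2x)^{2}+\tfrac{1}{2}y^{2}$, which are not taken from the paper, so that $G=\nabla_{y}g=3y-4x$ and the inner solution is $y^{*}(x)=4x/3$. It compares $\nabla_{x}f+\nabla_{y}f\nabla_{x}y$ against a finite difference of the outer objective.

```python
def inner_solution(x):
    # Closed-form minimizer of g(x, y) = (y - 2x)^2 + 0.5*y^2,
    # i.e. the root of G = 3y - 4x.
    return 4.0 * x / 3.0


def hypergradient(x):
    """d_x f = grad_x f + grad_y f * grad_x y, with grad_x y from the
    implicit function theorem applied to G(x, y) = 0."""
    y = inner_solution(x)
    grad_x_f = x            # partial of f w.r.t. x
    grad_y_f = y - 1.0      # partial of f w.r.t. y
    grad_x_G = -4.0         # partial of G w.r.t. x
    grad_y_G = 3.0          # partial of G w.r.t. y
    dy_dx = -grad_x_G / grad_y_G
    return grad_x_f + grad_y_f * dy_dx


def outer_objective(x):
    y = inner_solution(x)
    return 0.5 * (y - 1.0) ** 2 + 0.5 * x ** 2


def finite_difference(x, h=1e-6):
    # Central difference of the outer objective along x.
    return (outer_objective(x + h) - outer_objective(x - h)) / (2.0 * h)
```

For this toy problem the first equation of the theorem, $\mathrm{d}_{x}f=0$, holds at $x=12/25$, where `hypergradient` vanishes up to rounding.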

Proof of Theorem 2.

In order to prove the theorem, we use the Discrete Adjoint Method (DAM). Let us consider a cost function or functional $L(x,y,z)$ evaluated at the output of our system. Our system is defined by the two equations $F=0,G=0$ from Theorem 1. Let us first consider the total variations: $\mathrm{d}L,\ \mathrm{d}F=0,\ \mathrm{d}G=0$, where the last conditions are true by definition of the bilevel problem. When we expand the total variations, we obtain

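$$\begin{aligned}\mathrm{d}L&=\nabla_{x}L\,\mathrm{d}x+\nabla_{y}L\,\mathrm{d}y+\nabla_{z}L\,\mathrm{d}z\\ \mathrm{d}F&=\nabla_{x}F\,\mathrm{d}x+\nabla_{y}F\,\mathrm{d}y+\nabla_{z}F\,\mathrm{d}z\\ \mathrm{d}G&=\nabla_{x}G\,\mathrm{d}x+\nabla_{y}G\,\mathrm{d}y+\nabla_{z}G\,\mathrm{d}z\end{aligned}$$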

We now consider $\mathrm{d}L+\mathrm{d}F\lambda+\mathrm{d}G\gamma=[\nabla_{x}L+\nabla_{x}F\lambda+\nabla_{x}G\gamma]\mathrm{d}x+[\nabla_{y}L+\nabla_{y}F\lambda+\nabla_{y}G\gamma]\mathrm{d}y+[\nabla_{z}L+\nabla_{z}F\lambda+\nabla_{z}G\gamma]\mathrm{d}z$. We require the first two bracketed terms to be zero, which determines the two free variables $\lambda,\gamma$:

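$$\begin{aligned}\nabla_{x}L+\nabla_{x}F\lambda+\nabla_{x}G\gamma&=0\\ \nabla_{y}L+\nabla_{y}F\lambda+\nabla_{y}G\gamma&=0\end{aligned}$$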

or in matrix form

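$$\begin{pmatrix}\nabla_{x}F&\nabla_{x}G\\ \nabla_{y}F&\nabla_{y}G\end{pmatrix}\begin{pmatrix}\lambda\\ \gamma\end{pmatrix}=-\begin{pmatrix}\nabla_{x}L\\ \nabla_{y}L\end{pmatrix}$$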

We can now compute $\mathrm{d}_{z}L=\nabla_{z}L+\nabla_{z}F\lambda+\nabla_{z}G\gamma$ with $\lambda,\gamma$ from the previous equation. ∎
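The adjoint construction of this proof can be traced on a scalar system. The sketch below assumes the hypothetical stationarity conditions $F=2x+y-z=0$ and $G=x+3y-2z=0$ and the cost $L=x^{2}+yz$, none of which come from the paper. Solving the system gives $x=z/5$, $y=3z/5$, hence $L(z)=\tfrac{16}{25}z^{2}$ and $\mathrm{d}_{z}L=\tfrac{32}{25}z$, which the adjoint computation should reproduce.

```python
def solve_2x2(a11, a12, a21, a22, r1, r2):
    """Solve [[a11, a12], [a21, a22]] [s, t]^T = [r1, r2]^T by Cramer's rule."""
    det = a11 * a22 - a12 * a21
    return (r1 * a22 - a12 * r2) / det, (a11 * r2 - a21 * r1) / det


def adjoint_gradient(z):
    # Solution of F = 2x + y - z = 0 and G = x + 3y - 2z = 0.
    x, y = z / 5.0, 3.0 * z / 5.0
    # Partial derivatives of the cost L = x^2 + y*z.
    L_x, L_y, L_z = 2.0 * x, z, y
    # Partial derivatives of F and G.
    F_x, F_y, F_z = 2.0, 1.0, -1.0
    G_x, G_y, G_z = 1.0, 3.0, -2.0
    # Adjoint system: the dx and dy brackets must vanish.
    #   F_x*lam + G_x*gam = -L_x
    #   F_y*lam + G_y*gam = -L_y
    lam, gam = solve_2x2(F_x, G_x, F_y, G_y, -L_x, -L_y)
    # Remaining dz bracket gives the total derivative.
    return L_z + F_z * lam + G_z * gam


total_derivative = adjoint_gradient(1.0)
```

Only one linear solve in the adjoint variables is needed, regardless of how many parameters $z$ the cost depends on.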

Proof of Theorem 3.

The partial derivatives are obtained by using the perturbed discrete minimization problems defined by Eqs.24. We first notice that $\nabla_{x}\min_{y\in Y}\langle x,y\rangle=\arg\min_{y\in Y}\langle x,y\rangle$. This result follows from the fact that $\min_{y\in Y}\langle x,y\rangle=\langle x,y^{*}\rangle$, where $y^{*}=\arg\min_{y\in Y}\langle x,y\rangle$, and from applying the gradient w.r.t. the continuous variable $x$; Eqs.23 are the expected functions of the perturbed minimization problems. Thus, if we compute the gradient of the perturbed minimizer, we obtain the optimal solution, properly scaled by the inner product matrix. For example, $\nabla_{x}\tilde{\Phi}_{\eta}=Ax^{*}(z,y)$, with $A$ the inner product matrix. To compute the variation on the two parameter variables, we have that $\mathrm{d}L=\nabla_{x}L\mathrm{d}x+\nabla_{y}L\mathrm{d}y+\nabla_{z}L\mathrm{d}z+\nabla_{w}L\mathrm{d}w$ and that $\mathrm{d}w/\mathrm{d}z=0,\mathrm{d}z/\mathrm{d}w=0$ from the dependence diagram of Fig.5. ∎
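The identity $\nabla_{x}\min_{y\in Y}\langle x,y\rangle=\arg\min_{y\in Y}\langle x,y\rangle$ used at the start of the proof can be verified numerically on a small finite set. The feasible set `Y` below is a hypothetical example, and the check holds at points where the minimizer is unique.

```python
# Hypothetical finite feasible set of binary vectors.
Y = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 0)]


def inner(x, y):
    return sum(a * b for a, b in zip(x, y))


def argmin_y(x):
    """Discrete minimizer of the linear cost <x, y> over Y."""
    return min(Y, key=lambda y: inner(x, y))


def min_value(x):
    return inner(x, argmin_y(x))


def numerical_gradient(x, h=1e-6):
    """Central finite differences of x -> min_y <x, y>."""
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grad.append((min_value(x_plus) - min_value(x_minus)) / (2.0 * h))
    return grad


x = [0.5, -1.0, 2.0]
y_star = argmin_y(x)
grad = numerical_gradient(x)
```

Because the minimizer is piecewise constant in $x$, its own derivative is zero almost everywhere, which is why the perturbed formulation is needed to obtain an informative gradient.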

A.4 Gradient Estimation based on perturbation

We can also estimate the gradient with the perturbation approach proposed in (Berthet et al. 2020). We thus have

[TABLE]

and

[TABLE]

while

[TABLE]

which are valid under the conditions of (Berthet et al. 2020), while $\tau$ and $\mu$ are hyper-parameters.
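To give an intuition for the perturbation approach, the sketch below estimates a perturbed minimizer $\mathbb{E}[\arg\min_{y\in Y}\langle x+\epsilon Z,y\rangle]$ with Gaussian noise $Z$ by Monte Carlo sampling, which is the basic building block in (Berthet et al. 2020). It is not a reproduction of the estimators above: the noise scale `eps`, the sample count, the seed and the feasible set are all hypothetical choices made for illustration.

```python
import random


def perturbed_argmin(x, feasible, eps=0.5, samples=2000, seed=0):
    """Monte Carlo estimate of E[argmin_y <x + eps*Z, y>], Z ~ N(0, I)."""
    rng = random.Random(seed)
    mean = [0.0] * len(x)
    for _ in range(samples):
        noisy = [xi + eps * rng.gauss(0.0, 1.0) for xi in x]
        # Solve the discrete problem for the perturbed cost vector.
        y = min(feasible, key=lambda c: sum(a * b for a, b in zip(noisy, c)))
        for i, yi in enumerate(y):
            mean[i] += yi / samples
    return mean


one_hot = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
smoothed = perturbed_argmin([0.0, 1.0, 2.0], one_hot)
exact = perturbed_argmin([0.0, 1.0, 2.0], one_hot, eps=0.0, samples=10)
```

With `eps=0.0` the estimate collapses to the exact discrete minimizer, while a positive `eps` spreads mass over near-optimal solutions and makes the output vary smoothly with `x`.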

A.5 Alternative derivation

Let us consider the problem $\min_{x\in K}\langle z,x\rangle_{A}$ and let us define $\Omega_{x}$ as a penalty term that ensures $x\in K$. We can define the generalized Lagrangian $\mathbb{L}(z,x,\Omega)=\langle z,x\rangle_{A}+\Omega_{x}$. Examples are $\Omega_{x}=\lambda^{T}|x-K(x)|$ or $\Omega_{x}=-\ln{|x-K(x)|}$, where $K(x)$ is the projection onto $K$. To solve the Lagrangian, we solve the unconstrained problem $\min_{x}\max_{\Omega_{x}}\mathbb{L}(z,x,\Omega_{x})$. At the optimal point $\nabla_{x}\mathbb{L}=0$. Let us define $F=\nabla_{x}\mathbb{L}=A^{T}z+\Omega_{x}^{\prime}$, then $\nabla_{x}F=\Omega_{x}^{\prime\prime}$ and $\nabla_{z}F=A^{T}$. If we have $F(x,z)=0$ and a cost function $L(x,z)$, we can compute $\mathrm{d}_{z}L=\nabla_{z}L-\nabla_{x}L\nabla_{x}^{-1}F\nabla_{z}F$. Since $F(x,z,\Omega_{x})=0$, we can apply the previous result and obtain $\mathrm{d}_{z}L=\nabla_{z}L-\nabla_{x}L\Omega_{x}^{\prime\prime-1}A^{T}$. If we assume $\Omega_{x}^{\prime\prime}=I$ and $\nabla_{z}L=0$, then $\mathrm{d}_{z}L=-A\nabla_{x}L$.

A.6 Memory Efficiency

For continuous optimization programming, by separating the computation of the solution from the computation of the gradient around the current solution, we 1) compute the gradient more efficiently; in particular, we compute second-order gradients using the vector-Jacobian product (pull-back operator) formulation, without explicitly building or inverting the Jacobian or Hessian matrices; 2) can use more advanced and non-differentiable solution techniques to solve the bilevel optimization problem, which would be difficult to integrate using automatically differentiable operations. Using VJP we reduce memory use from $O(n^{2})$ to $O(n)$. Indeed, using an iterative solver such as the generalized minimal residual method (GMRES) (Saad and Schultz 1986), we only need to evaluate the gradients of Eq.5, computing matrix-vector products without inverting or materializing the large matrix. Similarly, we use the Conjugate Gradient (CG) method to compute Eq.4, which only requires evaluating the gradient at the current solution, without inverting or materializing the Jacobian matrix. An implementation of a bilevel solver would have a memory complexity of $O(Tn)$, where $T$ is the number of iterations of the bilevel algorithm.
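The memory argument can be made concrete with a matrix-free Conjugate Gradient solve that only touches the Hessian through products with a vector. This is a minimal sketch under stated assumptions: the quadratic inner objective, the point `y_current` and the right-hand side are hypothetical, and the Hessian-vector product is approximated by a central difference of the gradient, where an automatic differentiation framework would supply the product directly.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def grad_g(y):
    # Gradient of the hypothetical inner objective
    # g(y) = 2*y0^2 + y0*y1 + 1.5*y1^2, whose Hessian is [[4, 1], [1, 3]].
    return [4.0 * y[0] + y[1], y[0] + 3.0 * y[1]]


def hessian_vector_product(y, v, h=1e-4):
    # Central difference of the gradient along v; the Hessian is never built.
    g_plus = grad_g([yi + h * vi for yi, vi in zip(y, v)])
    g_minus = grad_g([yi - h * vi for yi, vi in zip(y, v)])
    return [(a - b) / (2.0 * h) for a, b in zip(g_plus, g_minus)]


def conjugate_gradient(matvec, b, max_iter=20, tol=1e-18):
    """Solve H s = b for symmetric positive definite H given only s -> H s."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - H x for the initial guess x = 0
    p = list(b)          # search direction
    rs = dot(r, r)
    for _ in range(max_iter):
        if rs < tol:
            break
        Hp = matvec(p)
        alpha = rs / dot(p, Hp)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * hi for ri, hi in zip(r, Hp)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


y_current = [0.3, -0.2]
solution = conjugate_gradient(
    lambda v: hessian_vector_product(y_current, v), [1.0, 2.0])
```

Every quantity stored during the solve is a vector of length $n$, so memory grows linearly in $n$ rather than quadratically.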

A.7 Experimental Setup and Computational Resources

For the Optimal Control with adversarial disturbance, we follow a setup similar to (Agrawal et al. 2019a), where we added the adversarial noise as described in the experiments. For the Combinatorial Optimization, we follow the setup of (Pogančić et al. 2019). The dataset is generated by solving the bilevel problem on the same data as (Pogančić et al. 2019). For section 5.3, we use the Warcraft II terrain tiles and generate the optimal bilevel solution with the correct parameters $(z,w)$, where $z$ is the terrain transit cost and $w$ is the interdiction cost, set to the constant $1$ in our experiment. $X$ is the set of all feasible interdictions; in our experiment we allow the maximum number of interdictions to be $B$. For section 5.4, on the other hand, $z$ represents the true distances among cities and $w$ is a matrix of interdiction costs, both unknown to the model. $X$ is the set of all possible interdictions. In these experiments, we solved the bilevel problem using the min-max-min algorithm (Kämmerling and Kurtz 2020). For the Adversarial Attack, we used two convolutional layers with max-pooling and ReLU activation layers, followed by the discrete layer of size $m=2024$, with $B=100$ and $Q=0,10$. A final linear classification layer is used to classify CIFAR-10. We use $3$ runs of $50$ epochs each, a learning rate of $3\times 10^{-4}$ and the Adam optimizer. Experiments were conducted on a standard server with 8 CPUs, 64GB of RAM and a GeForce RTX 2080 GPU with 6GB of RAM.

A.8 Jacobian-Vector and Vector-Jacobian Products

The Jacobian-Vector Product (JVP) is the operation that computes the directional derivative $J_{f}(x)u$, with direction $u\in\mathbb{R}^{m}$, of the multi-dimensional operator $f:\mathbb{R}^{m}\to\mathbb{R}^{n}$, with respect to $x\in\mathbb{R}^{m}$, where $J_{f}(x)$ is the Jacobian of $f$ evaluated at $x$. On the other hand, the Vector-Jacobian Product (VJP) operation, with direction $v\in\mathbb{R}^{n}$, computes the adjoint directional derivative $v^{T}J_{f}(x)$. JVP and VJP are the essential ingredients of automatic differentiation (Elliott 2018; Baydin et al. 2018).
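The two products can be illustrated on a small map with a hand-written Jacobian. The function $f:\mathbb{R}^{2}\to\mathbb{R}^{3}$ below is a hypothetical example, and the Jacobian is formed explicitly only for clarity; automatic differentiation frameworks compute the same products without building it.

```python
def jacobian(x):
    # Jacobian of the hypothetical map
    # f(x) = [x0*x1, x0 + x1^2, 3*x0], with rows = outputs, cols = inputs.
    return [[x[1], x[0]],
            [1.0, 2.0 * x[1]],
            [3.0, 0.0]]


def jvp(x, u):
    """Jacobian-vector product J_f(x) u, with u in R^2; result in R^3."""
    return [sum(j * ui for j, ui in zip(row, u)) for row in jacobian(x)]


def vjp(x, v):
    """Vector-Jacobian product v^T J_f(x), with v in R^3; result in R^2."""
    J = jacobian(x)
    return [sum(v[i] * J[i][k] for i in range(len(J)))
            for k in range(len(J[0]))]


x = [2.0, 3.0]
u = [1.0, -1.0]
v = [1.0, 0.0, 2.0]
forward = jvp(x, u)     # directional derivative along u
backward = vjp(x, v)    # adjoint directional derivative along v
```

The two results are linked by the adjoint identity $\langle v,J_{f}(x)u\rangle=\langle v^{T}J_{f}(x),u\rangle$, which is a convenient consistency check for either implementation.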

Bibliography

Agrawal, A.; Amos, B.; Barratt, S.; Boyd, S.; Diamond, S.; and Kolter, Z. 2019a. Differentiable convex optimization layers. arXiv:1910.12430.
Agrawal, A.; Barratt, S.; Boyd, S.; Busseti, E.; and Moursi, W. M. 2019b. Differentiating through a cone program. arXiv:1904.09043.
Alesiani, F.; Yu, S.; Shaker, A.; and Yin, W. 2020. Towards Interpretable Multi-Task Learning Using Bilevel Programming. arXiv:2009.05483.
Amos, B. 2019. Differentiable optimization-based modeling for machine learning. Ph.D. thesis, Carnegie Mellon University.
Amos, B.; and Kolter, J. Z. 2017. Optnet: Differentiable optimization as a layer in neural networks. In International Conference on Machine Learning, 136–145. PMLR.
Bai, S.; Kolter, J. Z.; and Koltun, V. 2019. Deep equilibrium models. arXiv:1909.01377.
Baydin, A. G.; Pearlmutter, B. A.; Radul, A. A.; and Siskind, J. M. 2018. Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research, 18.
Ben-Tal, A.; El Ghaoui, L.; and Nemirovski, A. 2009. Robust optimization. Princeton University Press.
