Mixed-Variable Bayesian Optimization

Erik Daxberger; Anastasia Makarova; Matteo Turchetta; Andreas Krause

arXiv:1907.01329·cs.LG·August 5, 2020

Mixed-Variable Bayesian Optimization

Erik Daxberger, Anastasia Makarova, Matteo Turchetta, Andreas Krause

PDF

TL;DR

MiVaBo is a novel Bayesian optimization algorithm designed for efficient optimization of mixed-variable functions with discrete constraints, demonstrating superior sample efficiency in hyperparameter tuning tasks.

Contribution

Introduces MiVaBo, the first BO method capable of handling complex discrete constraints in mixed-variable domains with convergence guarantees.

Findings

01

MiVaBo outperforms existing methods in hyperparameter tuning tasks.

02

Provides the first convergence analysis for mixed-variable BO.

03

Efficiently handles complex discrete constraints in optimization.

Abstract

The optimization of expensive to evaluate, black-box, mixed-variable functions, i.e. functions that have continuous and discrete inputs, is a difficult and yet pervasive problem in science and engineering. In Bayesian optimization (BO), special cases of this problem that consider fully continuous or fully discrete domains have been widely studied. However, few methods exist for mixed-variable domains and none of them can handle discrete constraints that arise in many real-world applications. In this paper, we introduce MiVaBo, a novel BO algorithm for the efficient optimization of mixed-variable functions combining a linear surrogate model based on expressive feature representations with Thompson sampling. We propose an effective method to optimize its acquisition function, a challenging problem for mixed-variable domains, making MiVaBo the first BO method that can handle complex…

Tables4

Table 1. Table 1 : Hyperparameters of the XGBoost algorithm. 10 parameters, 7 of which are continuous, and 3 of which are discrete.

Name	Type	Domain
booster	discr.	[’gbtree’, ’gblinear’]
nrounds	discr.	$[3, 5000]$
alpha	contin.	$[0.000985, 1009.209690]$
lambda	contin.	$[0.000978, 999.020893]$
colsample_bylevel	contin.	$[0.046776, 0.998424]$
colsample_bytree	contin.	$[0.062528, 0.999640]$
eta	contin.	$[0.000979, 0.995686]$
max_depth	discr.	$[1, 15]$
min_child_weight	contin.	$[1.012169, 127.041806]$
subsample	contin.	$[0.100215, 0.999830]$

Table 2. Table 2 : Hyperparameters of the VAE. The architecture of the VAE (if all layers are enabled) is C1-C2-F1-F2-z-F3-F4-D1-D2 , with C denoting a convolutional (conv.) layer, F a fully-connected (fc.) layer, D a deconvolutional (deconv.) layer and z the latent space. Layers F2 and F3 have fixed sizes of 2 d z 2 subscript 𝑑 𝑧 2d_{z} and d z subscript 𝑑 𝑧 d_{z} units respectively, where d z subscript 𝑑 𝑧 d_{z} denotes the dimensionality of the latent space z . The domain of the number of units of the fc. layers F1 and F4 is discretized with a step size of 64, i.e. [ 0 , 64 , 128 , … , 832 , 896 , 960 ] 0 64 128 … 832 896 960 [0,64,128,\ldots,832,896,960] , denoted by [ 0 … 960 ] delimited-[] 0 … 960 [0\ldots 960] in the table for brevity. For d z subscript 𝑑 𝑧 d_{z} , the domain [ 16 … 64 ] delimited-[] 16 … 64 [16\ldots 64] refers to all integers within that interval.

#	Name	Type	Domain	Bits
1	Number of conv. layers in encoder	discrete	[0,1,2]	2
	Parameters of C1
2	Number of channels of C1	discrete	[4,8,16,24]	2
3	Stride of C1	discrete	[1,2]	1
4	Filter size of C1	discrete	[3,5]	1
5	Padding of C1	discrete	[0,1,2,3]	2
	Parameters of C2
6	Number of channels of C2	discrete	[8,16,32,48]	2
7	Stride of C2	discrete	[1,2]	1
8	Filter size of C2	discrete	[3,5]	1
9	Padding of C2	discrete	[0,1,2,3]	2
10	Number of fc. layers in encoder	discrete	[0,1,2]	2
11	Number of units of F1	discrete	[0…960]	4
12	Dimensionality $d_{z}$ of z	discrete	[16…64]	6
13	Number of fc. layers in decoder	discrete	[0,1,2]	2
14	Number of units of F4	discrete	[0…960]	4
15	Number of deconv. layers in decoder	discrete	[0,1,2]	2
	Parameters of D1
16	Number of channels of D1	discrete	[8,16,32,48]	2
17	Stride of D1	discrete	[1,2]	1
18	Filter size of D1	discrete	[3,5]	1
19	Padding of D1	discrete	[0,1,2,3]	2
20	Output padding of D1	discrete	[0,1,2,3]	2
	Parameters of D2
21	Number of channels of D2	discrete	[4,8,16,24]	2
22	Stride of D2	discrete	[1,2]	1
23	Filter size of D2	discrete	[3,5]	1
24	Padding of D2	discrete	[0,1,2,3]	2
25	Output padding of D2	discrete	[0,1,2,3]	2
26	Learning rate	continuous	$[10^{- 4}, 10^{- 2}]$	-
27	Learning rate decay factor	continuous	$[0.5, 1.0]$	-
28	Weight decay regularization	continuous	$[10^{- 6}, 10^{- 2}]$	-
	Total			50

Table 3. Table 3 : Mean plus/minus one standard deviation of the negative test log-likelihood over 8 random initializations, achieved by the best VAE configuration found by SMAC, TPE and GPyOpt after 16 BO iterations, for constraint violation penalties of 500, 250 and 125 nats. Performance values of MiVaBo and random search (which are not affected by the penalty) are included for reference.

Algorithm	Penalty (nats)
	500	250	125
SMAC	$113.0 \pm 1.8$	$112.1 \pm 1.8$	$111.1 \pm 1.6$
TPE	$108.8 \pm 1.2$	$108.1 \pm 1.3$	$108.1 \pm 1.3$
GPyOpt	$108.5 \pm 1.1$	$108.5 \pm 0.6$	$106.5 \pm 1.4$
RS			$106.3 \pm 0.9$
MiVaBo			$94.4 \pm 0.8$

Table 4. Table 4 : Mean plus/minus one standard deviation of the number of constraint violations by SMAC, TPE, GPyOpt and random search within 16 BO iterations over 8 random initializations, for constraint violation penalties of 500, 250 and 125 nats.

Algorithm	Penalty (nats)
	500	250	125
SMAC	$37 \pm 21.7$	$36 \pm 21.9$	$28 \pm 11.6$
TPE	$67 \pm 21.3$	$68 \pm 22.2$	$68 \pm 22.2$
GPyOpt	$36 \pm 19.3$	$32 \pm 18.0$	$27 \pm 10.4$
Random search	$71 \pm 25.5$	$71 \pm 25.5$	$71 \pm 25.5$

Equations58

min_{x \in X} f (x) s.t. g^{c} (x) \geq 0, g^{d} (x) \geq 0,

min_{x \in X} f (x) s.t. g^{c} (x) \geq 0, g^{d} (x) \geq 0,

f (x) = \sum_{j \in {d, c, m}} w^{j}^{⊤} ϕ^{j} (x^{j})

f (x) = \sum_{j \in {d, c, m}} w^{j}^{⊤} ϕ^{j} (x^{j})

W_{out}^{e} = (W_{in}^{e} - F^{e} + P^{e}) / S^{e} + 1 \in N

W_{out}^{e} = (W_{in}^{e} - F^{e} + P^{e}) / S^{e} + 1 \in N

W_{out1}^{e} = (28 - F_{1}^{e} + P_{1}^{e}) / S_{1}^{e} + 1

W_{out1}^{e} = (28 - F_{1}^{e} + P_{1}^{e}) / S_{1}^{e} + 1

W_{out2}^{e} = (W_{out1}^{e} - F_{2}^{e} + P_{2}^{e}) / S_{2}^{e} + 1

W_{out}^{d} = (W_{in}^{d} - 1) \times S^{d} + F^{d} - 2 P^{d} + O^{d}

W_{out}^{d} = (W_{in}^{d} - 1) \times S^{d} + F^{d} - 2 P^{d} + O^{d}

W_{out}^{d} = (W_{in}^{d} - 1) \times S_{1}^{d} + F_{1}^{d} - 2 P_{1}^{d} + O_{1}^{d},

W_{out}^{d} = (W_{in}^{d} - 1) \times S_{1}^{d} + F_{1}^{d} - 2 P_{1}^{d} + O_{1}^{d},

28 = (W_{out}^{d} - 1) \times S_{2}^{d} + F_{2}^{d} - 2 P_{2}^{d} + O_{2}^{d},

MAP (θ) = x max ⎩ ⎨ ⎧ θ^{d} (x^{d}) + θ^{c} (x^{c}) + f \in F \sum θ_{f}^{m} (x_{f}^{d}, x_{f}^{c}) ⎭ ⎬ ⎫

MAP (θ) = x max ⎩ ⎨ ⎧ θ^{d} (x^{d}) + θ^{c} (x^{c}) + f \in F \sum θ_{f}^{m} (x_{f}^{d}, x_{f}^{c}) ⎭ ⎬ ⎫

x, x^{F} max ⎩ ⎨ ⎧ θ^{d} (x^{d}) + θ^{c} (x^{c}) + f \in F \sum θ_{f}^{m} (x_{f}^{df}, x_{f}^{c f}) ⎭ ⎬ ⎫

x, x^{F} max ⎩ ⎨ ⎧ θ^{d} (x^{d}) + θ^{c} (x^{c}) + f \in F \sum θ_{f}^{m} (x_{f}^{df}, x_{f}^{c f}) ⎭ ⎬ ⎫

x_{i}^{df} = x_{i}^{d}, \forall f \in F, i \in f^{d}

x_{j}^{c f} = x_{j}^{c}, \forall f \in F, j \in f^{c}

L (λ, x, x^{F}) = θ^{d} (x^{d}) + θ^{c} (x^{c}) + f \in F \sum θ_{f}^{m} (x_{f}^{df}, x_{f}^{c f})

L (λ, x, x^{F}) = θ^{d} (x^{d}) + θ^{c} (x^{c}) + f \in F \sum θ_{f}^{m} (x_{f}^{df}, x_{f}^{c f})

+ f \in F \sum i \in f^{d} \sum λ_{f i}^{d} (x_{i}^{d} - x_{i}^{df}) + f \in F \sum j \in f^{c} \sum λ_{f j}^{c} (x_{j}^{c} - x_{j}^{c f})

= θ^{d} (x^{d}) + f \in F \sum i \in f^{d} \sum λ_{f i}^{d} x_{i}^{d}

+ θ^{c} (x^{c}) + f \in F \sum j \in f^{c} \sum λ_{f j}^{c} x_{j}^{c}

+ f \in F \sum θ_{f}^{m} (x_{f}^{df}, x_{f}^{c f}) - i \in f^{d} \sum λ_{f i}^{d} x_{i}^{df} - j \in f^{c} \sum λ_{f j}^{c} x_{j}^{c f} .

x, x^{F} max L (λ, x, x^{F})

x, x^{F} max L (λ, x, x^{F})

x_{i}^{df} = x_{i}^{d}, \forall f \in F, i \in f^{d}

x_{j}^{c f} = x_{j}^{c}, \forall f \in F, j \in f^{c}

L (λ) = x, x^{F} max L (λ, x, x^{F})

L (λ) = x, x^{F} max L (λ, x, x^{F})

= x^{d} max θ^{d} (x^{d}) + f \in F \sum i \in f^{d} \sum λ_{f i}^{d} x_{i}^{d}

+ x^{c} max θ^{c} (x^{c}) + f \in F \sum j \in f^{c} \sum λ_{f j}^{c} x_{j}^{c}

+ f \in F \sum x_{f}^{df}, x_{f}^{c f} max θ_{f}^{m} (x_{f}^{df}, x_{f}^{c f}) - i \in f^{d} \sum λ_{f i}^{d} x_{i}^{df} - j \in f^{c} \sum λ_{f j}^{c} x_{j}^{c f}

L (λ) = x^{d} max θ^{d} (x^{d}) + f \in F \sum i \in f^{d} \sum λ_{f i}^{d} x_{i}^{d}

L (λ) = x^{d} max θ^{d} (x^{d}) + f \in F \sum i \in f^{d} \sum λ_{f i}^{d} x_{i}^{d}

+ x^{c} max θ^{c} (x^{c}) + f \in F \sum j \in f^{c} \sum λ_{f j}^{c} x_{j}^{c}

+ f \in F \sum x_{f}^{d}, x_{f}^{c} max θ_{f}^{m} (x_{f}^{d}, x_{f}^{c}) - i \in f^{d} \sum λ_{f i}^{d} x_{i}^{d} - j \in f^{c} \sum λ_{f j}^{c} x_{j}^{c}

\overset{ˉ}{θ}_{d}^{λ} (x^{d})

\overset{ˉ}{θ}_{d}^{λ} (x^{d})

= θ^{d} (x^{d}) + f \in F \sum λ_{f}^{d} x_{∣ f}^{d}

\overset{ˉ}{θ}_{c}^{λ} (x^{c})

= θ^{c} (x^{c}) + f \in F \sum λ_{f}^{c} x_{∣ f}^{c}

\overset{ˉ}{θ}_{f}^{λ} (x_{f}^{d}, x_{f}^{c})

= θ_{f}^{m} (x_{f}^{d}, x_{f}^{c}) - λ_{f}^{d} x_{f}^{d} - λ_{f}^{c} x_{f}^{c}

L (λ) = x^{d} max \overset{ˉ}{θ}_{d}^{λ} (x^{d}) + x^{c} max \overset{ˉ}{θ}_{c}^{λ} (x^{c}) + f \in F \sum x_{f}^{d}, x_{f}^{c} max \overset{ˉ}{θ}_{f}^{λ} (x_{f}^{d}, x_{f}^{c}) .

L (λ) = x^{d} max \overset{ˉ}{θ}_{d}^{λ} (x^{d}) + x^{c} max \overset{ˉ}{θ}_{c}^{λ} (x^{c}) + f \in F \sum x_{f}^{d}, x_{f}^{c} max \overset{ˉ}{θ}_{f}^{λ} (x_{f}^{d}, x_{f}^{c}) .

MAP (θ) \leq L (λ) .

MAP (θ) \leq L (λ) .

λ min L (λ)

λ min L (λ)

s^{d} (λ)

s^{d} (λ)

s^{c} (λ)

s^{f} (λ)

L (λ) = s^{d} (λ) + s^{c} (λ) + f \in F \sum s^{f} (λ) .

L (λ) = s^{d} (λ) + s^{c} (λ) + f \in F \sum s^{f} (λ) .

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

Mixed-Variable Bayesian Optimization

Erik Daxberger111Equal contribution. †Work done while at ETH Zurich. ,†,1,2

Anastasia Makarova*∗,3*

Matteo Turchetta2,3&Andreas Krause3 1Department of Engineering, University of Cambridge

2Max Planck Institute for Intelligent Systems, Tübingen

3Department of Computer Science, ETH Zurich

[email protected],{anmakaro,matteo.turchetta,krausea}@inf.ethz.ch

Abstract

The optimization of expensive to evaluate, black-box, mixed-variable functions, i.e. functions that have continuous and discrete inputs, is a difficult and yet pervasive problem in science and engineering. In Bayesian optimization (BO), special cases of this problem that consider fully continuous or fully discrete domains have been widely studied. However, few methods exist for mixed-variable domains and none of them can handle discrete constraints that arise in many real-world applications. In this paper, we introduce MiVaBo, a novel BO algorithm for the efficient optimization of mixed-variable functions combining a linear surrogate model based on expressive feature representations with Thompson sampling. We propose an effective method to optimize its acquisition function, a challenging problem for mixed-variable domains, making MiVaBo the first BO method that can handle complex constraints over the discrete variables. Moreover, we provide the first convergence analysis of a mixed-variable BO algorithm. Finally, we show that MiVaBo is significantly more sample efficient than state-of-the-art mixed-variable BO algorithms on several hyperparameter tuning tasks, including the tuning of deep generative models.

1 Introduction

Bayesian optimization (BO) movckus1975 is a well-established paradigm to optimize costly-to-evaluate, complex, black-box objectives that has been successfully applied to many scientific domains. Most of the existing BO literature focuses on objectives that have purely continuous domains, such as those arising in tuning of continuous hyperparameters of machine learning algorithms, recommender systems, and preference learning shahriari16. More recently, problems with purely discrete domains, such as food safety control and model-sparsification in multi-component systems baptista2018 have been considered.

However, many real-world optimization problems in science and engineering are of mixed-variable nature, involving both continuous and discrete input variables, and exhibit complex constraints. For example, tuning the hyperparameters of a convolutional neural network involves both continuous variables, e.g., learning rate and momentum, and discrete ones, e.g., kernel size, stride and padding. Also, these hyperparameters impose validity constraints, as some combinations of kernel size, stride and padding define invalid networks. Further examples of mixed-variable, potentially constrained, optimization problems include sensor placement krause2008, drug discovery negoescu2011, optimizer configuration hutter2011 and many others. Nonetheless, only few BO methods can address the unconstrained version of such problem and no existing method can handle the constrained one. This work introduces the first algorithm that can efficiently optimize mixed-variable functions subject to known constraints with provable convergence guarantees.

Related Work.

Extending continuous BO methods shahriari16 to mixed inputs requires ad-hoc relaxation methods to map the problem to a fully continuous one and rounding methods to map the solution back. This ignores the original domain structure, makes the solution quality dependent on the relaxation and rounding methods, and makes it hard to handle discrete constraints. Extending discrete BO methods baptista2018; oh2019 to mixed inputs requires a discretization of the continuous domain part, the granularity of which is crucial: If it is too small, the domain becomes prohibitively large; if it is too large, the domain may only contain poorly performing values of the continuous inputs. Few BO methods address the mixed-variable setting. SMAC hutter2011 uses a random forest surrogate model. However, its frequentist uncertainty estimates may be too inaccurate to steer the sampling. TPE bergstra2011a uses kernel density estimation to find inputs that will likely improve upon and unlikely perform worse than the incumbent solution. While SMAC and TPE can handle hierarchical constraints, they cannot handle more general constraints over the discrete variables, e.g., cardinality constraints. They also lack convergence guarantees. Hyperband (HB) li2016 uses cheap but less accurate approximations of the objective to dynamically allocate resources for function evaluations. BOHB falkner2018 is the model-based counterpart of HB, based on TPE. They thus extend existing mixed-variable methods to the multi-fidelity setting rather than proposing new ones, which is complementary to our approach, rather than in competition with it. garrido2018 propose a Gaussian process kernel to model discrete inputs without rounding bias. Their method lacks guarantees and cannot handle discrete constraints. We instead use discrete optimizers for the acquisition function, which avoid bias by only making integer evaluations. Finally, while hernandez2015c; gardner2014; sui2015safe extend continuous BO methods to handle unknown constraints, no method can handle known discrete constraints in a mixed-variable domain.

Contributions.

We introduce MiVaBo, the first BO algorithm for efficiently optimizing mixed-variable functions subject to known linear and quadratic integer constraints, encompassing many of the constraints present in real-world domains (e.g. cardinality, budget and hierarchical constraints). It relies on a linear surrogate model that decouples the continuous, discrete and mixed components of the function using an expressive feature expansion (Sec. 3.1). We exploit the ability of this model to efficiently draw samples from the posterior over the objective (Sec. 3.2) by combining it with Thompson sampling, and show how to optimize the resulting constrained acquisition function (Sec. 3.3). While in continuous BO, optimizing the acquisition function is difficult but has well-established solutions, this is not true for mixed-variable spaces and doing this efficiently and accurately is a key challenge that hugely impacts the algorithm’s performance. We also provide the first convergence analysis of a mixed-variable BO algorithm (Sec. 3.5). Finally, we demonstrate the effectiveness of MiVaBo on a set of complex hyperparameter tuning tasks, where it outperforms state-of-the-art methods and is competitive with human experts (Sec. 4).

2 Problem Statement

We consider the problem of optimizing an unknown, costly-to-evaluate function defined over a mixed-variable domain, accessible through noisy evaluations and subject to known linear and quadratic constraints. Formally, we aim to solve

[TABLE]

where $\mathcal{X}\subseteq\mathcal{X}^{c}\times\mathcal{X}^{d}$ with continuous subspace $\mathcal{X}^{c}$ and discrete subspace $\mathcal{X}^{d}$ . Both constraints $g^{c}(\mathbf{x})\geq 0$ over $\mathcal{X}^{c}$ and $g^{d}(\mathbf{x})\geq 0$ over $\mathcal{X}^{d}$ are known, and specifically $g^{d}(\mathbf{x})$ are linear or quadratic. We assume, that the domain of the continuous inputs is box-constrained and can thus, w.l.o.g., be scaled to the unit hypercube, $\mathcal{X}^{c}=[0,1]^{D_{c}}$ . We further assume, w.l.o.g., that the discrete inputs are binary, i.e., vectors $\mathbf{x}^{d}\in\mathcal{X}^{d}=\{0,1\}^{D_{d}}$ are vertices of the unit hypercube. This representation can effectively capture the domain of any discrete function. For example, a vector $\mathbf{x}^{d}=[x^{d}_{i}]_{i=1}^{D_{d}}\in\mathcal{X}^{d}$ can encode a subset $A$ of a ground set of $D_{d}$ elements, such that $x^{d}_{i}=1\Leftrightarrow a_{i}\in A$ and $x^{d}_{i}=0\Leftrightarrow a_{i}\notin A$ , yielding a set function. Alternatively, $\mathbf{x}^{d}\in\mathcal{X}^{d}$ can be a binary encoding of integer variables, yielding a function defined over integers.

Background.

BO algorithms are iterative black-box optimization methods which, at every step $t$ , select an input $\mathbf{x}_{t}\in\mathcal{X}$ and observe a noise-perturbed output $y_{t}\triangleq f(\mathbf{x}_{t})+\epsilon$ with $\epsilon\overset{\text{iid}}{\sim}\mathcal{N}(0,\beta^{-1})$ , $\beta>0$ . As evaluating $f$ is costly, the goal is to query inputs based on past observations to find a global minimizer $\mathbf{x}_{*}\in\operatorname{arg\,min}_{\mathbf{x}\in\mathcal{X}}f(\mathbf{x})$ as efficiently and accurately as possible. To this end, BO algorithms leverage two components: (i) a probabilistic function model (or surrogate), that encodes the belief about $f$ based on the observations available, and (ii) an acquisition function $\alpha:\mathcal{X}\rightarrow\mathbb{R}$ that expresses the informativeness of input $\mathbf{x}$ about the location of $\mathbf{x}_{*}$ , given the surrogate of $f$ . Based on the model of $f$ , we query the best input measured by the acquisition function, then update the model with the observation and repeat this procedure. The goal of the acquisition function is to simultaneously learn about inputs that are likely to be optimal and about poorly explored regions of the input space, i.e., to trade-off exploitation against exploration. Thus, BO reduces the original hard black-box optimization problem to a series of cheaper problems $\mathbf{x}_{t}\in\operatorname{arg\,max}_{\mathbf{x}\in\mathcal{X}}\alpha_{t}(\mathbf{x})$ . However, in our case, these optimization problems involve mixed variables and exhibit linear and quadratic constraints and are thus still challenging. We now present MiVaBo, an algorithm to efficiently solve the optimization problem in Eq. 1.

3 MiVaBo Algorithm

We first introduce the linear model used to represent the objective (Sec. 3.1) and describe how to do inference with it (Sec. 3.2). We then show how to use Thompson sampling to query informative inputs (Sec. 3.3) and, finally, provide a bound on the regret incurred by MiVaBo. (Sec. 3.5).

3.1 Model

We propose a surrogate model that accounts for both discrete and continuous variables in a principled way, while balancing two conflicting goals: Model expressiveness versus feasibility of Bayesian inference and of the constrained optimization of the mixed-variable acquisition function. Linear models defined over non-linear feature mappings, $f(\mathbf{x})=\mathbf{w}^{\top}\boldsymbol{\phi}(\mathbf{x})$ , are a class of flexible parametric models that strike a good trade-off between model capacity, interpretability and ease of use through the definition of features $\boldsymbol{\phi}:\mathcal{X}\rightarrow\mathbb{R}^{M}$ . While the complexity of the model is controlled by the number of features, $M$ , its capacity depends on their definition. Therefore, to make the design of a set of expressive features more intuitive, we treat separately the contribution to the objective $f$ from the discrete part of the domain, from the continuous part of the domain, and from the interaction of the two,

[TABLE]

where, for $j\in\{d,c,m\}$ , $\boldsymbol{\phi}^{j}(\mathbf{x}^{j})=[\phi^{j}_{i}(\mathbf{x}^{j})]_{i=1}^{M_{j}}\in\mathbb{R}^{M_{j}}$ and $\mathbf{w}^{j}\in\mathbb{R}^{M_{j}}$ are the feature and weight vector for the $d$ iscrete, $c$ ontinuous and $m$ ixed function component, respectively.

In many real-world domains, a large set of features can be discarded a priori to simplify the design space. It is common practice in high-dimensional BO to assume that only low-order interactions between the variables contribute significantly to the objective, which was shown for many practical problems rolland2018; mutny2018, including deep neural network hyperparameter tuning hazan2017. Similarly, we focus on features defined over small subsets of the inputs. Formally, we consider $\boldsymbol{\phi}(\mathbf{x})=[\phi_{k}(\mathbf{x}_{k})]_{k=1}^{M}$ , where $\mathbf{x}_{k}$ is a subvector of $\mathbf{x}$ containing exclusively continuous or discrete variables or a mix of both. Thus, the objective $f(\mathbf{x})$ can be decomposed into a sum of low-dimensional functions $f_{k}(\mathbf{x}_{k})\triangleq w_{k}\phi_{k}(\mathbf{x}_{k})$ defined over subspaces $\mathcal{X}_{k}\subseteq\mathcal{X}$ with $\text{dim}(\mathcal{X}_{k})\ll\text{dim}(\mathcal{X})$ . This defines a generalized additive model rolland2018; hastie2017, where the same variable can be included in multiple subvectors/features. The complexity of this model is controlled by the effective dimensionality (ED) of the subspaces, which is crucial under limited computational resources. In particular, let $\bar{D}_{d}\triangleq\max_{k\in[M]}\text{dim}(\mathcal{X}^{d}_{k})$ denote the ED of the discrete component in Eq. 2, i.e. the dimensionality of the largest subspace that exclusively contains discrete variables. Analogously, $\bar{D}_{c}$ and $\bar{D}_{m}$ denote the EDs of the continuous and mixed component, respectively. Intuitively, the ED corresponds to the maximum order of the variable interactions present in $f$ . Then, the number of features $M\in\mathcal{O}\big{(}D_{d}^{\bar{D}_{d}}+D_{c}^{\bar{D}_{c}}+(D_{d}+D_{c})^{\bar{D}_{m}}\big{)}$ scales exponentially in the EDs only (as modeling up to $L$ -th order interactions of $N$ inputs requires $\sum_{l=0}^{L}\binom{N}{l}\in\mathcal{O}(N^{L})$ terms), which are usually small, even if the true dimensionality is large.

Discrete Features $\boldsymbol{\phi}^{d}$ .

We aim to define features $\phi^{d}$ that can effectively represent the discrete component of Eq. 2 as a linear function, which should generally be able to capture arbitrary interactions between the discrete variables. To this end, we consider all subsets $S$ of the discrete variables in $\mathcal{X}^{d}$ (or, equivalently, all elements $S$ of the powerset $2^{\mathcal{X}_{d}}$ of $\mathcal{X}_{d}$ ) and define a monomial $\prod_{j\in S}x^{d}_{j}$ for each subset $S$ (where for $S=\emptyset$ , $\prod_{j\in\emptyset}x^{d}_{j}=1$ ). We then form a weighted sum of all monomials to yield the multi-linear polynomial ${\mathbf{w}^{d}}^{\top}\boldsymbol{\phi}^{d}(\mathbf{x}^{d})=\sum_{S\in 2^{\mathcal{X}_{d}}}w_{S}\prod_{j\in S}x^{d}_{j}$ . This functional representation corresponds to the Fourier expansion of a pseudo-Boolean function (PBF) boros2002. In practice, an exponential number of features can be prohibitively expensive and may lead to high-variance estimators as in BO one typically does not have access to enough data to robustly fit a large model. Alternatively, baptista2018; hazan2017 empirically found that a second-order polynomial in the Fourier basis provides a practical balance between expressiveness and efficiency, even when the true function is of higher order. In our model, we also consider quadratic PBFs, ${\mathbf{w}^{d}}^{\top}\boldsymbol{\phi}^{d}(\mathbf{x}^{d})=w_{\emptyset}+\sum_{i=1}^{n}w_{\{i\}}x^{d}_{i}+\sum_{1\leq i<j\leq n}w_{\{i,j\}}x^{d}_{i}x^{d}_{j}$ , which induces the discrete feature representation $\boldsymbol{\phi}^{d}(\mathbf{x}^{d})\triangleq[1,\{x^{d}_{i}\}_{i=1}^{D_{d}},\{x^{d}_{i}x^{d}_{j}\}_{1\leq i<j\leq D_{d}}]^{\top}$ and reduces the number of model weights to $M_{d}\in\mathcal{O}(D_{d}^{2})$ .

Continuous Features $\boldsymbol{\phi}^{c}$ .

In BO over continuous spaces, most approaches are based on Gaussian process (GP) models williams2006gaussian due to their flexibility and ability to capture large classes of continuous functions. To fit our linear model formulation, we leverage GPs’ expressiveness by modeling the continuous part of our model in Eq. 2 using feature expansions $\boldsymbol{\phi}^{c}(\mathbf{x}^{c})$ that result in a finite linear approximation of a GP. One simple, yet theoretically sound, choice is the class of Random Fourier Features (RFFs) rahimi2008, which use Monte Carlo integration for a randomized approximation of a GP. Alternatively, one can use Quadrature Fourier Features mutny2018, which instead use numerical integration for a deterministic approximation, which is particularly effective for problems with low effective dimensionality. Both feature classes were successfully used in BO jenatton2017; mutny2018. In our experiments, we use RFFs approximating a GP with a squared exponential kernel, which we found to best trade off complexity vs. accuracy in practice.

Mixed Features $\boldsymbol{\phi}^{m}$ .

The mixed term should capture as rich and realistic interactions between the discrete and continuous variables as possible, while keeping model inference and acquisition function optimization efficient. To this end, we stack products of all pairwise combinations of features of the two variable types, i.e. $\boldsymbol{\phi}^{m}(\mathbf{x}^{d},\mathbf{x}^{c})\triangleq[\phi^{d}_{i}(\mathbf{x}^{d})\cdot\phi^{c}_{j}(\mathbf{x}^{c})]^{\top}_{1\leq i\leq M_{d},1\leq j\leq M_{c}}$ . This formulation provides a good trade-off between modeling accuracy and computational complexity. In particular, it allows us to reduce $\boldsymbol{\phi}^{m}$ to the discrete feature representation $\boldsymbol{\phi}^{d}$ when conditioned on a fixed assignment of continuous variables $\boldsymbol{\phi}^{c}$ (and vice versa). This property is crucial for optimizing the acquisition function, as it allows us to optimize the mixed term of our model by leveraging the tools for optimizing the discrete and continuous parts individually. The proposed representation contains $M_{d}M_{c}$ features, resulting in a total of $M=M_{d}+M_{c}+M_{d}M_{c}$ . To reduce model complexity, prior knowledge about the problem can be incorporated into the construction of the mixed features. In particular, one may consider the following approaches. Firstly, one can exploit a known interaction structure between variables, e.g., in form of a dependency graph, and ignore the features that are known to be irrelevant. Secondly, one can start by including all of the proposed pairwise feature combinations and progressively discard not-promising ones. Finally, for high-dimensional problems, one can do the opposite and progressively add pairwise feature combinations, starting from the empty set.

3.2 Model Inference

Let $\mathbf{X}_{1:t}\in\mathbb{R}^{t\times D}$ be the matrix whose $i^{\text{th}}$ row contains the input $\mathbf{x}_{i}\in\mathcal{X}$ queried at iteration $i$ , $\dim\mathcal{X}=D$ , and let $\mathbf{y}_{1:t}=[y_{1},\dots,y_{t}]^{\top}\in\mathbb{R}^{t}$ be the array of the corresponding noisy function observations. Also, let $\mathbf{\Phi}_{1:t}\in\mathbb{R}^{t\times M}$ be the matrix whose $i^{\text{th}}$ row contains the featurized input $\boldsymbol{\phi}(\mathbf{x}_{i})\in\mathbb{R}^{M}$ . The formulation of $f$ in Eq. (2) and the noisy observation model induce the Gaussian likelihood $p(\mathbf{y}_{1:t}|\mathbf{X}_{1:t},\mathbf{w})=\mathcal{N}(\mathbf{\Phi}_{1:t}\mathbf{w},\beta^{-1}\mathbf{I})$ . To reflect our a priori belief about the weight vector $\mathbf{w}$ and thus $f$ , we specify a prior distribution over $\mathbf{w}$ . A natural choice for this is a zero-mean isotropic Gaussian prior $p(\mathbf{w}|\alpha)=\mathcal{N}(\mathbf{0},\alpha^{-1}\mathbf{I})$ , with precision $\alpha>0$ , which encourages $\mathbf{w}$ to be uniformly small, so that the final predictor is a sum of all features, each giving a small, non-zero contribution. Given the likelihood and prior, we infer the posterior $p(\mathbf{w}|\mathbf{X}_{1:t},\mathbf{y}_{1:t},\alpha,\beta)\propto p(\mathbf{y}_{1:t}|\mathbf{X}_{1:t},\mathbf{w},\beta)p(\mathbf{w}|\alpha)$ , which due to conjugacy is Gaussian, $p(\mathbf{w}|\mathbf{X}_{1:t},\mathbf{y}_{1:t})=\mathcal{N}(\mathbf{m},\mathbf{S}^{-1})$ , with mean $\mathbf{m}=\beta\mathbf{S}^{-1}\mathbf{\Phi}_{1:t}^{\top}\mathbf{y}_{1:t}\in\mathbb{R}^{M}$ and precision $\mathbf{S}=\alpha\mathbf{I}+\beta\mathbf{\Phi}_{1:t}^{\top}\mathbf{\Phi}_{1:t}\in\mathbb{R}^{M\times M}$ williams2006gaussian. This simple analytical treatment of the posterior distribution over $\mathbf{w}$ is a main benefit of this model, which can be viewed as a GP with a linear kernel in feature space.

3.3 Acquisition Function

We propose to use Thompson sampling (TS) thompson1933, which samples weights $\widetilde{\mathbf{w}}\sim p(\mathbf{w}|\mathbf{X}_{1:t},\mathbf{y}_{1:t},\alpha,\beta)$ from the posterior and chooses the next input by solving $\widehat{\mathbf{x}}\in\operatorname{arg\ min}_{\mathbf{x}\in\mathcal{X}}\widetilde{\mathbf{w}}^{\top}\boldsymbol{\phi}(\mathbf{x})$ . TS intuitively focuses on inputs that are plausibly optimal and has previously been successfully applied in both discrete and continuous domains baptista2018; mutny2018.

TS requires solving $\widehat{\mathbf{x}}\in\operatorname{arg\ min}_{\mathbf{x}\in\mathcal{X}}\widetilde{\mathbf{w}}_{t}^{\top}\boldsymbol{\phi}(\mathbf{x})$ , which is a challenging mixed-variable optimization problem. However, as $\widetilde{\mathbf{w}}_{t}^{\top}\boldsymbol{\phi}(\mathbf{x})$ decomposes as in Eq. (2), we can naturally use an alternating optimization scheme which iterates between optimizing the discrete variables $\mathbf{x}^{d}$ conditioned on a particular setting of the continuous variables $\mathbf{x}^{c}$ and vice versa, until convergence to some local optimum. While this scheme provides no theoretical guarantees, it is simple and thus widely and effectively applied in many contexts where the objective is hard to optimize. In particular, we iteratively solve $\widehat{\mathbf{x}}^{d}\in\operatorname{arg\ min}_{\mathbf{x}^{d}\in\mathcal{X}^{d}}\big{(}{{}\widetilde{\mathbf{w}}^{d}}^{\top}\boldsymbol{\phi}^{d}(\mathbf{x}^{d})+{{}\widetilde{\mathbf{w}}^{m}}^{\top}\boldsymbol{\phi}^{m}(\mathbf{x}^{d},\mathbf{x}^{c}=\widehat{\mathbf{x}}^{c})\big{)}$ , $\widehat{\mathbf{x}}^{c}\in\operatorname{arg\ min}_{\mathbf{x}^{c}\in\mathcal{X}^{c}}\big{(}{{}\widetilde{\mathbf{w}}^{c}}^{\top}\boldsymbol{\phi}^{c}(\mathbf{x}^{c})+{{}\widetilde{\mathbf{w}}^{m}}^{\top}\boldsymbol{\phi}^{m}(\mathbf{x}^{d}=\widehat{\mathbf{x}}^{d},\mathbf{x}^{c})\big{)}$ . Importantly, using the mixed features proposed in Sec. 3.1, these problems can be optimized by purely discrete and continuous optimizers, respectively. This also holds in the presence of mixed constraints $g^{m}(\mathbf{x})\geq 0$ if those decompose accordingly into discrete and continuous constraints.

This scheme leverages independent subroutines for discrete and continuous optimization: For the discrete part, we exploit the fact that optimizing a second-order pseudo-Boolean function is equivalent to a binary integer quadratic program (IQP) boros2002, allowing us to exploit commonly-used efficient and robust solvers such as Gurobi or CPLEX. While solving general binary IQPs is NP-hard boros2002, these optimizers are in practice very efficient for the dimensionalities we consider (i.e., $D_{d}<100$ ). This approach allows us to use any functionality offered by these tools, such as the ability to optimize objectives subject to linear constraints $\mathbf{A}\mathbf{x}^{d}\leq\mathbf{b}$ , $\mathbf{A}\in\mathbb{R}^{K\times D_{d}},\mathbf{b}\in\mathbb{R}^{K}$ or quadratic constraints ${\mathbf{x}^{d}}^{\top}\mathbf{Q}\mathbf{x}^{d}+\mathbf{q}^{\top}\mathbf{x}^{d}\leq b$ , $\mathbf{Q}\in\mathbb{R}^{D_{d}\times D_{d}},\mathbf{q}\in\mathbb{R}^{D_{d}},b\in\mathbb{R}$ . For the continuous part, one can use optimizers commonly used in continuous BO, such as L-BFGS or DIRECT. In our experiments, we use Gurobi as the discrete and L-BFGS as the continuous solver within the alternating optimization scheme, which we always run until convergence.

3.4 Model Discussion

BO algorithms are comprised of three major design choices: the surrogate model to estimate the objective, the acquisition function to measure informativeness of the inputs and the acquisition function optimizer to select queries. Due to the widespread availability of general-purpose optimizers for continuous functions, continuous BO is mostly concerned with the first two design dimensions. However, this is different for mixed-variable constrained problems. We show in Sec. 4 that using a heuristic optimizer for the acquisition function optimization leads to poor queries and, therefore, poor performance of the BO algorithm. Therefore, the tractability of the acquisition function optimization influences and couples the other design dimensions. In particular, the following considerations make the choice of a linear model and TS the ideal combination of surrogate and acquisition function for our problem. Firstly, the linear model is preferable to a GP with a mixed-variable kernel as the latter would complicate the acquisition function optimization for two reasons: (i) the posterior samples would be arbitrary nonlinear functions of the discrete variables and (ii) it would be non-trivial to evaluate them at arbitrary points in the domain. In contrast, our explicit feature expansion solves both problems, while second order interactions provide a valid discrete function representation baptista2018; hazan2017 and lead to tractable quadratic MIPs with capacity for complex discrete constraints. Moreover, Random Fourier Features approximate common GP kernels arbitrarily well, and inference in MiVaBo scales linearly with the number of data points, making it applicable in cases where GP inference, which scales cubically with the number of data points, would be prohibitive. Secondly, TS induces a simple relation between the surrogate and the resulting optimization problem for the acquisition function, allowing to trade off model expressiveness and optimization tractability, which is a key challenge in mixed-variable domains. Finally, the combination of TS and the linear surrogate facilitates the convergence analysis described in Sec. 3.5, making MiVaBo the first mixed-variable BO method with theoretical guarantees.

3.5 Convergence Analysis

Using a linear model and Thompson sampling, we can leverage convergence analysis from linearly parameterized multi-armed bandits, a well-studied class of methods for solving structured decision making problems abeille2017. These also assume the objective to be linear in features $\boldsymbol{\phi}(\mathbf{x})\in\mathbb{R}^{M}$ with a fixed but unknown weight vector $\mathbf{w}\in\mathbb{R}^{M}$ , i.e. $\mathbb{E}[f(\mathbf{x})|\boldsymbol{\phi}(\mathbf{x})]=\mathbf{w}^{\top}\boldsymbol{\phi}(\mathbf{x})$ , and aim to minimize the total regret up to time $T$ : $\mathcal{R}(T)=\sum_{t=1}^{T}(f(\mathbf{x}_{*})-f(\mathbf{x}_{t}))$ . We obtain the following regret bound for MiVaBo:

Proposition 1.

Assume that the following assumptions hold in every iteration $t=1,\ldots,T$ of the MiVaBo algorithm:

$\widetilde{\mathbf{w}}_{t}\sim\mathcal{N}(\mathbf{m},24M\ln T\ln\frac{1}{\delta}\mathbf{S}^{-1})$ , i.e. with scaled variance. 2. 2.

$\mathbf{x}_{t}=\arg\min_{\mathbf{x}}\widetilde{\mathbf{w}}^{\top}\boldsymbol{\phi}(\mathbf{x})$ * is selected exactly.222To this end, one can use more expensive but theoretically backed optimization methods instead of the alternating one, such as the powerful and popular dual decomposition sontag2011.* 3. 3.

$\|\widetilde{\mathbf{w}}_{t}\|_{2}\leq c,\|\boldsymbol{\phi}(\mathbf{x}_{t})\|_{2}\leq c,\|f(\mathbf{x}_{*})-f(\mathbf{x}_{t})\|_{2}\leq c\$ , $c\in\mathbb{R}^{+}$ .

Then, $\mathcal{R}(T)\leq\tilde{\mathcal{O}}\left(M^{3/2}\sqrt{T}\ln\frac{1}{\delta}\right)$ with probability $1-\delta$ .

Prop. 1 follows from Theorem 1 in abeille2017 and works for infinite arms $\mathbf{x}\in\mathcal{X},|\mathcal{X}|=\infty$ . In our setting, both the discrete and continuous Fourier features (and, thus, the mixed features) satisfy the standard boundedness assumption, such that the proof indeed holds. Prop. 1 implies no-regret, $\lim_{T\rightarrow\infty}\mathcal{R}(T)/T=0$ , i.e., convergence to the global minimum, since the minimum found after $T$ iterations is no further away from $f(\mathbf{x}_{*})$ than the mean regret $\mathcal{R}(T)/T$ . To our knowledge, MiVaBo is the first mixed-variable BO algorithm for which such a guarantee is known to hold.

4 Experiments

We present experimental results on tuning the hyperparameters of two machine learning algorithms, namely gradient boosting and a deep generative model, on multiple datasets.

Experimental Setup.

For MiVaBo333We provide a Python implementation of MiVaBo at https://github.com/edaxberger/mixed_variable_bo., we set the prior variance $\alpha$ , observation noise variance $\beta$ , and kernel bandwidth $\sigma$ to 1.0, and scale the variance as stated in Prop. 1. We compare against SMAC, TPE, random search, and the popular GPyOpt BO package. GPyOpt uses a GP model with the upper confidence bound acquisition function srinivas09, and accounts for mixed variables by relaxing discrete variables to be continuous and later rounding them to the nearest discrete neighbor. To separate the influence of model choice and acquisition function optimization, we also consider the MiVaBo model optimized by simulated annealing (SA) (MiVaBo-SA) and the GP approach optimized by SA (GP-SA). We compare against the SA-based variants only in constrained settings, using more principled methods in unconstrained ones. To handle constraints, SA assigns high energy values to invalid inputs, making the probability of moving there negligible. We use SMAC, TPE and GPyOpt and SA with their respective default settings.

4.1 Gradient Boosting Tuning

The OpenML database vanschoren2014 contains evaluations for various machine learning methods trained on several datasets with many hyperparameter settings. We consider extreme gradient boosting (XGBoost) chen2016, one of the most popular OpenML benchmarks, and tune its ten hyperparameters – three are discrete and seven continuous – to minimize the classification error on a held-out test set (without any constraints). We use two datasets, each containing more than $45000$ hyperparameter settings. To evaluate hyperparameter settings for which no data is available, we use a surrogate modeling approach based on nearest neighbor eggensperger2015, meaning that the objective returns the error of the closest (w.r.t. Euclidean distance) setting available in the dataset. Section 4 shows that MiVaBo achieves performance which is either significantly stronger than (left dataset) or competitive with (right dataset) the state-of-the-art mixed-variable BO algorithms on this challenging task. GPyOpt performs poorly, likely because it cannot account for discrete variables in a principled way. As compared to TPE and SMAC, MiVaBo seems to benefit from more sophisticated uncertainty estimation.

4.2 Deep Generative Model (DGM) Tuning

DGMs recently received considerable attention in the machine learning community. Despite their popularity and importance, effectively tuning their hyperparameters is a major challenge. We consider tuning the hyperparameters of a variational autoencoder (VAE) kingma2013 composed of a convolutional encoder and a deconvolutional decoder salimans2014. The VAEs are evaluated on stochastically binarized MNIST, as in burda2015, and FashionMNIST. They are trained on 60000 images for 32 epochs, using Adam with a mini-batch size of 128. We report the negative log-likelihood (NLL; in nats) achieved by the VAEs on a held-out test set of 10000 images, as estimated via importance sampling using 32 samples per test point. To our knowledge, no other BO paper considered DGM tuning.

VAE tuning is difficult due to the high-dimensional and structured nature of its hyperparameter space, and, in particular, due to constraints arising from dependencies between some of its parameters. We tune 25 discrete parameters defining the model architecture, e.g. the number of convolutional layers, their stride, padding and filter size, the number and width of fully-connected layers, and the latent space dimensionality. We further tune three continuous parameters for the optimizer and regularization. Crucially, mutual dependencies between the discrete parameters result in complex constraints, as certain combinations of stride, padding and filter size lead to invalid architectures. Particularly, for the encoder, the shapes of all layers must be integral, and for the decoder, the output shape must match the input data shape, i.e., one channel of size $28\times 28$ for {Fashion}MNIST. The latter constraint is especially challenging, as only a small number of decoder configurations yield the required output shape. Thus, even for rather simple datasets such as {Fashion}MNIST, tuning such a VAE is significantly more challenging than, say, tuning a convolutional neural network for classification.

While MiVaBo can conveniently capture these restrictions via linear and quadratic constraints, the competing methods cannot. To enable a comparison that is as fair as possible, we thus use the following sensible heuristic to incorporate the knowledge about the constraints into the baselines: If a method tries to evaluate an invalid parameter configuration, we return a penalty error value, which will discourage a model-based method to sample this (or a similar) setting again. However, for fairness, we only report valid observations and ignore all configurations that violated a constraint. We set the penalty value to 500 nats, which is the error incurred by a uniformly random generator. We investigated the impact of the penalty value (e.g., we also tried 250 and 125 nats) and found that it does not qualitatively affect the results.

Fig. 4 shows that MiVaBo significantly outperforms the competing methods on this task, both on MNIST (left) and FashionMNIST (right). This is because MiVaBo can naturally encode the constraints and thus directly optimize over the feasible region in parameter space, while TPE, SMAC and GPyOpt need to learn the constraints from data. They fail to do so and get stuck in bad local optima early on. The model-based approaches likely struggle due to sharp discontinuities in hyperparameter space induced by the constraint violation penalties (i.e., as invalid configurations may lie close to well-performing configurations). In contrast, random search is agnostic to these discontinuities, and thus notably outperforms the model-based methods. Lastly, GP-SA and MiVaBo-SA struggle as well, suggesting that while SA can avoid invalid inputs, the effective optimization of complex constrained objectives crucially requires more principled approaches for acquisition function optimization, such as the one we propose. This shows that all model choices for MiVaBo (as discussed in Sec. 3.4) are necessary to achieve such strong results.

Although log-likelihood scores allow for a quantitative comparison, they are hard to interpret for humans. Thus, for a qualitative comparison, Fig. 2 visualizes the reconstruction quality achieved on MNIST by the best VAE configuration found by all methods after 32 BO iterations. The VAEs were trained for 32 epochs each, as in Fig. 4. The likelihoods seem to correlate with the quality of appearance, and the model found by MiVaBo arguably produces the visually most appealing reconstructions among all models. Note that while MiVaBo requires more time than the baselines (see Fig. 4), this is still negligible compared to the cost of a function evaluation, which involves training a deep generative model. Finally, the best VAE found by MiVaBo achieves $84.25$ nats on MNIST when trained for 3280 epochs and using 5000 importance samples for log-likelihood estimation, i.e. the setting used in burda2015 (see Fig. 4). This is comparable to the performance of 82-83 nats achieved by human expert tuned models, e.g. as reported in salimans2014 (which use even more convolutional layers and a more sophisticated inference method), highlighting MiVaBo’s effectiveness in tuning complex deep neural network architectures.

5 Conclusion

We propose MiVaBo, the first method for efficiently optimizing expensive mixed-variable black-box functions subject to linear and quadratic discrete constraints. MiVaBo combines a linear model of expressive features with Thompson sampling, making it simple yet effective. Moreover, it is highly flexible due to the modularity of its components, i.e., the mixed-variable features, and the optimization oracles for the acquisition procedure. This allows practitioners to tailor MiVaBo to specific objectives, e.g. by incorporating prior knowledge in the feature design or by leveraging optimizers handling specific types of constraints. We show that MiVaBo enjoys theoretical convergence guarantees that competing methods lack. Finally, we empirically demonstrate that MiVaBo significantly improves optimization performance as compared to state-of-the-art methods for mixed-variable optimization on complex hyperparameter tuning tasks.

Acknowledgements

This research has been partially supported by SNSF NFP75 grant 407540_167189. Matteo Turchetta was supported through the ETH-MPI Center for Learning Systems. Erik Daxberger was supported through the EPSRC and Qualcomm. The authors thank Josip Djolonga, Mojmír Mutný, Johannes Kirschner, Alonso Marco Valle, David R. Burt, Ross Clarke, Wolfgang Roth as well as the anonymous reviewers of an earlier version of this paper for their helpful feedback.

References

Appendix A Sparse Linear Model via Sparsity-Encouraging Prior

While the number and degree of the features used in the surrogate model (see Section 3.1) is a design choice, in practice it is typically unknown which variable interactions matter and thus which features to choose. To discard irrelevant features, one may impose a sparsity-encouraging prior over the weight vector $\mathbf{w}$ baptista2018. However, due to non-conjugacy to the Gaussian likelihood, exact Bayesian inference of the resulting posterior distribution is in general intractable, imposing the need for approximate inference methods. One choice for such a prior is the Laplace distribution, i.e. $p(\mathbf{w}|\alpha)\propto\exp(-\alpha^{-1}\|\mathbf{w}\|_{1})$ , with inverse scale parameter $\alpha>0$ , for which approximate inference techniques based on expectation propagation minka2001 and variational inference wainwright2008 were developed in seeger2008a; seeger2008b; seeger2011. Alternatively, one can use a horseshoe prior and use Gibbs sampling to sample from the posterior over weights baptista2018. However, this comes with a significantly larger computational burden, which is a well-known issue for sampling based inference techniques bishop2006. Lastly, one may consider a spike-and-slab prior with expectation propagation for approximate posterior inference hernandez2013generalized; hernandez2015expectation.

Appendix B Pseudocode for Thompson Sampling

Algorithm 1 shows pseudocode for the Thompson sampling procedure within MiVaBo.

Appendix C Further Experimental Results

C.1 Synthetic Benchmark for Unconstrained Optimization

We assess the performance on an unconstrained synthetic linear benchmark function of the form $f(\mathbf{x})=\mathbf{w}^{\top}\boldsymbol{\phi}$ . We choose a fairly high-dimensional objective with $D_{d}=8$ discrete and $D_{c}=8$ continuous variables, thus resulting in a total input space dimensionality of $D=D_{d}+D_{c}=16$ . For the discrete model part, we choose the $M_{d}\in\mathcal{O}(D_{d}^{2})$ features $\boldsymbol{\phi}^{d}$ proposed in Section 3.1. For the continuous features $\boldsymbol{\phi}^{c}$ , we choose $M_{c}=16$ dimensional Random Fourier Features to approximate a GP with a squared exponential kernel with bandwidth $\sigma=1.0$ . For the mixed representation, we construct a feature vector $\boldsymbol{\phi}^{m}$ by stacking all pairs of discrete and continuous features, as proposed in Section 3.1. We consider two settings for the weight vector: Firstly, we sample it from a zero-mean Gaussian, $\mathbf{w}\sim\mathcal{N}(\mathbf{0},\mathbf{I})\in\mathbb{R}^{M}$ . Secondly, we sample it from a Laplace distribution, i.e. $\mathbf{w}\sim p(\mathbf{w}|\alpha)\propto\exp(-\alpha^{-1}\|\mathbf{w}\|_{1})$ , with inverse scale parameter $\alpha=0.1$ , and then prune all weights smaller than $10$ to zero to induce sparsity over the weight vector. For the second setting, we also assess MiVaBo using a Laplace prior and the approximate inference technique from seeger2011 (see also Appendix A)444We use the MATLAB implementation provided in the glm-ie toolbox nickisch2012 by the same authors.. As we do not know the true optimum of the function and thus cannot compute the regret, we normalize all observed function values to the interval $[0,1]$ , resulting in a normalized error as the metric of comparison. We can observe from our results shown in Fig. 5 that MiVaBo outperforms the competing methods in this setting, demonstrating the effectiveness of our approach when its modeling assumptions are fulfilled.

C.2 Synthetic Benchmark for Constrained Optimization

In another experiment, we demonstrate the capability of our algorithm to incorporate linear constraints on the discrete variables. In particular, we want to enforce a solution that is sparse in the discrete variables via adding a hard cardinality constraint of the type $\sum_{i=1}^{D_{d}}x^{d}_{i}\leq k$ , which we can simply specify in the Gurobi optimizer. Cardinality constraints of this type are very relevant in practice, as many real-world problems desire sparse solutions (e.g., sparsification of ising models, contamination control, aero-structural multi-component problems baptista2018). We consider the same functional form as before, i.e. again with $D_{d}=D_{c}=8$ , and set $k=2$ , meaning that a solution should have at most two of our binary variables set to one, while all others shall be set to zero. To enable comparison with TPE, SMAC and random search, which provide no capability of modeling these kinds of constraints, we assume the objective $f$ to be unconstrained, but instead return a large penalty value if a method acquires an evaluation of $f$ at a point that violates the constraint. Thus, the baseline algorithms are forced to learn the constraint from observations, which is a challenging problem.

One can notice from Fig. 6 that the ability to explicitly encode the cardinality constraint into the discrete optimization oracle significantly increases performance.

Appendix D More Details on XGBoost Hyperparameter Tuning Task

Please refer to the corresponding websites for details on the OpenML XGBoost benchmark (https://www.openml.org/f/6767), on the underlying implementation (https://www.rdocumentation.org/packages/xgboost/versions/0.6-4), and on the steel-plates-fault (https://www.openml.org/t/9967) and monks-problem-1 (https://www.openml.org/t/146064) datasets. Finally, see Table 1 for a description of the hyperparameters involved in XGBoost.

Appendix E More Details on VAE Hyperparameter Tuning Task

E.1 Hyperparameters of VAE

We used the PyTorch library to implement the VAE used in the experiment. Table 2 describes the names, types and domains of the involved hyperparameters that we tune. Whenever we refer to a ”deconvolutional layer” (also called transposed convolution or fractionally-strided convolution), we mean the functional mapping implemented by a ConvTranspose2d layer in PyTorch555See https://pytorch.org/docs/stable/nn.html#convtranspose2d for details.. Since our approach operates on a binary encoding of the discrete parameters, we also display the number of bits required to encode each discrete parameter. In total, we consider 25 discrete parameters (resulting in 50 when binarized) as well as three continuous ones.

E.2 Description of Constraints

We now describe the constraints arising from the mutual dependencies within the hyperparameter space of the deconvolutional VAE (as described in Section E.1).

Encoder constraints.

For the convolutional layers (up to two in our case) of the encoder, we need to ensure that the chosen combination of stride, padding and filter size transforms the input image into an output image whose shape is integral (i.e., not fractional). More precisely, denoting the input image size by $W_{\text{in}}$ (i.e., the input image is quadratic with shape $W_{\text{in}}\times W_{\text{in}}$ ), the stride by $S$ , the filter size by $F$ , and the padding by $P$ , we need to ensure that the output image size $W_{\text{out}}$ is integral, i.e.

[TABLE]

where superscripts $e$ are used to make clear that we are considering the encoder. Let us illustrate this with an example666This example is taken from http://cs231n.github.io/convolutional-networks/#conv (paragraph ”Constraints on strides”), which also describes the constraints discussed here. Note that they define the padding $P$ in a slightly different way (i.e., they only consider symmetric padding, while we also allow for asymmetric padding) and thus end up with a term of $2P$ instead of $P$ in the formula.: For $W_{\text{in}}=10$ , $P=0$ , $S=2$ and $F=3$ , we would get an invalid fractional output size of $W_{\text{out}}=(10-3+0)/2+1=4.5$ . To obtain a valid output size, one could, e.g., instead consider a padding of $P=1$ , yielding $W_{\text{out}}=(10-3+1)/2+1=5$ . Alternatively, one could also consider a stride of $S=1$ to obtain $W_{\text{out}}=(10-3+0)/1+1=8$ , or a filter size of $F=4$ to obtain $W_{\text{out}}=(10-4+0)/2+1=4$ (though the latter is very uncommon and thus not allowed in our setting; we only allow $F\in\{3,5\}$ , as described in Section E.1). While this constraint is not trivially fulfilled (which can be verified by manually trying different configurations of $W_{\text{in}},F,S,P$ ), it is also not too challenging to find valid configurations.

Note that this constraint is required to be fulfilled for every convolutional layer; we thus obtain the following two constraints in our specific two-layer setting, where $W_{\text{in}}=28$ (as MNIST and FashionMNIST images are of shape $28\times 28$ ):

[TABLE]

where the subscripts in $\{1,2\}$ denote the index of the convolutional layer.

Finally, observe that the constraints in Eq. (4) and Eq. (5) are, respectively, linear and quadratic in the discrete variables $F^{e}_{1},F^{e}_{2},P^{e}_{1},P^{e}_{2},S^{e}_{1},S^{e}_{2}$ , and can thus be readily incorporated into the integer programming solver (e.g. Gurobi optimization2014 or CPLEX cplex2009v12) we employ as a subroutine within our acquisition function optimization strategy.

Decoder constraints.

While the constraints on the decoder architecture are similar in nature to those for the encoder, they are significantly more difficult to fulfill, which we will now illustrate.

In particular, we need to ensure that the decoder produces images of shape $28\times 28$ . By inverting the formula in Eq. (3), we see that for a deconvolutional layer (which intuitively implements an inversion of the convolution operation), the output image size $W_{\text{out}}$ can be computed as

[TABLE]

where superscripts $d$ are used to make clear that we are considering the decoder, and where $O$ is an additional output padding parameter which can be used to adjust the shape of the output image777See e.g. https://pytorch.org/docs/stable/nn.html#convtranspose2d for a description of the output padding in the context of the PyTorch library we use.. Note that we now have a factor of $2P$ in Eq. (6) instead of $P$ (as for the encoder, i.e. in Eq. (3)), since we only consider symmetric padding for the decoder, while we allow for asymmetric padding for the encoder (to make it easier to fulfill the integrality constraints for the encoder due to an increased number of valid configurations). The output padding parameter $O$ is required since the mapping from $W^{e}_{\text{in}}$ to $W^{e}_{\text{out}}$ in a convolutional layer (i.e. in the encoder) is not bijective: there are different combinations of $W^{e}_{\text{in}},F,S,P$ that result in the same $W^{e}_{\text{out}}$ (which can be easily verified). Thus, given an output size $W^{e}_{\text{out}}$ (now serving as the input size $W^{d}_{\text{in}}$ of the deconvolutional layer), there is no unique corresponding input size $W^{e}_{\text{in}}$ (now serving as the output size $W^{d}_{\text{out}}$ of the deconvolutional layer). The output padding parameter $O$ can thus be used to disambiguate this relation. Note that $W^{d}_{\text{out}}$ in Eq. (6) is always integral, so there are no integrality constraints involved here, in constrast to the encoder.

In the context of our decoder model, i.e. with up to two deconvolutional layers, and with a required output image size of $28$ , we thus obtain the following constraints:

[TABLE]

i.e. we need to choose the parameters $F^{d}_{1},F^{d}_{2},P^{d}_{1},P^{d}_{2},S^{d}_{1},S^{d}_{2},O^{d}_{1},O^{d}_{2}$ such that the output size is 28, which is challenging, as only a small number of parameter configurations fulfill this property. While this problem is already challenging when assuming a given fixed input image shape $W_{\text{in}}^{d}$ , in our setting it is more difficult, as $W_{\text{in}}^{d}$ has to be of a suitable size as well. Note that $W_{\text{in}}^{d}$ is determined by the size of the fully-connected layer preceding the first deconvolutional layer, which yields an additional challenge: the size of the last fully-connected layer has to be set such that it can be resized to an image of shape $C^{d}_{1}\times W_{\text{in}}^{d}\times W_{\text{in}}^{d}$ (i.e., such that it can be fed into a deconvolutional layer), where $C^{d}_{1}$ denotes the number of channels of the first deconvolutional layer of the decoder. As the resulting problem would be too challenging for any algorithm to produce a valid solution in a reasonable amount of time, we simplify it slightly by only treating $C^{d}_{1}$ as a design parameter (as described in Section E.1), but keeping $W^{d}_{\text{in}}=7$ fixed. The value 7 is chosen since $16\times 7\times 7=784$ , i.e., when setting $C^{d}_{1}=16$ , the last fully-connected layer has the correct output shape (since $28\times 28=784$ for an MNIST and FashionMNIST image). This way, a valid decoder architecture can be achieved by deactivating all convolutional layers and choosing $C^{d}_{1}=16$ , constituting an alternative if fulfilling the decoder constraints in Eq. (7) and Eq. (8) is too challenging for an algorithm.

Finally, the constraints in Eq. (7) and Eq. (8) are, respectively, linear and quadratic in the discrete variables $F^{d}_{1},F^{d}_{2},P^{d}_{1},P^{d}_{2},S^{d}_{1},S^{d}_{2},O^{d}_{1},O^{d}_{2}$ , which again allows us to incorporate them into our optimization routine.

E.3 Effect of Different Constraint Violation Penalty Values

We now analyze the effect of the constraint violation penalty value on the performance of SMAC, TPE and GPyOpt. Note that random search and MiVaBo are not affected by the penalty. We do this analysis to show that the choice of penalty does not qualitatively affect the reported results. In addition to the penalty of 500 nats considered in the experiments in the main paper, we assessed two smaller alternative penalties of 250 nats and 125 nats, respectively. The results in Table 3 show that the performance of the methods improves marginally with decreasing penalty values. This can be intuitively explained by the fact that the smaller the penalty, the smaller the region in hyperparameter space that the penalty discourages from searching. In fact, a large penalty may not only discourage infeasible configurations, but also feasible configurations that lie ”close” to the penalized infeasible one (where closeness is defined by the specific surrogate model employed by the method). However, even for the smallest penalty of 125 nats, SMAC, TPE and GPyOpt still perform worse than random search, and thus still significantly worse than MiVaBo. Imposing penalties that are significantly smaller than 125 is not sensible, as this will encourage the model-based methods to violate the constraints, and in turn discourage them from ever evaluating a valid configuration (as this would yield a worse score).

Finally, Table 4 shows the number of constraint violations by the different methods, depending on the violation penalty.

E.4 Visualization of Reconstruction Quality

While log-likelihood scores allow for a principled quantitative comparison between different algorithms, they are typically hard to interpret for humans. We thus in Fig. 7 visualize the reconstruction quality achieved by the best VAE configuration found by the different methods after 32 BO iterations. The VAEs were trained for 32 epochs each (as in the BO experiments). The log-likelihood scores seem to be correlated with quality of visual appearance, and the model found by MiVaBo thus may be perceived to produce the visually most appealing reconstructions among all models.

Appendix F Discussion and Details on Baselines in Empirical Evaluation

We decided to compare against SMAC hutter2011 and TPE bergstra2011a, as these are state-of-the-art mixed-variable BO methods. We used their publicly available Python implementations under https://github.com/automl/SMAC3 (SMAC) and https://github.com/hyperopt/hyperopt (TPE). We furthermore compare against the popular popular GPyOpt BO Python package gonzalez2016 (https://github.com/SheffieldML/GPyOpt) as a reference implementation of a state-of-the-art continuous BO method (which extends to the mixed-variable setting via relaxation and rounding of the discrete variables). We use these Python packages with their respective default settings. Moreover, to isolate the benefit of the model choice from the acquisition function optimization procedure, we consider baselines that, respectively, use the MiVaBo and GP models, and optimize the resulting acquisition function using simulated annealing (SA) kirkpatrick1983. For the baseline that combines a GP with simulated annealing, we use the popular GPy Python package gpy2014 (which also serves as the GP backend of GPyOpt) together with the simulated annealing implementation at https://github.com/perrygeo/simanneal. Finally, we compare against random search (using a custom implementation due to its simplicity), which has been shown to be an effective baseline for hyperparameter optimization bergstra2011a.

There are several other methods which address problem settings related to the (constrained) mixed-variable paradigm we consider. We here briefly clarify why we decided to not compare against them in our empirical evaluation. Firstly, baptista2018; oh2019; kim2019bayesian extend BO to tackle purely discrete/combinatorial problems; these approaches can thus not straightforwardly handle the continuous variables present in mixed-variable problems. ru2019bayesian address BO problems with multiple continuous and categorical input variables (i.e. unordered ones), whereas MiVaBo includes ordered discrete variables such as integer variables. As pointed out in Section 1, Hyperband li2016 and BOHB falkner2018 are complementary to MiVaBo in that they do not propose new mixed-variable methods, but rather extend existing ones (such as random search and TPE) to the multi-fidelity setting; they should thus not be perceived as competing methods. The work of garrido2018 extends GP-based BO to integer variables, but cannot handle discrete constraints. While several works hernandez2015c; gardner2014; sui2015safe propose extensions of continuous BO methods to handle unknown constraints, they can neither handle mixed-variable problems nor known (discrete) constraints, and might thus again be viewed as complementary to our approach.

Finally, a recent line of work extends continuous BO methods to general highly structured input spaces such as graphs or images (which also includes mixed discrete-continuous problems), by first training a deep generative model such as a VAE on the input data, and then using standard continuous BO methods in the continuous latent space learned by the VAE gomez2018. This so-called latent space optimization approach has recently been successfully applied to application domains including automatic chemical design and automatic machine learning gomez2018; kusner2017; nguyen2016synthesizing; luo2018; lu2018; jin2018junction; tripp2020sample, and might thus be perceived to be a promising method for the mixed-variable hyperparameter tuning tasks we consider in this paper. However, despite these successes, the latent space optimization paradigm is at an early stage and current methods still suffer from critical shortcomings. One of the most severe issues is that the BO procedure tends to progress into regions of the latent space that are too far away from the regions corresponding to the training data, which often results in the BO method suggesting meaningless or even invalid inputs to query (e.g. unreasonable/invalid hyperparameter configurations). Despite recent efforts attempting to mitigate this issue kusner2017; griffiths2017; dai2018syntax; daxberger2019bayesian; mahmood2019, a robust and principled solution has yet to be found. This issue also reveals that the latent space optimization paradigm makes it difficult to incorporate (discrete) constraints, as the optimization is performed in a learned continuous latent space rather than in the original input space (over which the constraints are defined). As a result, we decided to not compare against latent space optimization methods at this stage, although we point out that this would be an interesting direction for future work.

Appendix G Acquisition Function Optimization with Theoretical Guarantees via Dual Decomposition

As an alternative to the alternating optimization scheme proposed in Section 3.3, one can also minimize the acquisition function in Eq. 2 via dual decomposition - a powerful approach based on Lagrangian optimization, which has well-studied theoretical properties and has been successfully used for many different problems komodakis2011; sontag2011; rush2012. Despite its versatility, the core idea is simple: decompose the initial problem into smaller solvable subproblems and then extract a solution by cleverly combining the solutions from these subproblems komodakis2011. This requires the following two components: (1) A set of subproblems which are defined such that their sum corresponds to the optimization objective, and which can each be optimized globally, and (2) a so-called master problem that coordinates the subproblems to find a solution to the original problem. One major advantage of dual decomposition algorithms is that they have well-understood theoretical properties888For details, we refer the interested reader to komodakis2011; sontag2011; rush2012, in particular through connections to linear programming (LP) relaxations. In fact, they enjoy the best theoretical guarantees in terms of convergence properties, when compared to other algorithms solving this problem komodakis2011. These theoretical properties further facilitate the convergence analysis of MiVaBo outlined in Section 3.5, making dual decomposition algorithms particularly useful for settings where optimization accuracy is of crucial importance.

We now describe how to devise a dual decomposition for our problem, by demonstrating how it can be reformulated in terms of master- and sub-problems (see Appendix H for a detailed derivation). For convenience, let us denote the discrete, continuous and mixed parts of Eq. (2) by $f^{d}(\mathbf{x}^{d})={\mathbf{w}^{d}}^{\top}\boldsymbol{\phi}^{d}(\mathbf{x}^{d})$ , $f^{c}(\mathbf{x}^{c})={\mathbf{w}^{c}}^{\top}\boldsymbol{\phi}^{c}(\mathbf{x}^{c})$ and $f^{m}(\mathbf{x}^{d},\mathbf{x}^{c})={\mathbf{w}^{m}}^{\top}\boldsymbol{\phi}^{m}(\mathbf{x}^{d},\mathbf{x}^{c})$ , respectively, thus resulting in the representation $f(\mathbf{x})=f^{d}(\mathbf{x}^{d})+f^{c}(\mathbf{x}^{c})+f^{m}(\mathbf{x}^{d},\mathbf{x}^{c})$ . First, we note that the discrete $f^{d}(\mathbf{x}^{d})$ and continuous $f^{c}(\mathbf{x}^{c})$ parts of Eq. (2) already represent easy to solve subproblems (as we assume to have access to an optimization oracle). It thus remains to discuss the mixed part $f^{m}(\mathbf{x}^{d},\mathbf{x}^{c})$ . As $f^{m}$ is generally difficult to optimize directly, we assume that it decomposes into a sum $f^{m}(\mathbf{x})=\sum_{k=1}^{|F|}f^{m}_{k}(\mathbf{x}^{d}_{k},\mathbf{x}^{c}_{k})$ of so-called factors $f^{m}_{k}:\mathcal{X}^{d}_{k}\times\mathcal{X}^{c}_{k}\rightarrow\mathbb{R}$ , where $\mathbf{x}^{d}_{k}\in\mathcal{X}^{d}_{k}$ and $\mathbf{x}^{c}_{k}\in\mathcal{X}^{c}_{k}$ respectively denote subvectors of $\mathbf{x}^{d}$ and $\mathbf{x}^{c}$ from the (typically low-dimensional) subspaces $\mathcal{X}^{d}_{k}\subseteq\mathcal{X}^{d}$ and $\mathcal{X}^{c}_{k}\subseteq\mathcal{X}^{c}$ . Here, $F$ denotes a set of subsets $k\in F$ of the variables. Given this formulation, the initial problem then reduces999Refer to Appendix H for a detailed derivation. to the minimization of the dual function $L(\boldsymbol{\lambda})$ w.r.t. Lagrange multipliers $\boldsymbol{\lambda}$ , i.e., the master problem $\min_{\boldsymbol{\lambda}}L(\boldsymbol{\lambda})$ , with dual function $L(\boldsymbol{\lambda})=\max_{\mathbf{x}^{d}}\big{\{}f^{d}(\mathbf{x}^{d})+\sum_{k\in F}\boldsymbol{\lambda}^{d}_{k}\mathbf{x}^{d}_{|k}\big{\}}+\max_{\mathbf{x}^{c}}\big{\{}f^{c}(\mathbf{x}^{c})+\sum_{k\in F}\boldsymbol{\lambda}^{c}_{k}\mathbf{x}^{c}_{|k}\big{\}}+\sum_{k\in F}\max_{\mathbf{x}^{d}_{k},\mathbf{x}^{c}_{k}}\{f^{m}_{k}(\mathbf{x}^{d}_{k},\mathbf{x}^{c}_{k})-\boldsymbol{\lambda}^{d}_{k}\mathbf{x}_{k}^{d}-\boldsymbol{\lambda}^{c}_{k}\mathbf{x}_{k}^{c}\}$ . Here, the master problem coordinates the $2+|F|$ maximization subproblems, where $\mathbf{x}^{d}_{|k}$ and $\mathbf{x}^{c}_{|k}$ respectively denote the subvectors of $\mathbf{x}^{d}$ and $\mathbf{x}^{c}$ containing only the variables of factor $k\in F$ , $\boldsymbol{\lambda}_{k}^{d}$ and $\boldsymbol{\lambda}_{k}^{c}$ are their corresponding Lagrange multipliers. Intuitively, by updating the dual variables $\boldsymbol{\lambda}$ , the master problem ensures agreement on the involved variables between the discrete and continuous subproblems and the mixed factors. Importantly, the dual function $L(\boldsymbol{\lambda})$ only involves independent maximization over local assignments of $\mathbf{x}^{d},\mathbf{x}^{c}$ and $\mathbf{x}^{d}_{k},\mathbf{x}^{c}_{k}$ , which are assumed to be tractable. There are two main classes of algorithms used for the maximization, namely subgradient methods and block coordinate descent sontag2011.

Appendix H Derivation of Dual Decomposition

One interesting interpretation of our acquisition function optimization problem as defined in Section 3.3 is as maximum a posteriori (MAP) inference in the undirected graphical model, or Markov random field (MRF) koller2009, induced by the dependency graph of the involved variables (i.e. the graph in which vertices correspond to variables, and edges appear between variables that interact in some way). We take this perspective and devise a dual decomposition to tackle the MAP estimation problem induced by our particular setting (i.e., interpreting our acquisition function as the energy function of the graphical model), following the formulation of sontag2011.101010In accordance with the notation in sontag2011, we will here denote the factors by $\theta$ instead of $f$ (i.e., in contrast to the main text).

Consider a graphical model on the vertex set $\mathcal{V}=V_{d}\cup V_{c}$ , where the vertices $V_{d}=\{1,\ldots,D_{d}\}$ and $V_{c}=\{D_{d}+1,\ldots,D_{d}+D_{c}\}$ correspond to the discrete and continuous variables $\mathbf{x}^{d}\in\mathcal{X}^{d}$ and $\mathbf{x}^{c}\in\mathcal{X}^{c}$ , respectively. Furthermore, consider a set $F$ of subsets of both discrete and continuous variables/vertices, i.e., $\forall f\in F:f=(f^{d}\cup f^{c})\subseteq V,\emptyset\neq f^{d}\subseteq V_{d},\emptyset\neq f^{c}\subseteq V_{c}$ , where each subset corresponds to the domain of one of the factors.

Now assume that we are given the following functions on these factors as well as on all discrete/continuous variables:

•

A factor $\theta^{d}(\mathbf{x}^{d}),\theta^{d}:\mathcal{X}^{d}\rightarrow\mathbb{R}$ on all discrete variables

•

A factor $\theta^{c}(\mathbf{x}^{c}),\theta^{c}:\mathcal{X}^{c}\rightarrow\mathbb{R}$ on all continuous variables

•

$|F|$ mixed factors $\theta^{m}_{f}(\mathbf{x}^{d}_{f},\mathbf{x}^{c}_{f}),\theta^{m}_{f}:\mathcal{X}^{d}_{f}\times\mathcal{X}^{c}_{f}\rightarrow\mathbb{R}$ on subsets $f\in F$ of both discrete and continuous variables, where $\mathbf{x}^{d}_{f}\in\mathcal{X}^{d}_{f}$ and $\mathbf{x}^{c}_{f}\in\mathcal{X}^{c}_{f}$ respectively denote subvectors of $\mathbf{x}^{d}$ and $\mathbf{x}^{c}$ from the (typically low-dimensional) subspaces $\mathcal{X}^{d}_{f}\subseteq\mathcal{X}^{d}$ and $\mathcal{X}^{c}_{f}\subseteq\mathcal{X}^{c}$ , indexed by the vertices contained in $f$

The goal of our MAP problem is to find and assignment to all variables $\mathbf{x}^{d}$ and $\mathbf{x}^{c}$ which maximizes the sum of the factors:

[TABLE]

We now slightly reformulate this problem by duplicating the variables $x^{d}_{i}$ and $x^{c}_{j}$ , once for each mixed factor $\theta^{m}_{f}(\mathbf{x}^{d}_{f},\mathbf{x}^{c}_{f})$ , and then enforce that these variables are equal to the ones appearing in the factors $\theta^{d}(\mathbf{x}^{d})$ and $\theta^{c}(\mathbf{x}^{c})$ , respectively. Let $x^{df}_{i}$ and $x^{cf}_{j}$ respectively denote the copy of $x^{d}_{i}$ and $x^{c}_{j}$ used by factor $f$ . Moreover, denote by $\mathbf{x}^{df}_{f}=\{x^{df}_{i}\}_{i\in f^{d}}$ and $\mathbf{x}^{cf}_{f}=\{x^{cf}_{j}\}_{j\in f^{c}}$ the set of variables used by factor $f$ , and by $\mathbf{x}^{F}=\{\mathbf{x}^{df}_{f},\mathbf{x}^{cf}_{f}\}_{f\in F}$ the set of all variable copies. We then get the equivalent (but now constrained) optimization problem

[TABLE]

To remove the coupling constraints, sontag2011 now propose to use the technique of Lagrangian relaxation and introduce a Lagrange multiplier / dual variable $\lambda_{fi}(x_{i})$ for every choice of $f\in F$ , $i\in f$ and $x_{i}$ (i.e. for every factor, for every variable in that factor, and for every value of that variable). These multipliers may then be interpreted as the message that factor $f$ sends to variable $i$ about its state $x_{i}$ .

While this works well if all variables are discrete, in our model we also have continuous variables $x^{c}_{j}$ , and it is clearly not possible to have a Lagrange multiplier for every possible value of $x^{c}_{j}$ . To mitigate this issue, we follow komodakis2011 and instead only introduce a multiplier $\lambda_{fi}$ for every choice of $f\in F$ and $i\in f$ , and model the interaction with the variables as $\lambda_{fi}(x_{i})=\lambda_{fi}x_{i}$ (i.e., the product of a multiplier $\lambda_{fi}$ and variable $x_{i}$ ). Observe that since our goal is to relax the coupling constraints, it is sufficient to introduce one multiplier per constraint. Since we have a constraint for every factor $f\in F$ and every discrete variable $i\in f^{d}$ and continuous variable $j\in f^{c}$ in that factor, our approach is clearly viable.

Note that in contrast to komodakis2011, that introduces a set of multipliers for every factor / subgraph, we only introduce multipliers for the mixed factors $f\in F$ . This is because in contrast to komodakis2011, we do not introduce a full set of variable copies for every factor and then couple them to another global set of ”original” variables, but we instead only introduce variable copies for the mixed factors and couple them to the variables appearing in the discrete and continuous factors, which we assume to be the ”original” variables instead. This essentially is the same approach used in sontag2011, with the difference that sontag2011 introduce a singleton factor for each variable (i.e., a factor which depends only on a single variable), which they consider to be the ”original” variable. They then simply couple the variable copies appearing in the higher-order factors to the ”original” variables appearing in the singleton factors. In contrast, in our formulation we don’t introduce singleton factors to model the ”original” variables, but instead use the fully discrete and continuous factors for this purpose, which clearly works equally well. Note that as a result of this modeling choice, our optimization problem will be unconstrained, regardless of the number of factors, similar as in sontag2011. In contrast, komodakis2011 end up with constraints enforcing that some of the dual variables sum to zero, since they are optimizing out the global set of ”original” variables from their objective, while we keep the set of ”original” variables within our discrete and continuous factors. For this reason, we will in contrast to komodakis2011 later not require a projection step within the subgradient method used to optimize the dual; this is to be detailed further below.

For clarity, we treat discrete and continuous variables distinctly and for factor $f\in F$ denote $\lambda^{d}_{fi}$ and $\lambda^{c}_{fj}$ respectively for the Lagrange multipliers corresponding to its discrete variables $i\in f^{d}$ (or rather, the constraints $x^{df}_{i}=x^{d}_{i}$ ) and its continuous variables $j\in f^{c}$ (or rather, the constraints $x^{cf}_{j}=x^{c}_{j}$ ). For every factor $f\in F$ , we furthermore aggregate its multipliers into the vectors $\boldsymbol{\lambda}_{f}^{d}=\{\lambda^{d}_{fi}\}_{i\in f^{d}}\in\mathbb{R}^{|\mathcal{X}^{d}_{f}|}$ and $\boldsymbol{\lambda}_{f}^{c}=\{\lambda^{c}_{fj}\}_{j\in f^{c}}\in\mathbb{R}^{|\mathcal{X}^{c}_{f}|}$ . The set of all Lagrange multipliers is thus $\boldsymbol{\lambda}=\{\lambda^{d}_{fi}:f\in F,i\in f^{d}\}\cup\{\lambda^{c}_{fj}:f\in F,j\in f^{c}\}=\{\boldsymbol{\lambda}^{d}_{f},\boldsymbol{\lambda}^{c}_{f}\}_{f\in F}$ . We then define the Lagrangian

[TABLE]

This results in the following optimization problem:

[TABLE]

Note that the problem in Eq. (11) is still equivalent to our (hard) original problem in Eq. (9) for any assignment of $\boldsymbol{\lambda}$ , since the Lagrange multipliers cancel out if all coupling constraints are fulfilled.

To obtain a tractable problem, we thus simply omit the coupling constraints in Eq. (11) and define the dual function $L(\boldsymbol{\lambda})$ as

[TABLE]

Note that the maximizations are now fully independent, such that we can (without introducing any ambiguity) simplify the notation for the variables involved in the mixed terms to denote $\mathbf{x}^{d}_{f}$ and $\mathbf{x}^{c}_{f}$ instead of $\mathbf{x}^{df}_{f}$ and $\mathbf{x}^{cf}_{f}$ , respectively111111I.e., we replace all variable copies $\mathbf{x}^{df}_{f},\mathbf{x}^{cf}_{f}$ in the mixed terms by the ”original” variables $\mathbf{x}^{d}_{f},\mathbf{x}^{c}_{f}$ ., resulting in the slightly simpler dual formulation

[TABLE]

Let $\mathbf{x}^{d}_{|f}\in\mathcal{X}^{d}_{f}$ and $\mathbf{x}^{c}_{|f}\in\mathcal{X}^{c}_{f}$ respectively denote the subvectors of $\mathbf{x}^{d}$ and $\mathbf{x}^{c}$ containing only the variables of factor $f$ . The shorthands (or reparameterizations sontag2011)

[TABLE]

further simplify the dual function $L(\boldsymbol{\lambda})$ to

[TABLE]

First, observe that since we maximize over $\mathbf{x}$ and $\mathbf{x}^{F}$ , the dual function $L(\boldsymbol{\lambda})$ is a function of just the Lagrange multipliers $\boldsymbol{\lambda}$ . Note that since $L(\boldsymbol{\lambda})$ maximizes over a larger space (since instead of forcing that there must be one global assignment maximizing the objective, we allow the discrete/continuous potentials to be maximized independently of the mixed potentials, meaning that $\mathbf{x}$ may not coincide with $\mathbf{x}^{F}$ ), we have for all $\boldsymbol{\lambda}$ that

[TABLE]

The dual problem now is to find the tightest upper bound by optimizing the Lagrange multipliers, i.e.

[TABLE]

We also call the dual problem in Eq. (17) the master problem, which coordinates the $2+|F|$ slave problems (i.e., one for each factor)

[TABLE]

where we refer to $s^{d}$ , $s^{c}$ and $s^{f}$ as the discrete slave, the continuous slave, and the mixed slaves, respectively. Using the notation in Eqs. (18a)-(18c), the dual function further simplifies to

[TABLE]

Intuitively, the goal of Eq. (17) is as follows: The master problem wants the discrete/continuous slaves to agree with the mixed slaves/factors in which the corresponding discrete/continuous variables appear, and conversely, it wants the mixed slaves to agree with the slaves/factors of the discrete/continuous variables in its scope. The master problem will thus incentivize the discrete/continuous slaves and the mixed slaves to agree with each other, which is done by updating the dual variables $\boldsymbol{\lambda}$ accordingly.

The key property of the function $L(\boldsymbol{\lambda})$ is that it only involves maximization over local assignments of $\mathbf{x}^{d},\mathbf{x}^{c}$ and $\mathbf{x}^{d}_{f},\mathbf{x}^{c}_{f}$ , which are tasks we assume to be tractable. The dual thus decouples the original problem, resulting in a problem that can be optimized using local operations. Algorithms that minimize the approximate objective $L(\boldsymbol{\lambda})$ use local updates where each iteration of the algorithms repeatedly finds a maximizing assignment for the subproblems individually, using these to update the dual variables $\boldsymbol{\lambda}$ that glue the subproblems together. There are two main classes of algorithms of this kind, one based on a subgradient method and another based on block coordinate descent sontag2011.

TL;DR

Contribution

Findings

Abstract

Peer Reviews

Videos

Mixed-Variable Bayesian Optimization

Abstract

1 Introduction

Related Work.

Contributions.

2 Problem Statement

Background.

3 MiVaBo Algorithm

3.1 Model

Discrete Features ϕd\boldsymbol{\phi}^{d}ϕd.

Continuous Features ϕc\boldsymbol{\phi}^{c}ϕc.

Mixed Features ϕm\boldsymbol{\phi}^{m}ϕm.

3.2 Model Inference

3.3 Acquisition Function

3.4 Model Discussion

3.5 Convergence Analysis

Proposition 1**.**

4 Experiments

Experimental Setup.

4.1 Gradient Boosting Tuning

4.2 Deep Generative Model (DGM) Tuning

5 Conclusion

Acknowledgements

References

Appendix A Sparse Linear Model via Sparsity-Encouraging Prior

Appendix B Pseudocode for Thompson Sampling

Appendix C Further Experimental Results

C.1 Synthetic Benchmark for Unconstrained Optimization

C.2 Synthetic Benchmark for Constrained Optimization

Appendix D More Details on XGBoost Hyperparameter Tuning Task

Appendix E More Details on VAE Hyperparameter Tuning Task

E.1 Hyperparameters of VAE

E.2 Description of Constraints

Encoder constraints.

Decoder constraints.

E.3 Effect of Different Constraint Violation Penalty Values

E.4 Visualization of Reconstruction Quality

Appendix F Discussion and Details on Baselines in Empirical Evaluation

Appendix G Acquisition Function Optimization with Theoretical Guarantees via Dual Decomposition

Appendix H Derivation of Dual Decomposition

References

Discrete Features $\boldsymbol{\phi}^{d}$ .

Continuous Features $\boldsymbol{\phi}^{c}$ .

Mixed Features $\boldsymbol{\phi}^{m}$ .

Proposition 1.