Towards Practical Lipschitz Bandits

Tianyu Wang; Weicheng Ye; Dawei Geng; Cynthia Rudin

arXiv:1901.09277·stat.ML·January 25, 2021

TL;DR

This paper introduces a flexible framework for Lipschitz bandit algorithms that adaptively partitions the space to optimize rewards efficiently, demonstrating state-of-the-art results in real-world tasks like hyperparameter tuning.

Contribution

We develop a novel adaptive partitioning framework for Lipschitz bandits, linking tree-based methods to Gaussian processes and proposing a hierarchical Bayesian model.

Findings

01

Achieves state-of-the-art performance in neural network hyperparameter tuning.

02

Effectively balances exploration and exploitation through adaptive space partitioning.

03

Demonstrates improved regret minimization in real-world applications.

Abstract

Stochastic Lipschitz bandit algorithms balance exploration and exploitation, and have been used for a variety of important task domains. In this paper, we present a framework for Lipschitz bandit methods that adaptively learns partitions of context- and arm-space. Due to this flexibility, the algorithm is able to efficiently optimize rewards and minimize regret, by focusing on the portions of the space that are most relevant. In our analysis, we link tree-based methods to Gaussian processes. In light of our analysis, we design a novel hierarchical Bayesian model for Lipschitz bandit problems. Our experiments show that our algorithms can achieve state-of-the-art performance in challenging real-world tasks such as neural network hyperparameter tuning.

Tables

Table 1. (a)

Layer	Hyperparameters	values
Conv1	conv1-kernel-size	*
	conv1-number-of-channels	200
	conv1-stride-size	(1,1)
MaxPooling1	pooling1-size	(3,3)
MaxPooling1	pooling1-stride	(1,1)
Conv2	conv2-kernel-size	*
	conv2-number-of-channels	200
	conv2-stride-size	(1,1)
MaxPooling2	pooling2-size	(3,3)
MaxPooling2	pooling2-stride	(2,2)
Conv3	conv3-kernel-size	(3,3)
	conv3-number-of-channels	200
	conv3-stride-size	(1,1)
AvgPooling3	pooling3-size	(3,3)
AvgPooling3	pooling3-stride	(1,1)
Dense	batch-normalization	default
	number-of-hidden-units	512
	dropout-rate	0.5

Table 3. (b)

Hyperparameters	Range
conv1-kernel-size	$\{1, 2, \dots, 7\}$
conv2-kernel-size	$\{1, 2, \dots, 7\}$
$\beta_{1}$ & $\beta_{2}$	$\{0, 0.05, \dots, 1\}$
learning-rate	1e-6 to 5
training-iteration	$\{300, 400, \dots, 1500\}$

Table 4. (a)

Layer	Hyperparameters	values
Conv1	conv1-kernel-size	*
	conv1-no.-of-channels	200
	conv1-stride-size	(1,1)
MaxPooling1	pooling1-size	*
MaxPooling1	pooling1-stride	(1,1)
Conv2	conv2-kernel-size	*
	conv2-no.-of-channels	200
	conv2-stride-size	(1,1)
MaxPooling2	pooling2-size	*
MaxPooling2	pooling2-stride	(2,2)
Conv3	conv3-kernel-size	*
	conv3-no.-of-channels	200
	conv3-stride-size	(1,1)
AvgPooling3	pooling3-size	*
	pooling3-stride	(1,1)
	pooling3-padding	“same”
Dense	batch-normalization	default
	no.-of-hidden-units	512
	dropout-rate	0.5

Table 6. (b)

Hyperparameters	Range
conv1-kernel-size	$\{1, 2, \dots, 7\}$
conv2-kernel-size	$\{1, 2, \dots, 7\}$
conv3-kernel-size	$\{1, 2, 3\}$
pooling1-size & pooling2-size	$\{1, 2, 3\}$
pooling3-size	$\{1, 2, \dots, 6\}$
$\beta_{1}$ & $\beta_{2}$	$\{0, 0.05, \dots, 1\}$
learning-rate	1e-6 to 5
learning-rate-reduction	$\{1, 2, 3\}$
training-iteration	$\{200, 400, \dots, 3000\}$

Equations

$$R_{T}(\mathrm{Alg})=\sum_{t=1}^{T}\left(f(a^{*})-f(a_{t})\right),$$

$$n_{t,t^{\prime}}^{0}(x)=\sum_{i=1}^{t^{\prime}}\mathbb{I}\left[x_{i}\in p_{t}(x)\right].$$

$$m_{t,t^{\prime}}(a)=\begin{cases}\dfrac{\sum_{i=1}^{t^{\prime}}y_{i}\,\mathbb{I}\left[a_{i}\in p_{t}(a)\right]}{n_{t,t^{\prime}}^{0}(a)}, & \text{if } n_{t,t^{\prime}}^{0}(a)>0;\\ 1, & \text{otherwise.}\end{cases}$$

$$n_{t,t^{\prime}}(x)=\max\left(1,n_{t,t^{\prime}}^{0}(x)\right).$$

$$U_{t}(a)=m_{t-1}(a)+C\sqrt{\frac{4\log t}{n_{t-1}(a)}}+M\cdot D(p_{t}(a)),$$

$$a_{t}\in\arg\max_{a\in\mathcal{A}}\left\{U_{t}(a)\right\},$$

$$D(p_{t}(a)):=\sup_{a^{\prime},a^{\prime\prime}\in p_{t}(a)}d(a^{\prime},a^{\prime\prime})$$

$$\lim_{T\rightarrow\infty}\frac{R_{T}(\mathrm{TUCB})}{T}=0$$

$$\left|m_{t-1}(a)-f(a)\right|\leq L\cdot D(p_{t-1}(a))+C\sqrt{\frac{4\log t}{n_{t-1}(a)}}$$

$$f(a^{*})-f(a_{t})\leq 2L\cdot D(p_{t-1}(a_{t}))+2C\sqrt{\frac{4\log t}{n_{t-1}(a_{t})}}$$

$$\sum_{t=1}^{T}\frac{1}{n_{t-1}(a_{t})}\leq e\,|\mathcal{P}_{T}|\log\left(1+(e-1)\frac{T}{|\mathcal{P}_{T}|}\right),$$

$$\sum_{t=1}^{T}\frac{1}{1+n_{t-1}^{0}(a_{t})}\leq|\mathcal{P}_{T}|\left(1+\log\frac{T}{|\mathcal{P}_{T}|}\right),$$

$$\sum_{t=1}^{T}\left(\frac{1}{1+n_{t-1}^{0}(a_{t})}\right)^{\alpha}\leq\frac{1}{1-\alpha}|\mathcal{P}_{T}|^{\alpha}T^{1-\alpha},\quad 0<\alpha<1,$$

$$k_{T}(a,a^{\prime})=\begin{cases}1, & \text{if } p_{T}(a)=p_{T}(a^{\prime}),\\ 0, & \text{otherwise.}\end{cases}$$

$$\sigma_{T,t}^{2}(a)=k_{T}(a,a)-\bm{k}_{a}^{\top}(K+s_{T}^{2}I)^{-1}\bm{k}_{a},$$

$$\sigma_{T,t}^{2}(a)=1-\bm{1}_{a}^{\top}\left[\bm{1}_{a}\bm{1}_{a}^{\top}+s_{T}^{2}I\right]^{-1}\bm{1}_{a},$$

$$\sigma_{T,t}^{2}(a)=\frac{1}{1+s_{T}^{-2}n_{T,t}^{0}(a)}.$$

$$H(\tilde{\bm{y}}_{t})=\frac{1}{2}\log\left[(2\pi e)^{t}\det(K+s_{T}^{2}I)\right]$$

$$\begin{aligned}H(\tilde{\bm{y}}_{t})&=H(\tilde{y}_{t}\mid a_{t},\tilde{\bm{y}}_{t-1},\bm{a}_{t-1})+H(\tilde{\bm{y}}_{t-1})\\ &=\frac{1}{2}\log\left(2\pi e\left(s_{T}^{2}+\sigma_{T,t-1}^{2}(a_{t})\right)\right)+H(\tilde{\bm{y}}_{t-1})\\ &=\frac{1}{2}\sum_{\tau=1}^{t}\log\left(2\pi e\left(s_{T}^{2}+\sigma_{T,\tau-1}^{2}(a_{\tau})\right)\right),\end{aligned}$$

$$\sum_{\tau=1}^{t}\log\left(1+s_{T}^{-2}\sigma_{T,\tau-1}^{2}(a_{\tau})\right)=\log\left[\det(s_{T}^{-2}K+I)\right].$$

$$\det(s_{T}^{-2}K+I)=\prod_{i=1}^{B^{\prime}}(1+s_{T}^{-2}n_{i})\leq\left(1+\frac{s_{T}^{-2}t}{B^{\prime}}\right)^{B^{\prime}},$$

$$\det(s_{T}^{-2}K+I)\leq\left(1+\frac{s_{T}^{-2}t}{B^{\prime}}\right)^{B^{\prime}}\leq\left(1+\frac{s_{T}^{-2}t}{|\mathcal{P}_{t}|}\right)^{|\mathcal{P}_{t}|}.$$

$$\sum_{\tau=1}^{T}\log\left(1+s_{T}^{-2}\sigma_{T,\tau-1}^{2}(a_{\tau})\right)\leq|\mathcal{P}_{T}|\log\left(1+\frac{s_{T}^{-2}T}{|\mathcal{P}_{T}|}\right),$$

$$\sigma_{T,t}^{2}(a)\leq\frac{1}{\log(1+s_{T}^{-2})}\log\left(1+s_{T}^{-2}\sigma_{T,t}^{2}(a)\right)$$

$$\begin{aligned}\sum_{t=1}^{T}\frac{1}{n_{t-1}(a_{t})}&\leq\sum_{t=1}^{T}\frac{1+s_{T}^{-2}}{1+s_{T}^{-2}n_{t-1}(a_{t})}\\ &\leq\sum_{t=1}^{T}\frac{1+s_{T}^{-2}}{1+s_{T}^{-2}n_{T,t-1}^{0}(a_{t})}\leq(1+s_{T}^{-2})\sum_{t=1}^{T}\sigma_{T,t-1}^{2}(a_{t})\\ &\leq\frac{1+s_{T}^{-2}}{\log(1+s_{T}^{-2})}\sum_{t=1}^{T}\log\left(1+s_{T}^{-2}\sigma_{T,t-1}^{2}(a_{t})\right)\\ &\leq\frac{1+s_{T}^{-2}}{\log(1+s_{T}^{-2})}|\mathcal{P}_{T}|\log\left(1+s_{T}^{-2}\frac{T}{|\mathcal{P}_{T}|}\right),\end{aligned}$$

$$\sum_{t=1}^{T}\frac{1}{1+n_{t-1}^{0}(x_{t})}\leq\sum_{j=1}^{|\mathcal{P}_{T}|}(1+\log b_{j})=|\mathcal{P}_{T}|+\sum_{j=1}^{|\mathcal{P}_{T}|}\log b_{j}$$

Full text

Towards Practical Lipschitz Bandits

Tianyu Wang, Weicheng Ye, Dawei Geng, Cynthia Rudin

1 Introduction

Stochastic Lipschitz bandit algorithms are methods that balance exploration-exploitation tradeoffs. Their usage arises in important real-world scenarios. For example, in medical trials, a doctor might deliver a sequence of treatment options with the goal of achieving the best total treatment effect, or with the goal of allocating the best treatment option as efficiently as possible, without conducting too many trials.

A stochastic bandit problem assumes that payoffs are noisy and are drawn from an unchanging distribution. The study of stochastic bandit problems started with the discrete arm setting, where the agent is faced with a finite set of choices. Classic works on this problem include Thompson sampling (Thompson, 1933; Agrawal and Goyal, 2012), the Gittins index (Gittins, 1979), $\epsilon$-greedy strategies (Sutton and Barto, 1998), and upper confidence bound (UCB) methods (Lai and Robbins, 1985; Auer et al., 2002). One recent line of work on stochastic bandit problems considers the case where the arm space is infinite. In this setting, the arms are usually assumed to be in a subset of the Euclidean space (or a more general metric space), and the expected payoff function is assumed to be a function of the arms. Some works along this line model the expected payoff as a linear function of the arms (Auer, 2002; Dani et al., 2008; Li et al., 2010; Abbasi-yadkori et al., 2011; Agrawal and Goyal, 2013); some algorithms model the expected payoff as Gaussian processes over the arms (Srinivas et al., 2010; Contal et al., 2014; de Freitas et al., 2012); some algorithms assume that the expected payoff is a Lipschitz function of the arms (Slivkins, 2014; Kleinberg et al., 2008; Bubeck et al., 2011; Magureanu et al., 2014); and some assume locally Hölder payoffs on the real line (Auer et al., 2007). When the arms are continuous and equipped with a metric, and the expected payoff is Lipschitz continuous in the arm space, we refer to the problem as a stochastic Lipschitz bandit problem. In addition, when the agent’s decisions are made with the aid of contextual information, we refer to the problem as a contextual stochastic Lipschitz bandit problem. Few works (Bubeck et al., 2011; Kleinberg et al., 2008; Magureanu et al., 2014) have considered the general Lipschitz bandit problem without making strong assumptions on the smoothness of rewards in context-arm space. In this paper, we focus our study on this general (contextual) stochastic Lipschitz bandit problem, and provide practical algorithms for use in data science applications.

Specifically, we propose a framework that converts a general decision tree algorithm into an algorithm for stochastic Lipschitz bandit problems. We use a novel analysis that links our algorithms to Gaussian processes, though the underlying rewards do not need to be generated by any Gaussian process. Based on this connection, we can use a novel hierarchical Bayesian model to design a new (UCB) index. This new index solves two main problems suffered by partition-based bandit algorithms, namely: (1) within each bin of the partition, all arms are treated the same; (2) disjoint bins do not use information from each other.

Empirically, we show that using adaptively learned partitions, Lipschitz bandit algorithms can be used for hard real-world problems such as hyperparameter tuning for neural networks.

**Relation to prior work:** One general way of solving stochastic Lipschitz bandit problems is to finely discretize (partition) the arm space and treat the problem as a finite-arm problem. An Upper Confidence Bound (UCB) strategy can thus be used. Previous algorithms of this kind include the UniformMesh algorithm (Kleinberg et al., 2008), the HOO algorithm (Bubeck et al., 2011), and the (contextual) Zooming Bandit algorithm (Kleinberg et al., 2008; Slivkins, 2014). While all these algorithms employ different analysis techniques, we show that as long as a discretization of the arm space fulfills certain requirements (outlined in Theorem 1), these algorithms (or a possibly modified version) can be analyzed in a unified framework.

The practical problem with previous methods is that they require either a fine discretization of the full arm space or restrictive control of the partition formation (e.g., the Zooming rule (Kleinberg et al., 2008)), leading to implementations that are not flexible. By fitting decision trees that are grown adaptively during the run of the algorithm, our partition can be learned from data. This advantage enables the algorithm to outperform leading methods for Lipschitz bandits (e.g., Bubeck et al., 2011; Kleinberg et al., 2008) and for zeroth-order optimization (e.g., Martinez-Cantin, 2014; Li et al., 2016) on hard real-world problems that can involve difficult arm spaces and reward landscapes. As shown in the experiments, in neural network hyperparameter tuning, our methods can outperform the state-of-the-art benchmark packages that are tailored for hyperparameter selection.

In summary, our contributions are: 1) We develop a novel stochastic Lipschitz bandit framework, TreeUCB, and its contextual counterpart, Contextual TreeUCB. Our framework converts a general decision tree algorithm into a stochastic Lipschitz bandit algorithm. Algorithms arising from this framework empirically outperform benchmark methods. 2) We develop a new analysis framework, which can be used to recover previously known bounds and to design a new principled acquisition function for bandits and zeroth-order optimization.

2 Main results

2.1 The TreeUCB framework

Stochastic bandit algorithms, in an online fashion, explore the decision space while exploiting seemingly good options. The performance of the algorithm is typically measured by regret. In this paper, we focus our study on the following setting. A payoff function is defined over an arm space that is a compact doubling metric space $(\mathcal{A},d)$, the payoff function of interest is $f:\mathcal{A}\rightarrow[0,1]$, and the actual observations are given by $y(a)=f(a)+\epsilon_{a}$. In our setting, the noise distribution $\epsilon_{a}$ could vary with $a$, as long as it is uniformly mean zero, almost surely bounded, and independent of $f$ for every $a$. Our results easily generalize to sub-Gaussian noise (Shamir, 2011). In the analysis, we assume that the (expected) payoff function $f$ is Lipschitz in the sense that $\forall a,a^{\prime}\in\mathcal{A}$, $\left|f(a)-f(a^{\prime})\right|\leq Ld(a,a^{\prime})$ for some Lipschitz constant $L$. An agent interacts with this environment in the following fashion. At each round $t$, based on past observations $(a_{1},y_{1},\cdots,a_{t-1},y_{t-1})$, the agent makes a query at point $a_{t}$ and observes the (noisy) payoff $y_{t}$, where $y_{t}$ is revealed only after the agent has made a decision $a_{t}$. For an agent executing algorithm Alg, the regret incurred up to time $T$ is defined to be:

$$R_{T}(\mathrm{Alg})=\sum_{t=1}^{T}\left(f(a^{*})-f(a_{t})\right),\tag{1}$$

where $a^{*}$ is the global maximizer of $f$ .
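As a small illustration of the regret definition, the sketch below sums the per-round gaps $f(a^{*})-f(a_{t})$ for a played sequence of arms. The function name `cumulative_regret` and the toy payoff are assumptions made for this example; in a real bandit run $f$ and $a^{*}$ are unknown to the agent, so this quantity can only be computed in simulation.

```python
def cumulative_regret(f, a_star, arms):
    """Regret R_T = sum_t (f(a*) - f(a_t)) for a known payoff f (simulation only)."""
    best = f(a_star)
    return sum(best - f(a) for a in arms)


def toy_payoff(a):
    """Hypothetical 1-Lipschitz payoff on [0, 1], maximized at a = 0.3."""
    return 1.0 - abs(a - 0.3)


regret = cumulative_regret(toy_payoff, 0.3, [0.3, 0.5, 0.1])
print(regret)
```

Playing the maximizer contributes zero regret, while each of the other two arms contributes a gap of about 0.2.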

Any TreeUCB algorithm runs by maintaining a sequence of finite partitions of the arm space. Intuitively, at each step $t$ , TreeUCB treats the problem as a finite-arm bandit problem with respect to the partition bins at $t$ , and chooses an arm uniformly at random within the chosen bin. The partition bins become smaller and smaller as the algorithm runs. Thus, at any time $t$ , we maintain a partition $\mathcal{P}_{t}=\left\{P_{t}^{(1)},\cdots,P_{t}^{(k_{t})}\right\}$ of the input space. That is, $P_{t}^{(1)},\cdots,P_{t}^{(k_{t})}$ are subsets of $\mathcal{A}$ , are mutually disjoint and $\cup_{i=1}^{k_{t}}P_{t}^{(i)}=\mathcal{A}$ .

As an example, Figure 1 shows a partitioning of the input space, with the underlying reward function shown by color gradient. In an algorithm run, we collect data and estimate the reward with respect to the partition. Based on the estimate, we select a “box” to play next.

Each element in the partition is called a region and by convention $\mathcal{P}_{0}=\{\mathcal{A}\}$ . The regions could be leaves in a tree, or chosen in some other way.

Given any $t$ , if for any $P^{(i)}\in\mathcal{P}_{t+1}$ , there exists $P^{(j)}\in\mathcal{P}_{t}$ such that $P^{(i)}\subset P^{(j)}$ , we say that $\{\mathcal{P}_{t}\}_{t\geq 0}$ is a sequence of nested partitions. In words, at round $t$ , some regions (or no regions) of the partition are split into multiple regions to form the partition at round $t+1$ . We also say that the partition grows finer.

Based on the partition $\mathcal{P}_{t}$ at time $t$ , we define an auxiliary function – the Region Selection function.

Definition 1 (Region Selection Function).

Given partition $\mathcal{P}_{t}$ , function $p_{t}:\mathcal{A}\rightarrow\mathcal{P}_{t}$ is called a Region Selection Function with respect to $\mathcal{P}_{t}$ if for any $a\in\mathcal{A}$ , $p_{t}(a)$ is the region in $\mathcal{P}_{t}$ containing $a$ .

As the name TreeUCB suggests, our framework follows an Upper Confidence Bound (UCB) strategy. In order to define our Upper Confidence Bound, we require several definitions.

Definition 2.

*Let $\mathcal{P}_{t}$ be the partition of $\mathcal{A}$ at time $t$ ( $t\geq 1$ ) and let $p_{t}$ be the Region Selection Function associated with $\mathcal{P}_{t}$ . Let $(a_{1},y_{1},a_{2},y_{2},\cdots,a_{t^{\prime}},y_{t^{\prime}})$ be the observations received up to time $t^{\prime}$ ( $t^{\prime}\geq 1$ ). We define

$\bullet$ the count function $n_{t,t^{\prime}}^{0}:\mathcal{A}\rightarrow\mathbb{R}$ , such that*

$$n_{t,t^{\prime}}^{0}(x)=\sum_{i=1}^{t^{\prime}}\mathbb{I}\left[x_{i}\in p_{t}(x)\right].$$

$\bullet$ * the corrected average function $m_{t,t^{\prime}}:\mathcal{A}\rightarrow\mathbb{R}$ , such that*

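$$m_{t,t^{\prime}}(a)=\begin{cases}\dfrac{\sum_{i=1}^{t^{\prime}}y_{i}\,\mathbb{I}\left[a_{i}\in p_{t}(a)\right]}{n_{t,t^{\prime}}^{0}(a)}, & \text{if } n_{t,t^{\prime}}^{0}(a)>0;\\ 1, & \text{otherwise.}\end{cases}$$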

$\bullet$ * the corrected count function, such that*

$$n_{t,t^{\prime}}(x)=\max\left(1,n_{t,t^{\prime}}^{0}(x)\right).$$

When $t=t^{\prime}$ , we shorten the notation from $m_{t,t^{\prime}}$ to $m_{t}$ , $n_{t,t^{\prime}}^{0}$ to $n_{t}^{0}$ , and $n_{t,t^{\prime}}$ to $n_{t}$ .

In words, $n^{0}_{t,t^{\prime}}(a)$ is the number of points among $(a_{1},a_{2},\cdots,a_{t^{\prime}})$ that are in the same region as arm $a$ , with regions as elements in $\mathcal{P}_{t}$ . We also denote by $D(\mathcal{S})$ the diameter of $\mathcal{S}\subset\mathcal{A}$ , and $D(\mathcal{S}):=\sup_{a^{\prime},a^{\prime\prime}\in\mathcal{S}}d(a^{\prime},a^{\prime\prime})$ .
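The quantities in Definition 2 can be sketched in a few lines for a one-dimensional arm space. The example below assumes the arm space is $[0,1]$ and that the partition is given by a sorted list of interval endpoints (`edges`); these representation choices and all function names are illustrative assumptions, not part of the paper.

```python
from bisect import bisect_right


def region_index(edges, a):
    """Index i of the interval [edges[i], edges[i+1]) containing a."""
    i = bisect_right(edges, a) - 1
    return min(max(i, 0), len(edges) - 2)  # a == edges[-1] goes to the last region


def count(edges, arms, a):
    """Count function n^0: past arms in the same region as a."""
    r = region_index(edges, a)
    return sum(1 for x in arms if region_index(edges, x) == r)


def corrected_count(edges, arms, a):
    """Corrected count n = max(1, n^0)."""
    return max(1, count(edges, arms, a))


def corrected_average(edges, arms, rewards, a):
    """Corrected average m: mean reward in the region of a, or 1 if it is empty."""
    r = region_index(edges, a)
    ys = [y for x, y in zip(arms, rewards) if region_index(edges, x) == r]
    return sum(ys) / len(ys) if ys else 1.0


def diameter(edges, a):
    """Diameter D of the region containing a (interval length in 1D)."""
    r = region_index(edges, a)
    return edges[r + 1] - edges[r]


edges = [0.0, 0.5, 0.75, 1.0]   # partition {[0, 0.5), [0.5, 0.75), [0.75, 1]}
arms = [0.1, 0.2, 0.6, 0.9]     # played arms a_1, ..., a_4
rewards = [0.2, 0.4, 0.7, 0.5]  # observed payoffs y_1, ..., y_4
print(count(edges, arms, 0.3), corrected_average(edges, arms, rewards, 0.3))
```

Defaulting the average to 1 on an empty region makes unexplored regions look optimistic, which matches the payoff range $[0,1]$ assumed in the setting.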

At time $t$ , based on the partition and observations, our bandit algorithm uses, for $a\in\mathcal{A}$

$$U_{t}(a)=m_{t-1}(a)+C\sqrt{\frac{4\log t}{n_{t-1}(a)}}+M\cdot D(p_{t}(a)),\tag{3}$$

for some $C$ and $M$ as the Upper Confidence Bound of arm $a$ ; and we play an arm with the highest $U_{t}$ value (with ties broken uniformly at random).

Remark 1.

As we will discuss in Section 2.3.2, the upper confidence index for our decision can take different forms other than (3).

Here $C$ depends on the almost sure bound on the reward, and $M$ depends on the Lipschitz constant of the expected reward, which are both problem intrinsics.

Since $U_{t}$ is a piece-wise constant function in the arm-space and is constant within each region, playing an arm with the highest $U_{t}$ with random tie-breaking is equivalent to selecting the best region (under UCB) and randomly selecting an arm within the region. After deciding which arm to play, we update the partition into a finer one if eligible. This strategy, TreeUCB, is summarized in Algorithm 1. We also provide a provable guarantee for TreeUCB algorithms in Theorem 1.
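The loop described above can be sketched as follows for the arm space $[0,1]$. This is a minimal illustration under stated assumptions, not the authors' Algorithm 1: the index uses a Hoeffding-type bonus $C\sqrt{4\log t/n}$, the current partition is used for all three terms of the index, and the split rule (bisect a region once it holds at least $1/\text{width}^{2}$ points) is a hypothetical stand-in for a tree fitting rule. The constants `C`, `M`, and `noise` are arbitrary choices.

```python
import math
import random


def tree_ucb(f, T, C=1.0, M=1.0, noise=0.05, seed=0):
    """Run a TreeUCB-style loop on [0, 1]; return the played arms and final regions."""
    rng = random.Random(seed)
    regions = [{"lo": 0.0, "hi": 1.0, "pts": []}]  # P_0 = {A}
    arms = []
    for t in range(1, T + 1):
        def index(r):
            n0 = len(r["pts"])
            mean = sum(y for _, y in r["pts"]) / n0 if n0 else 1.0  # corrected average
            bonus = C * math.sqrt(4.0 * math.log(t) / max(1, n0))   # uses corrected count
            return mean + bonus + M * (r["hi"] - r["lo"])           # plus diameter term

        scores = [index(r) for r in regions]
        best = max(scores)
        # The index is constant on each region, so pick a best region (random ties),
        # then play an arm uniformly at random inside it.
        region = rng.choice([r for r, s in zip(regions, scores) if s == best])
        a = rng.uniform(region["lo"], region["hi"])
        y = f(a) + rng.uniform(-noise, noise)  # bounded, mean-zero noise
        region["pts"].append((a, y))
        arms.append(a)

        # Hypothetical split rule: bisect once the region holds enough points.
        width = region["hi"] - region["lo"]
        if len(region["pts"]) >= 1.0 / width ** 2:
            mid = (region["lo"] + region["hi"]) / 2.0
            left = {"lo": region["lo"], "hi": mid,
                    "pts": [p for p in region["pts"] if p[0] < mid]}
            right = {"lo": mid, "hi": region["hi"],
                     "pts": [p for p in region["pts"] if p[0] >= mid]}
            regions.remove(region)
            regions.extend([left, right])
    return arms, regions


arms, regions = tree_ucb(lambda a: 1.0 - abs(a - 0.3), T=500)
print(len(regions), min(r["hi"] - r["lo"] for r in regions))
```

Because regions are only ever bisected, the resulting partitions are nested, and the split rule depends only on the observed data, which is in the spirit of conditions (1) and (4) of Theorem 1 below.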

Theorem 1.

*Suppose that the payoff function $f$ defined on a compact domain $\mathcal{A}$ satisfies $f(a)\in[0,1]$ for all $a$ and is Lipschitz. Let $\mathcal{P}_{t}$ be the partition at time $t$ in Algorithm 1. If the tree fitting rule $\mathcal{R}$ satisfies

(1) $\{\mathcal{P}_{t}\}_{t\geq 0}$ is a sequence of nested partitions (or the partition grows finer);

(2) $|\mathcal{P}_{t}|=\ o(t^{\gamma})$ for some $\gamma<1$ ;

(3) $D(p_{t}(a))=o(1)$ for all $a\in\mathcal{A}$ , where*

$$D(p_{t}(a)):=\sup_{a^{\prime},a^{\prime\prime}\in p_{t}(a)}d(a^{\prime},a^{\prime\prime})$$

*is the diameter of region $p_{t}(a)$ ;

(4) given all realized observations $\{(a_{t},y_{t})\}_{t=1}^{T}$ , the partitions $\{\mathcal{P}_{t}\}_{t=1}^{T}$ are deterministic; then the regret for Algorithm 1 satisfies*

$$\lim_{T\rightarrow\infty}\frac{R_{T}(\mathrm{TUCB})}{T}=0$$

with probability 1.

The above assumptions are all mild and reasonable. For item 1, we can use incremental tree learning (Utgoff, 1989) to enforce nested partitions. For item 2, we may put a cap (that may depend on $t$) on the depth of the tree to constrain it. For item 3, we may put a cap (that may depend on $t$) on tree leaf diameters to ensure it. For item 4, any non-random tree learning rule meets this criterion, since in this case the randomness only comes from the data (and/or the number of data points observed).

We now discuss the proof of Theorem 1. Throughout the rest of the paper, we use $\widetilde{\mathcal{O}}$ to omit constants and poly-log terms unless otherwise noted. To prove Theorem 1, we first use Claims 1 and 2 to bound the single-step regret. We then use Lemma 1 and Assumptions (1) – (3) to bound the total regret.

To start with, we first present the following two claims, which may also be carefully extracted from previous works (e.g. Bubeck et al., 2011).

Claim 1.

For an arbitrary arm $a$ , and time $t$ , with probability at least $1-\frac{1}{t^{4}}$ , we have,

$$\left|m_{t-1}(a)-f(a)\right|\leq L\cdot D(p_{t-1}(a))+C\sqrt{\frac{4\log t}{n_{t-1}(a)}}$$

for a constant $C$ that depends only on the a.s. bound of the reward.

Claim 2.

At any $t$ , with probability at least $1-\frac{1}{t^{4}}$ , the single step regret satisfies:

$$f(a^{*})-f(a_{t})\leq 2L\cdot D(p_{t-1}(a_{t}))+2C\sqrt{\frac{4\log t}{n_{t-1}(a_{t})}}\tag{5}$$

for a constant $C$ , that depends only on the a.s. bound of the reward.

In Section 2.2, we prove general versions of Claims 1 and 2.

As the tree (partition) grows finer, the term $n_{t-1}(a)$ is not necessarily increasing with $t$ (for an arbitrary fixed $a$ ). Therefore part of the difficulty is in bounding $\sum_{t=1}^{T}\frac{1}{n_{t-1}(a_{t})}$ . Next, we introduce a new set of inequalities, which we call “point scattering” inequalities in Lemma 1 to bound this term.

Lemma 1 (Point Scattering Inequalities).

For an arbitrary sequence of points $a_{1},a_{2},\cdots$ in a space $\mathcal{A}$ , and any sequence of nested partitions $\mathcal{P}_{1},\mathcal{P}_{2},\cdots$ of the same space $\mathcal{A}$ , we have, for any $T$ ,

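$$\sum_{t=1}^{T}\frac{1}{n_{t-1}(a_{t})}\leq e\,|\mathcal{P}_{T}|\log\left(1+(e-1)\frac{T}{|\mathcal{P}_{T}|}\right),\tag{6}$$

$$\sum_{t=1}^{T}\frac{1}{1+n_{t-1}^{0}(a_{t})}\leq|\mathcal{P}_{T}|\left(1+\log\frac{T}{|\mathcal{P}_{T}|}\right),\tag{7}$$

$$\sum_{t=1}^{T}\left(\frac{1}{1+n_{t-1}^{0}(a_{t})}\right)^{\alpha}\leq\frac{1}{1-\alpha}|\mathcal{P}_{T}|^{\alpha}T^{1-\alpha},\quad 0<\alpha<1,\tag{8}$$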

where $n_{t-1}^{0}$ and $n_{t-1}$ are the count and corrected count function as in Definition 2, and $|\mathcal{P}_{T}|$ is the cardinality of the finite partition $\mathcal{P}_{T}$ .

As defined in Definition 2, $n_{t-1}^{0}(a_{t})$ is the number of points that are in the same bin (in partition $\mathcal{P}_{t-1}$ ) as $a_{t}$ . Also, $n_{t-1}(a_{t})$ is the “corrected” version of $n_{t-1}^{0}(a_{t})$ : $n_{t-1}(a_{t})=\max(1,n_{t-1}^{0}(a_{t}))$ .
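The point scattering inequalities (6) and (7) can be checked numerically on a toy instance. The sketch below assumes the space $[0,1)$, uniformly random points, and a hypothetical refinement schedule in which the partition at time $t$ is the uniform dyadic grid of depth $\lfloor\log_{2}(t+1)/2\rfloor$, which is nondecreasing in $t$ and therefore nested. It is an illustration of the statement, not a proof.

```python
import math
import random


def cell(a, depth):
    """Index of the dyadic cell of [0, 1) at the given depth that contains a."""
    return min(int(a * 2 ** depth), 2 ** depth - 1)


def depth_at(t):
    """Hypothetical schedule: nondecreasing depth, so the partitions are nested."""
    return int(math.log2(t + 1) / 2)


def scattering_sums(points):
    """Left-hand sides of (6) and (7) for the dyadic partitions P_0, P_1, ..."""
    lhs6 = lhs7 = 0.0
    for t in range(1, len(points) + 1):
        d = depth_at(t - 1)                      # partition P_{t-1}
        c = cell(points[t - 1], d)               # region containing a_t
        n0 = sum(1 for x in points[:t - 1] if cell(x, d) == c)
        lhs6 += 1.0 / max(1, n0)                 # corrected count n_{t-1}(a_t)
        lhs7 += 1.0 / (1 + n0)                   # count "including" a_t
    return lhs6, lhs7


rng = random.Random(0)
T = 400
points = [rng.random() for _ in range(T)]
size = 2 ** depth_at(T)                          # |P_T|
lhs6, lhs7 = scattering_sums(points)
rhs6 = math.e * size * math.log(1 + (math.e - 1) * T / size)
rhs7 = size * (1 + math.log(T / size))
print(lhs6, rhs6, lhs7, rhs7)
```

Since the lemma holds for arbitrary point sequences, replacing the random points with an adversarial sequence (for example, all points in one cell) should still leave both left-hand sides below their bounds.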

Remark 2.

Note that (6) allows us to somewhat “look one step ahead of time”, since it uses the values $\{n_{t-1}(a_{t})\}_{t}$, the corrected counts without including $a_{t}$. This is because $n_{t-1}$ is computed using points up to time $t-1$. Inequality (7) differs from (6) in that $\{1+n_{t-1}^{0}(a_{t})\}_{t}$ are essentially the counts including $a_{t}$. While, with proper modification, both (6) and (7) can be used to derive Theorem 1, the difference between (6) and (7) should not be ignored.

2.1.1 Proof of (6)

We use a novel constructive trick to derive (6). This trick and the usefulness of the result (Remarks 2 and 3 and Section 2.3) mark our major technical contribution. The trick is to consider the incidence matrix of which points are within the same partition bin, and use this matrix as if it were a covariance matrix for a Gaussian process. Then, we use knowledge about Gaussian processes to bound the sum of the inverse of the number of points in each bin over time.

For each $T$ , we construct a hypothetical noisy degenerate Gaussian process. We are not assuming our payoffs are drawn from these Gaussian processes. We only use these Gaussian processes as a proof tool. To construct these noisy degenerate Gaussian processes, we define the kernel functions $k_{T}:\mathcal{A}\times\mathcal{A}\rightarrow\mathbb{R}$ ,

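$$k_{T}(a,a^{\prime})=\begin{cases}1, & \text{if } p_{T}(a)=p_{T}(a^{\prime}),\\ 0, & \text{otherwise,}\end{cases}\tag{9}$$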

where $p_{T}$ is the region selection function defined with respect to $\mathcal{P}_{T}$ . The kernel $k_{T}$ is positive semi-definite as shown in Proposition 1.

Proposition 1.

The kernel defined in (9) is positive semi-definite for any $T\geq 1$ .

Proof.

For any $x_{1},\dots,x_{n}$ in the domain of the kernel $k_{T}(\cdot,\cdot)$, the Gram matrix $K=\begin{bmatrix}k_{T}(x_{i},x_{j})\end{bmatrix}_{n\times n}$ can, after a suitable permutation of rows and columns, be written in block diagonal form, where the diagonal blocks are all-one matrices and the off-diagonal blocks are all zeros. Thus, without loss of generality, for any vector $\bm{v}=[v_{1},v_{2},\dots,v_{n}]\in\mathbb{R}^{n}$, $\bm{v}^{\top}K\bm{v}=\sum_{b=1}^{B}\left(\sum_{j:i_{j}\text{ in block }b}v_{i_{j}}\right)^{2}\geq 0$, where the first summation is taken over all diagonal blocks and $B$ is the total number of diagonal blocks in the Gram matrix. ∎
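The identity used in the proof of Proposition 1 can be checked numerically. The sketch below represents each point only by the label of the region it falls in (an assumption made to keep the example short), evaluates the quadratic form $\bm{v}^{\top}K\bm{v}$ directly, and compares it with the sum of squared per-region sums of $\bm{v}$.

```python
import random


def partition_kernel(label_a, label_b):
    """Kernel of the form (9): 1 if both points lie in the same region, else 0."""
    return 1.0 if label_a == label_b else 0.0


def quadratic_form(labels, v):
    """v^T K v computed entry by entry from the Gram matrix."""
    n = len(labels)
    return sum(v[i] * partition_kernel(labels[i], labels[j]) * v[j]
               for i in range(n) for j in range(n))


def block_sum_of_squares(labels, v):
    """Sum over regions of (sum of v entries in that region)^2."""
    sums = {}
    for lab, vi in zip(labels, v):
        sums[lab] = sums.get(lab, 0.0) + vi
    return sum(s * s for s in sums.values())


rng = random.Random(1)
labels = [rng.randrange(4) for _ in range(30)]   # region label of each point
v = [rng.uniform(-1.0, 1.0) for _ in range(30)]  # arbitrary test vector
print(quadratic_form(labels, v), block_sum_of_squares(labels, v))
```

The two printed values agree up to floating-point error, and the second is a sum of squares, so the quadratic form cannot be negative.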

Now, at any time $T$ , let us consider the model $\tilde{y}(a)=g(a)+e_{T}$ where $g$ is drawn from a Gaussian process $g\sim\mathcal{GP}\left(0,k_{T}(\cdot,\cdot)\right)$ and $e_{T}\sim\mathcal{N}(0,s^{2}_{T})$ . Suppose that the arms and hypothetical payoffs $\{(a_{1},\tilde{y}_{1}),(a_{2},\tilde{y}_{2}),\dots,(a_{t},\tilde{y}_{t})\}$ are observed from this Gaussian process. The posterior variance for this Gaussian process after the observations at $a_{1},a_{2},\dots,a_{t}$ is

$$\sigma_{T,t}^{2}(a)=k_{T}(a,a)-\bm{k}_{a}^{\top}(K+s_{T}^{2}I)^{-1}\bm{k}_{a},$$

where $\bm{k}_{a}=[k_{T}(a,a_{1}),\dots,k_{T}(a,a_{t})]^{\top}$ , $K=[k_{T}(a_{i},a_{j})]_{t\times t}$ and $I$ is the identity matrix. In other words, $\sigma^{2}_{T,t}(a)$ is the posterior variance using points up to time $t$ with the kernel defined by the partition at time $T$ . After some matrix manipulation, we know that

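$$\sigma_{T,t}^{2}(a)=1-\bm{1}_{a}^{\top}\left[\bm{1}_{a}\bm{1}_{a}^{\top}+s_{T}^{2}I\right]^{-1}\bm{1}_{a},$$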

where $\bm{1}_{a}=[1,\cdots,1]_{1\times{n^{0}_{T,t}(a)}}^{\top}$ . By the Sherman-Morrison formula, $[\bm{1}_{a}\bm{1}_{a}^{\top}+s^{2}_{T}I]^{-1}=s^{-2}_{T}I-\frac{s^{-4}_{T}\bm{1}_{a}\bm{1}_{a}^{\top}}{1+s^{-2}_{T}n^{0}_{T,t}(a)}$ . Thus the posterior variance is

$$\sigma_{T,t}^{2}(a)=\frac{1}{1+s_{T}^{-2}n_{T,t}^{0}(a)}.\tag{10}$$
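The closed form (10) can be compared against the generic Gaussian process posterior variance $k_{T}(a,a)-\bm{k}_{a}^{\top}(K+s_{T}^{2}I)^{-1}\bm{k}_{a}$ on a small example. The sketch below again represents points by region labels and uses a plain Gaussian elimination solver written for this example, so that it runs with the standard library only; the helper names are assumptions.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def posterior_variance(labels, query_label, s2):
    """k(a, a) - k_a^T (K + s^2 I)^{-1} k_a for the partition kernel."""
    t = len(labels)
    K = [[(1.0 if labels[i] == labels[j] else 0.0) + (s2 if i == j else 0.0)
          for j in range(t)] for i in range(t)]
    k_a = [1.0 if lab == query_label else 0.0 for lab in labels]
    w = solve(K, k_a)
    return 1.0 - sum(ki * wi for ki, wi in zip(k_a, w))


def closed_form(labels, query_label, s2):
    """Closed form 1 / (1 + s^{-2} n^0), with n^0 the count in the query's region."""
    n0 = sum(1 for lab in labels if lab == query_label)
    return 1.0 / (1.0 + n0 / s2)


labels = [0, 1, 0, 2, 0, 1]  # regions of the observed points a_1, ..., a_6
s2 = 0.5                     # noise variance s_T^2 (arbitrary choice)
for q in range(4):           # region 3 contains no observations
    print(q, posterior_variance(labels, q, s2), closed_form(labels, q, s2))
```

For a region with no observations both expressions equal the prior variance 1, and the variance shrinks like $1/n$ as the region fills up, which is the link to the Hoeffding-type term.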

Following the arguments in (Srinivas et al., 2010), we derive the following results. For any $t\leq T$ , and an arbitrary sequence $\bm{a}_{t}=\{a_{1},a_{2},\cdots,a_{t}\}$ , we consider fixing this sequence and query the constructed Gaussian processes at these points. Since $\bm{a}_{t}$ is fixed, the entropy $H(\tilde{\bm{y}}_{t},\bm{a}_{t})=H(\tilde{\bm{y}}_{t})$ . Since, by definition of a Gaussian process, $\tilde{\bm{y}}_{t}$ follows a multivariate Gaussian distribution,

$$H(\tilde{\bm{y}}_{t})=\frac{1}{2}\log\left[(2\pi e)^{t}\det(K+s_{T}^{2}I)\right]\tag{11}$$

where $K=\begin{bmatrix}k_{T}(a_{i},a_{j})\end{bmatrix}_{t\times t}$ . We can then compute $H(\tilde{\bm{y}}_{t})$ by

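$$\begin{aligned}H(\tilde{\bm{y}}_{t})&=H(\tilde{y}_{t}\mid a_{t},\tilde{\bm{y}}_{t-1},\bm{a}_{t-1})+H(\tilde{\bm{y}}_{t-1})\\ &=\frac{1}{2}\log\left(2\pi e\left(s_{T}^{2}+\sigma_{T,t-1}^{2}(a_{t})\right)\right)+H(\tilde{\bm{y}}_{t-1})\\ &=\frac{1}{2}\sum_{\tau=1}^{t}\log\left(2\pi e\left(s_{T}^{2}+\sigma_{T,\tau-1}^{2}(a_{\tau})\right)\right),\qquad(12)\end{aligned}$$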

where (12) comes from recursively expanding $H(\tilde{\bm{y}}_{\tau})$ . By (11) and (12),

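$$\sum_{\tau=1}^{t}\log\left(1+s_{T}^{-2}\sigma_{T,\tau-1}^{2}(a_{\tau})\right)=\log\left[\det(s_{T}^{-2}K+I)\right].\tag{13}$$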

For the block diagonal matrix $K$ of size $t\times t$ , let $n_{i}$ denote the size of block $i$ and $B^{\prime}$ ( $B^{\prime}\leq|\mathcal{P}_{t}|$ ) be the total number of diagonal blocks up to a time $t$ ( $t\leq T$ ). Then we have

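$$\begin{aligned}\det(s_{T}^{-2}K+I)&=\prod_{i=1}^{B^{\prime}}\det\left(s_{T}^{-2}\bm{1}\bm{1}^{\top}+I\right)\\ &=\prod_{i=1}^{B^{\prime}}(1+s_{T}^{-2}n_{i})\leq\left(1+\frac{s_{T}^{-2}t}{B^{\prime}}\right)^{B^{\prime}},\end{aligned}$$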

where $\bm{1}$ is the all-one vector of proper length. In the above, (1) the equality on the first line holds because the determinant of a block-diagonal matrix equals the product of the determinants of its diagonal blocks, (2) the equality on the last line is due to the matrix determinant lemma, and (3) the inequality on the last line is due to the AM-GM inequality and the fact that $\sum_{i=1}^{B^{\prime}}n_{i}=t$.

Next, since $|\mathcal{P}_{t}|\geq B^{\prime}$ and $\left(1+\frac{s_{T}^{-2}t}{x}\right)^{x}$ is increasing with $x$ (on $[1,\infty)$),

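$$\det(s_{T}^{-2}K+I)\leq\left(1+\frac{s_{T}^{-2}t}{B^{\prime}}\right)^{B^{\prime}}\leq\left(1+\frac{s_{T}^{-2}t}{|\mathcal{P}_{t}|}\right)^{|\mathcal{P}_{t}|}.\tag{14}$$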

Therefore, from (13) and (14),

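$$\sum_{\tau=1}^{T}\log\left(1+s_{T}^{-2}\sigma_{T,\tau-1}^{2}(a_{\tau})\right)\leq|\mathcal{P}_{T}|\log\left(1+\frac{s_{T}^{-2}T}{|\mathcal{P}_{T}|}\right),\tag{15}$$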

since arguments after (11) hold for all $t\leq T$ .
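For the partition kernel, the identity (13) and the bound (15) can be verified without any linear algebra. With the closed form (10), each term $1+s_{T}^{-2}\sigma_{T,\tau-1}^{2}(a_{\tau})$ equals $\frac{1+s_{T}^{-2}(n+1)}{1+s_{T}^{-2}n}$, where $n$ is the number of earlier points in the region of $a_{\tau}$, so the product telescopes within each region to $1+s_{T}^{-2}n_{i}$. The sketch below checks this numerically; points are represented by region labels and the number of regions is an arbitrary choice.

```python
import math
import random


def information_sum(labels, s2):
    """Sum of log(1 + s^{-2} sigma^2_{T,tau-1}(a_tau)), using the closed form (10)."""
    seen = {}    # region label -> number of earlier points in that region
    total = 0.0
    for lab in labels:
        n0 = seen.get(lab, 0)
        sigma2 = 1.0 / (1.0 + n0 / s2)          # posterior variance before a_tau
        total += math.log(1.0 + sigma2 / s2)
        seen[lab] = n0 + 1
    return total, seen


def log_det_blocks(block_sizes, s2):
    """log det(s^{-2} K + I) for a block-diagonal Gram matrix of all-one blocks."""
    return sum(math.log(1.0 + n / s2) for n in block_sizes)


rng = random.Random(2)
num_regions = 8                                  # |P_T| (arbitrary choice)
T = 200
s2 = 0.5                                         # noise variance s_T^2 (arbitrary)
labels = [rng.randrange(num_regions) for _ in range(T)]
total, seen = information_sum(labels, s2)
log_det = log_det_blocks(seen.values(), s2)
bound = num_regions * math.log(1.0 + T / (s2 * num_regions))
print(total, log_det, bound)
```

The first two printed values coincide, and both stay below the third, with near-equality in the bound when the points are spread evenly over the regions.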

Since the function $h(\lambda)=\frac{\lambda}{\log(1+\lambda)}$ is increasing for non-negative $\lambda$ , $\lambda\leq\frac{s^{-2}_{T}}{\log(1+s^{-2}_{T})}\log(1+\lambda)$ for $\lambda\in[0,s^{-2}_{T}]$ . Since $\sigma_{T,t}(a)\in[0,1]$ for all $a$ ,

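$$\sigma_{T,t}^{2}(a)\leq\frac{1}{\log(1+s_{T}^{-2})}\log\left(1+s_{T}^{-2}\sigma_{T,t}^{2}(a)\right)\tag{16}$$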

for $t,T=0,1,2,\cdots$ . Since the partitions are nested, we have that for $T_{1}\leq T_{2}$ , $n_{T_{1},t}(a)\geq n_{T_{2},t}(a)$ , and thus $\sigma_{T_{1},t}^{2}(a)\leq\sigma_{T_{2},t}^{2}(a)$ . Suppose we query at points $a_{1},\cdots,a_{T}$ in the Gaussian process $\mathcal{GP}(0,k_{T}(\cdot,\cdot))$ . Then,

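$$\begin{aligned}\sum_{t=1}^{T}\frac{1}{n_{t-1}(a_{t})}&\leq\sum_{t=1}^{T}\frac{1+s_{T}^{-2}}{1+s_{T}^{-2}n_{t-1}(a_{t})}\\ &\leq\sum_{t=1}^{T}\frac{1+s_{T}^{-2}}{1+s_{T}^{-2}n_{T,t-1}^{0}(a_{t})}\leq(1+s_{T}^{-2})\sum_{t=1}^{T}\sigma_{T,t-1}^{2}(a_{t})\qquad(17)\\ &\leq\frac{1+s_{T}^{-2}}{\log(1+s_{T}^{-2})}\sum_{t=1}^{T}\log\left(1+s_{T}^{-2}\sigma_{T,t-1}^{2}(a_{t})\right)\\ &\leq\frac{1+s_{T}^{-2}}{\log(1+s_{T}^{-2})}|\mathcal{P}_{T}|\log\left(1+s_{T}^{-2}\frac{T}{|\mathcal{P}_{T}|}\right),\end{aligned}$$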

where (17) uses (10), the second last inequality uses (16), and the last inequality uses (15). Finally, we optimize over $s_{T}$ . Since $s_{T}^{-2}=e-1$ minimizes $\frac{1+s^{-2}_{T}}{\log(1+s^{-2}_{T})}$ , we have

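$$\sum_{t=1}^{T}\frac{1}{n_{t-1}(a_{t})}\leq e\,|\mathcal{P}_{T}|\log\left(1+(e-1)\frac{T}{|\mathcal{P}_{T}|}\right).$$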

The above argument proves (6).
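The choice $s_{T}^{-2}=e-1$ can be verified with a short calculation. Writing $x=s_{T}^{-2}>0$ and $h(x)=\frac{1+x}{\log(1+x)}$ (the symbols $x$ and $h$ are introduced here only for this check):

$$h^{\prime}(x)=\frac{\log(1+x)-(1+x)\cdot\frac{1}{1+x}}{\log^{2}(1+x)}=\frac{\log(1+x)-1}{\log^{2}(1+x)}.$$

The derivative is negative for $\log(1+x)<1$ and positive for $\log(1+x)>1$, so $h$ is minimized at $\log(1+x)=1$, that is, $x=e-1$, where

$$h(e-1)=\frac{e}{\log e}=e.$$

Substituting $s_{T}^{-2}=e-1$ into the last line of the chain of inequalities gives the prefactor $e$ and the argument $1+(e-1)\frac{T}{|\mathcal{P}_{T}|}$ of the logarithm in (6).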

Remark 3.

One important insight of our analysis is that this allows us to link the Hoeffding-type concentration term to the posterior variance of the constructed Gaussian processes. This connection is directly shown in (10). As we will discuss in Section 2.3.2, we can use this connection to improve the entire learning process via “softening”.

Next, we sketch the proofs of (7) and (8).

**Proof of (7).** Consider the partition $\mathcal{P}_{T}$ at time $T$. We label the regions of the partition by $j=1,2,\cdots,|\mathcal{P}_{T}|$. Let $t_{j,i}$ be the time when the $i$-th point in the $j$-th region of $\mathcal{P}_{T}$ is selected. Let $b_{j}$ be the number of points in region $j$. Since the partitions are nested, we have $1+n_{t_{j,i}-1}^{0}(x_{t_{j,i}})\geq i$ for all $i,j$. We have, for $T\geq 1$,

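$$\begin{aligned}\sum_{t=1}^{T}\frac{1}{1+n_{t-1}^{0}(x_{t})}&=\sum_{j=1}^{|\mathcal{P}_{T}|}\sum_{i=1}^{b_{j}}\frac{1}{1+n_{t_{j,i}-1}^{0}(x_{t_{j,i}})}\leq\sum_{j=1}^{|\mathcal{P}_{T}|}\sum_{i=1}^{b_{j}}\frac{1}{i}\qquad(18)\\ &\leq\sum_{j=1}^{|\mathcal{P}_{T}|}(1+\log b_{j})=|\mathcal{P}_{T}|+\sum_{j=1}^{|\mathcal{P}_{T}|}\log b_{j}\\ &\leq|\mathcal{P}_{T}|+|\mathcal{P}_{T}|\log\frac{T}{|\mathcal{P}_{T}|},\qquad(19)\end{aligned}$$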

where (18) uses $1+n_{t_{j,i}-1}^{0}(x_{t_{j,i}})\geq i$ and (19) uses AM-GM inequality and that $\sum_{j=1}^{|\mathcal{P}_{T}|}b_{j}=T$ .

**Proof of (8).** The idea is similar to that of (7). For $0<\alpha<1$,

[TABLE]

where (20) is due to Hölder’s inequality and the fact that $\sum_{j=1}^{|\mathcal{P}_{T}|}b_{j}=T$ .
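The two ingredients of this argument can be written out explicitly. The derivation below is a sketch that assumes, as in the proof of (7), that the left-hand side of (8) reduces to $\sum_{j}\sum_{i=1}^{b_{j}}i^{-\alpha}$. First, an integral comparison within each region:

$$
\sum_{i=1}^{b_{j}}i^{-\alpha}\;\leq\;\int_{0}^{b_{j}}x^{-\alpha}\,dx\;=\;\frac{b_{j}^{1-\alpha}}{1-\alpha}.
$$

Then Hölder’s inequality with exponents $\frac{1}{1-\alpha}$ and $\frac{1}{\alpha}$ across regions:

$$
\sum_{j=1}^{|\mathcal{P}_{T}|}b_{j}^{1-\alpha}\cdot 1\;\leq\;\Big(\sum_{j=1}^{|\mathcal{P}_{T}|}b_{j}\Big)^{1-\alpha}\Big(\sum_{j=1}^{|\mathcal{P}_{T}|}1\Big)^{\alpha}\;=\;T^{1-\alpha}\,|\mathcal{P}_{T}|^{\alpha}.
$$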

Proof of Theorem 1

Now we are ready to prove Theorem 1. We can split the sum of regrets by

[TABLE]

Also, by Claim 2, with probability at least $1-\frac{1}{3\left\lfloor\sqrt{T}\right\rfloor^{3}}$ , (5) holds simultaneously for all $t=\left\lfloor\sqrt{T}\right\rfloor+1,\cdots,T$ ( $T\geq 2$ ). Thus for $T\geq 2$ , the event

[TABLE]

occurs with probability at most $\frac{1}{3\left\lfloor\sqrt{T}\right\rfloor^{3}}$ . Since $\frac{1}{3\left\lfloor\sqrt{T}\right\rfloor^{3}}\sim\frac{1}{3T^{3/2}}$ , we have $\sum_{T=2}^{\infty}\mathbb{P}(E_{T})<\infty$ . By the Borel-Cantelli lemma, $\mathbb{P}\left(\limsup_{T\rightarrow\infty}E_{T}\right)=0$ . In other words, with probability 1, $E_{T}$ occurs only finitely many times. Thus, with probability 1, there exists $T_{0}$ such that the event $\overline{E}_{T}$ (the negation of $E_{T}$ ) occurs for all $T>T_{0}$ . Also, from the Cauchy-Schwarz inequality (used below in the second line) and (6) (used below in the last line), we know that

[TABLE]

where the last equality is from the assumption that $|\mathcal{P}_{t}|=o(t^{\gamma})$ for some $\gamma<1$ . This means

[TABLE]

In addition, by the assumption that $D(p_{t}(a))=o(1)$ , we know $\limsup_{T\rightarrow\infty}\frac{1}{T}\sum_{t=1}^{T}D(p_{t-1}(a))=0$ . The above two limits give us

[TABLE]

Combining all the facts above, we have $\lim_{T\rightarrow\infty}\frac{R_{T}}{T}=0$ with probability 1.
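Two quantitative steps in the proof above can be spelled out. The sketch assumes that the bound of Claim 2 fails at time $t$ with probability at most $\frac{1}{t^{4}}$, mirroring its contextual counterpart in Claim 4. A union bound over $t=\left\lfloor\sqrt{T}\right\rfloor+1,\cdots,T$ followed by an integral comparison gives

$$
\sum_{t=\lfloor\sqrt{T}\rfloor+1}^{T}\frac{1}{t^{4}}\;\leq\;\int_{\lfloor\sqrt{T}\rfloor}^{\infty}\frac{dx}{x^{4}}\;=\;\frac{1}{3\lfloor\sqrt{T}\rfloor^{3}},
$$

and summability over $T$ follows from $\lfloor x\rfloor\geq x/2$ for $x\geq 1$:

$$
\frac{1}{3\lfloor\sqrt{T}\rfloor^{3}}\;\leq\;\frac{8}{3\,T^{3/2}},\qquad\sum_{T=2}^{\infty}\frac{8}{3\,T^{3/2}}<\infty .
$$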

Adaptive partitioning: TUCB can be implemented using regression trees or incremental regression trees, which naturally leverages their practical advantages. The leaves of a regression tree form a partition of the space, and a regression tree is designed to fit an underlying function. This leads to an adaptive partitioning in which the underlying function values within each region are relatively similar to each other. We defer the discussion of the implementation used in our experiments to Section 3. See Breiman et al. (1984) for more details about regression tree fitting.

2.2 The Contextual TreeUCB algorithm

In this section, we present an extension of Algorithm 1 for the contextual stochastic bandit problem. The contextual stochastic bandit problem is an extension to the stochastic bandit problem. In this problem, at each time, context information is revealed, and the agent chooses an arm based on past experience as well as the contextual information. Formally, the expected payoff function $f$ is defined over the product of the context space $\mathcal{Z}$ and the arm space $\mathcal{A}$ and takes values from $[0,1]$ . Similar to the previous discussions, compactness of the product space and Lipschitzness of the payoff function are assumed. In addition, a mean zero, almost surely bounded noise that is independent of the expected reward function is added to the observed rewards. At each time $t$ , a contextual vector $z_{t}\in\mathcal{Z}$ is revealed and the agent plays an arm $a_{t}\in\mathcal{A}$ . The performance of the agent following algorithm Alg is measured by the cumulative contextual regret

[TABLE]

where $f(z_{t},a_{t}^{*})$ is the maximal value of $f$ given contextual information $z_{t}$ . A simple extension of Algorithm 1 solves the contextual version of the problem. In particular, in the contextual case, we partition the joint space $\mathcal{Z}\times\mathcal{A}$ instead of the arm space $\mathcal{A}$ . In analogy with (1) and (2), we define the corrected count $n_{t}$ and the corrected average $m_{t}$ over the joint space $\mathcal{Z}\times\mathcal{A}$ with respect to the partition $\mathcal{P}_{t}$ of the joint space and the observations $((z_{1},a_{1}),y_{1},\cdots,(z_{t},a_{t}),y_{t})$ . The guarantee of Algorithm 2 is in Theorem 2.
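To make the cumulative contextual regret concrete, the following minimal sketch evaluates it for a known payoff function. The payoff `payoff`, the finite arm grid, and the example sequences are hypothetical choices made for illustration; in particular, approximating the maximum over $\mathcal{A}$ on a grid is an assumption of the sketch, not part of the definition.

```python
def contextual_regret(f, contexts, arms_played, arm_grid):
    """Cumulative contextual regret: sum_t [max_a f(z_t, a) - f(z_t, a_t)].

    The maximum over the arm space is approximated on the finite grid `arm_grid`.
    """
    total = 0.0
    for z, a in zip(contexts, arms_played):
        best = max(f(z, candidate) for candidate in arm_grid)
        total += best - f(z, a)
    return total

def payoff(z, a):
    """Hypothetical 1-Lipschitz payoff on [0, 1] x [0, 1]; the best arm equals the context."""
    return 1.0 - abs(z - a)

grid = [i / 100 for i in range(101)]
contexts = [0.2, 0.5, 0.9]
arms = [0.2, 0.4, 0.6]
print(contextual_regret(payoff, contexts, arms, grid))
```

Because the benchmark $f(z_{t},a_{t}^{*})$ is recomputed for every context, an agent that always plays a single fixed arm can accumulate regret even if that arm is optimal on average.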

Theorem 2.

Suppose that the payoff function $f$ defined on a compact doubling metric space $(\mathcal{Z}\times\mathcal{A},d)$ satisfies $f(z,a)\in[0,1]$ for all $(z,a)$ and is Lipschitz. If the tree growing rule satisfies requirements 1-4 listed in Theorem 1, then $\lim_{T\rightarrow\infty}\frac{R_{T}^{c}(CTUCB)}{T}=0$ with probability 1.

Theorem 2 follows from Theorem 1. Since the point scattering inequality holds for any sequence of (context-)arms, we can replace regret with contextual regret and alter Claims 1 and 2 accordingly to prove Theorem 2.

In particular, Claims 1 and 2 extend to the contextual setting, as stated and proved below.

Claim 3.

For any context $z$ , arm $a$ , and time $t$ , with probability at most $\frac{1}{t^{4}}$ , we have:

[TABLE]

for a constant $C$ .

Proof.

First of all, when $t=1$ , this is trivially true by Lipschitzness. Now consider the case $t\geq 2$ . Let $A_{1},A_{2},\cdots,A_{t}$ denote the random arms selected up to time $t$ , $Z_{1},Z_{2},\cdots,Z_{t}$ the random contexts up to time $t$ , and $Y_{1},Y_{2},\cdots,Y_{t}$ the random rewards received up to time $t$ . Then the sequence of partial sums $\left\{\sum_{t=1}^{T}\left(f(Z_{t},A_{t})-Y_{t}\right)\right\}$ is a martingale. This is easy to verify since the noise is mean zero and independent. In addition, since there is no randomness in the partition formation (given a sequence of observations), for fixed $(z,a)$ the indicator $\mathbb{I}[(Z_{i},A_{i})\in p_{t-1}(z,a)]$ ( $i\leq t$ ) is measurable with respect to $\sigma(Z_{1},A_{1},Y_{1},\cdots,Z_{t},A_{t},Y_{t})$ . Therefore, the sequence $\left\{\sum_{i=1}^{t}\left(f(Z_{i},A_{i})-Y_{i}\right)\mathbb{I}[(Z_{i},A_{i})\in p_{t-1}(z,a)]\right\}_{t=1}^{T}$ is a skipped martingale. Since a skipped martingale is also a martingale, we can apply the Azuma-Hoeffding inequality (with sub-Gaussian tails) (Shamir, 2011). For simplicity, we write

[TABLE]

Combining this with Lipschitzness, we obtain a constant $C$ (depending on the a.s. bound of the reward, as a result of the Hoeffding inequality) such that

[TABLE]

where (27) uses both Lipschitzness and the Azuma-Hoeffding inequality. ∎

Claim 4.

At any $t$ , with probability at least $1-\frac{1}{t^{4}}$ , the single step contextual regret satisfies:

[TABLE]

for a constant $C$ . Here $a_{t}^{*}$ is the optimal arm for the context $z_{t}$ .

Proof.

By Claim 3, with probability at least $1-\frac{1}{t^{4}}$ , the following ((28) and (29)) hold simultaneously,

[TABLE]

This is true since we first take a one-sided version of Hoeffding-type tail bound in (24), and then take a union bound over the two points $(z_{t},a_{t})$ and $(z_{t},a_{t}^{*})$ . This first halves the probability bound and then doubles it. Then we take the complementary event to get (28) and (29) simultaneously hold with probability at least $1-\frac{1}{t^{4}}$ . We then take another union bound over time $t$ , as discussed in the main text. Note that throughout the proof, we do not need to take union bounds over all arms or all regions in the partition.
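The probability bookkeeping in the preceding paragraph amounts to the following arithmetic, assuming that the two-sided tail bound of Claim 3 fails with probability at most $\frac{1}{t^{4}}$ and that its two tails contribute equally:

$$
\mathbb{P}\big(\text{a one-sided bound fails at }(z_{t},a_{t})\text{ or at }(z_{t},a_{t}^{*})\big)\;\leq\;\frac{1}{2t^{4}}+\frac{1}{2t^{4}}\;=\;\frac{1}{t^{4}} .
$$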

Equation (28) holds by the definition of the algorithm; otherwise we would not select $a_{t}$ at time $t$ . Combining (28) and (29), we get

[TABLE]

∎

2.3 Use Cases of Point Scattering Inequalities

2.3.1 Recover Previous Bounds

In this section, we give examples of using the point scattering inequalities to derive regret bounds for other algorithms. For our purpose of illustrating the point scattering inequalities, the discussed algorithms are simplified. We also assume that the reward and the sub-Gaussianity are properly scaled so that the parameter before the Hoeffding-type concentration term is $1$ .

**The UCB1 algorithm.** The classic UCB1 algorithm (Auer et al., 2002) assumes a finite set of arms, each having a different reward distribution. Following our notation, at time $t$ , the UCB1 algorithm plays

[TABLE]

Indeed, this equation can be interpreted as (4) under the discrete 0-1 metric: two points are distance zero if they coincide and distance 1 otherwise. Then from the point scattering inequality (6), we get for UCB1

[TABLE]

where $K$ is the number of arms in the problem. This matches the gap-independent bound (that is, independent of the reward gap between an arm and the optimal arm) derived using traditional methods for the UCB1 algorithm (Auer et al., 2002; Bubeck and Cesa-Bianchi, 2012). In this analysis, we apply the point scattering inequality with the partition $\mathcal{P}_{t}$ being the set of arms at all $t$ .
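For reference, the UCB1 index rule discussed above can be sketched in a few lines. The sketch uses the classical exploration term $\sqrt{2\log t/n}$ from Auer et al. (2002) rather than the rescaled form assumed in this section, and the Bernoulli arm means, horizon, and seed are hypothetical values chosen for illustration.

```python
import math
import random

def ucb1(arm_means, horizon, seed=0):
    """Run UCB1 on Bernoulli arms and return the number of pulls of each arm."""
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # play every arm once so that each count is positive
        else:
            # empirical mean plus Hoeffding-type confidence width
            arm = max(
                range(k),
                key=lambda j: sums[j] / counts[j]
                + math.sqrt(2.0 * math.log(t) / counts[j]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts

pulls = ucb1([0.2, 0.5, 0.8], horizon=2000)
print(pulls)
```

In the language of this section, each arm is its own region of a fixed partition, so the pull counts play the role of the per-region counts in the point scattering inequality.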

**Finite Time Bound for Lipschitz Bandits and Lipschitz RL.** As shown in Claim 2, the single step regret is bounded by a Hoeffding-type concentration term and the diameter of the selected region (due to Lipschitzness). Since the point scattering inequalities bound the sum of the Hoeffding terms, and since the choice of partition is ours, we can design and analyze many partition-based Lipschitz algorithms using them. Examples include the UniformMesh algorithm discussed by Kleinberg et al. (2008) and the partition-based Lipschitz reinforcement learning algorithms studied recently (e.g., Ni et al., 2019).

2.3.2 Hierarchical Bayesian Method for Lipschitz Bandits

Existing Lipschitz bandit algorithms (e.g., Kleinberg et al., 2008) partition the arm space into disjoint bins. Under such a partition, arms in two different bins give no information about each other, and all arms within the same bin are treated as identical. This implicit assumption, however, is clearly unrealistic. On the other hand, imposing a strong prior on the reward function would break the Lipschitzness assumption. To address these two difficulties simultaneously, we link the learned tree (or partition) to a Bayesian model in light of our analysis of (6). This new viewpoint allows us to “soften” the entire model using a hierarchical Bayesian method.

Formally, at each time $t$ , we consider the following hierarchical Bayesian problem with respect to the learned partition $\mathcal{P}_{t}$ . Note that this hierarchical Bayesian model is updated whenever we update the partition. This is roughly equivalent to forming a finite partition and treating each bin as an arm, without imposing extra structure on the reward function. Let $\mathcal{P}_{t}$ be the learned partition such that each bin is a rectangle. Then the kernel function is defined as

[TABLE]

where $p$ ranges over the regions in $\mathcal{P}_{T}$ , and $\widetilde{k}_{T}^{(p)}(\cdot,\cdot)$ is defined as follows. For a region $p=\prod_{i=1}^{d}[a_{i},b_{i}]$ , define

[TABLE]

where $\bm{x}_{i}$ (resp. $\bm{x}_{i}^{\prime}$ ) is the $i$ -th entry of $\bm{x}$ (resp. $\bm{x}^{\prime}$ ), and $\alpha_{T}>0$ is a parameter that controls how smooth the smoothed tree metric is. Given a learned partition $\mathcal{P}_{T}=\{p_{1},p_{2},\cdots,p_{K}\}$ , where $p_{j}=\prod_{i=1}^{d}[a_{j}^{(i)},b_{j}^{(i)}]$ , we construct the following hierarchical Bayesian model

[TABLE]

This hierarchical model has several advantages: (1) It respects Lipschitzness. As we collect more observations, the partition can grow arbitrarily fine, and the approximation can be arbitrarily close to an exact indicator function. Because of this, no prior smoothness assumption on the true (unknown) reward function is needed. (2) It treats arms within the same bin differently, and can use information across bins.
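The contrast between a hard box kernel and a softened one can be illustrated with a small sketch. The difference-of-sigmoids surrogate used here is one standard smooth approximation of an indicator and is an assumption of this example; it is not necessarily the exact form of $\widetilde{k}_{T}^{(p)}$ in the paper. The names `soft_indicator`, `box_kernel`, and `soft_kernel`, and the two-bin partition, are illustrative.

```python
import math

def sigmoid(u):
    """Numerically stable logistic function."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)

def hard_indicator(x, box):
    """1 if x lies in the half-open box prod_i [a_i, b_i), else 0.
    Half-open boxes are used so that each point lies in at most one box."""
    return 1.0 if all(a <= xi < b for xi, (a, b) in zip(x, box)) else 0.0

def soft_indicator(x, box, alpha):
    """Smooth surrogate for the indicator: product over coordinates of
    sigmoid(alpha * (x_i - a_i)) - sigmoid(alpha * (x_i - b_i))."""
    value = 1.0
    for xi, (a, b) in zip(x, box):
        value *= sigmoid(alpha * (xi - a)) - sigmoid(alpha * (xi - b))
    return value

def box_kernel(x, y, partition):
    """Hard kernel: 1 if x and y fall in the same box, else 0."""
    return sum(hard_indicator(x, box) * hard_indicator(y, box) for box in partition)

def soft_kernel(x, y, partition, alpha):
    """Softened kernel: sum over boxes of soft_indicator(x) * soft_indicator(y)."""
    return sum(
        soft_indicator(x, box, alpha) * soft_indicator(y, box, alpha)
        for box in partition
    )

partition = [[(0.0, 0.5)], [(0.5, 1.0)]]  # two hypothetical 1-D bins
same_bin = ([0.2], [0.3])
cross_bin = ([0.2], [0.8])
for alpha in (10, 100, 1000):
    print(alpha, soft_kernel(*same_bin, partition, alpha),
          soft_kernel(*cross_bin, partition, alpha))
print(box_kernel(*same_bin, partition), box_kernel(*cross_bin, partition))
```

For small `alpha` the softened kernel assigns a positive value to points in different bins, which is how information is shared across bins; as `alpha` grows it approaches the hard box kernel away from bin boundaries.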

Returning to the bandit learning process, we can replace the mean and/or confidence intervals of the UCB index with the posteriors of this hierarchical Bayesian model. As discussed in Remark 3, a key insight of our analysis is the link between the Hoeffding-type concentration interval and the posterior variance of the Gaussian processes, which makes this substitution principled. In Section 3.1, we empirically study this hierarchical Bayesian model.

3 Empirical Study

Since the TreeUCB algorithm imposes only mild constraints on tree formation, we use greedy decision tree splitting to fit the reward function, using the following splitting rule: we find the split that maximizes the reduction in the Mean Absolute Error (MAE), and we stop growing the tree once the maximal possible reduction is below 0.001.
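A minimal one-dimensional sketch of this splitting rule is given below. It assumes that each leaf predicts the median of its rewards (the constant that minimizes absolute error) and that the reduction is measured as the drop in sample-weighted MAE; neither detail is stated in the text, and the function names are illustrative.

```python
import statistics

def mae(values):
    """Mean absolute error of predicting the median of `values`."""
    center = statistics.median(values)
    return sum(abs(v - center) for v in values) / len(values)

def best_split(xs, ys, min_reduction=0.001):
    """Return (threshold, reduction) for the best 1-D split, or None when the
    largest achievable MAE reduction is below `min_reduction`."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    xs_sorted = [xs[i] for i in order]
    ys_sorted = [ys[i] for i in order]
    n = len(ys_sorted)
    parent = mae(ys_sorted)
    best = None
    for cut in range(1, n):
        if xs_sorted[cut] == xs_sorted[cut - 1]:
            continue  # identical inputs cannot be separated by a threshold
        left, right = ys_sorted[:cut], ys_sorted[cut:]
        child = (len(left) * mae(left) + len(right) * mae(right)) / n
        reduction = parent - child
        if best is None or reduction > best[1]:
            threshold = 0.5 * (xs_sorted[cut - 1] + xs_sorted[cut])
            best = (threshold, reduction)
    if best is None or best[1] < min_reduction:
        return None  # stop growing this leaf
    return best

# Hypothetical arms and rewards with a jump between 0.3 and 0.7.
xs = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
ys = [0.2, 0.2, 0.2, 0.8, 0.8, 0.8]
print(best_split(xs, ys))
```

Applying `best_split` recursively to the two sides of each accepted threshold yields the leaves of the tree, which serve as the regions of the partition.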

3.1 Gaussian Processes with Learned Kernel

In this section, we compare several baselines, including piecewise constant estimates (within each bin), Gaussian process regression with a box kernel (left subfigure in Figure 2), and Gaussian process regression with a softened box kernel (right subfigure in Figure 2). The splitting procedure is the same for all methods, so the partitions are the same across methods. Our results, shown in Figure 3, demonstrate a transition from the hardness of the piecewise constant estimate to the softness of the Gaussian process regression with the softened kernel. This justifies the “softening” discussed in Section 2.3.2. The Gaussian process kernel parameters for $GP_{S,1},GP_{S,2},GP_{S,3},GP_{S,4},GP_{S,5}$ , namely $\alpha_{T}$ in Eq. (33), were set to $10,50,100,500,1000$ respectively.

3.2 Application to Neural Network Tuning

One application of stochastic bandit algorithms is zeroth order optimization. In this section, we apply TUCB to tuning neural networks. In this setting, we treat the hyperparameter configurations (e.g., learning rate, network architecture) as the arms of the bandit, and use validation accuracy as reward. The task is to select a hyperparameter configuration and train the network to observe the validation accuracies, and find the best hyperparameter configuration rapidly. This experiment shows that TUCB can compete with the state-of-the-art tuning methods on such hard real-world tasks.

The architecture and the hyperparameter space for the simple Multi-Layer Perceptron (MLP) for the MNIST dataset are: in the feed-forward direction, there are the input layer, the fully connected hidden layer with dropout ensemble, and then the output layer. The hyperparameter search space is five dimensional, including number of hidden neurons (range $[10,784]$ ), learning rate ( $[0.0001,4)$ ), dropout rate ( $[0.1,0.9)$ ), batch size ( $[10,500]$ ), and number of iterations ( $[30,243]$ ).
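The five-dimensional search space above can be written down as a small configuration sketch. The ranges are copied from the text; the key names, the uniform sampling scheme, and the rescaling to the unit cube are illustrative assumptions and are not taken from the paper's implementation.

```python
import random

# Ranges copied from the text; key names are hypothetical.
SEARCH_SPACE = {
    "hidden_neurons": ("int", 10, 784),
    "learning_rate": ("float", 0.0001, 4.0),
    "dropout_rate": ("float", 0.1, 0.9),
    "batch_size": ("int", 10, 500),
    "num_iterations": ("int", 30, 243),
}

def sample_configuration(rng):
    """Draw one configuration uniformly at random from the search space."""
    config = {}
    for name, (kind, low, high) in SEARCH_SPACE.items():
        if kind == "int":
            config[name] = rng.randint(low, high)  # inclusive on both ends
        else:
            # uniform() can return the upper endpoint in rare rounding cases
            config[name] = rng.uniform(low, high)
    return config

def to_unit_cube(config):
    """Rescale a configuration to [0, 1]^5 so that it can be treated as an arm."""
    return [
        (config[name] - low) / (high - low)
        for name, (kind, low, high) in SEARCH_SPACE.items()
    ]

rng = random.Random(0)
config = sample_configuration(rng)
print(config)
print(to_unit_cube(config))
```

Rescaling every coordinate to $[0,1]$ puts all hyperparameters on a common scale before the arm space is partitioned, so that region diameters are comparable across dimensions.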

The details of the CNN settings for SVHN and CIFAR-10 can be found in Tables 1 and 2. The results are shown in Figure 4, indicating that TUCB outperforms existing state-of-the-art software packages for neural network tuning.

4 Conclusion

We propose the TreeUCB and the Contextual TreeUCB frameworks that use decision trees (regression trees) to flexibly partition the arm space and the context-arm space as an Upper Confidence Bound strategy is played across the partition regions. We also provide regret analysis via the point scattering inequalities. We provide implementations using decision trees that learn the partition. TUCB is competitive with the state-of-the-art hyperparameter optimization methods in hard tasks like neural-net tuning, and could save substantial computing resources. This suggests that, in addition to random search and Bayesian optimization methods, more bandit algorithms should be considered as benchmarks for difficult real-world problems such as neural network tuning.

Acknowledgement

The authors are grateful to Aaron J Fisher and Tiancheng Liu for insightful discussions. The authors thank anonymous reviewers for valuable feedback. The project is partially supported by the Alfred P. Sloan Foundation through the Duke Energy Data Analytics fellowship.

Bibliography

Abbasi-yadkori et al. (2011). Yasin Abbasi-yadkori, Dávid Pál, and Csaba Szepesvári. 2011. Improved Algorithms for Linear Stochastic Bandits. In Advances in Neural Information Processing Systems 24. Curran Associates, Inc., 2312–2320.
Agrawal and Goyal (2012). Shipra Agrawal and Navin Goyal. 2012. Analysis of Thompson Sampling for the Multi-armed Bandit Problem (Proceedings of Machine Learning Research, Vol. 23). JMLR Workshop and Conference Proceedings, Edinburgh, Scotland, 39.1–39.26.
Agrawal and Goyal (2013). Shipra Agrawal and Navin Goyal. 2013. Thompson Sampling for Contextual Bandits with Linear Payoffs (Proceedings of Machine Learning Research, Vol. 28). PMLR, Atlanta, Georgia, USA, 127–135.
Auer (2002). Peter Auer. 2002. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research 3, Nov (2002), 397–422.
Auer et al. (2002). Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. 2002. Finite-time analysis of the multiarmed bandit problem. Machine Learning 47, 2-3 (2002), 235–256.
Auer et al. (2007). Peter Auer, Ronald Ortner, and Csaba Szepesvári. 2007. Improved rates for the stochastic continuum-armed bandit problem. In International Conference on Computational Learning Theory. Springer.
Breiman et al. (1984). Leo Breiman, Jerome Friedman, Charles J Stone, and Richard A Olshen. 1984. Classification and regression trees. CRC Press.
