Beyond the Hazard Rate: More Perturbation Algorithms for Adversarial Multi-armed Bandits

Zifan Li; Ambuj Tewari

arXiv:1702.05536·cs.LG·January 9, 2018

TL;DR

This paper extends the analysis of follow the perturbed leader algorithms for adversarial multi-armed bandits by introducing the generalized hazard rate, allowing for regret bounds with distributions like Gaussian and uniform.

Contribution

It introduces the generalized hazard rate concept and provides regret bounds for FTPL algorithms without the bounded hazard rate assumption, including for Gaussian and uniform distributions.

Findings

01

Gaussian distribution can achieve near-optimal regret.

02

Regret bounds are established for distributions with unbounded support.

03

Disproves the conjecture that Gaussian cannot be used for low-regret algorithms.

Equations

$$\mathrm{Regret}_{T}:=\max_{i\in[N]}\sum_{t=1}^{T}\left(g_{t,i}-g_{t,i_{t}}\right).$$

$$G_{t}:=\sum_{s=1}^{t}g_{s}.$$

$$\mathbb{E}[p_{t,i}\hat{g}_{t,i}\mid i_{1:t-1}]=p_{t,i}\,g_{t,i}.$$

$$\mathbb{E}[\hat{g}_{t}\mid i_{1:t-1}]\succeq g_{t},$$

$$\mathbb{E}\,\mathrm{Regret}_{T}=\Phi(G_{T})-\mathbb{E}\left[\sum_{t=1}^{T}\langle p_{t},g_{t}\rangle\right].$$

$$\mathbb{E}\,\mathrm{Regret}_{T}\leq\underbrace{\tilde{\Phi}(0)}_{\text{overestimation penalty}}+\mathbb{E}\underbrace{\left[\Phi(\hat{G}_{T})-\tilde{\Phi}(\hat{G}_{T})\right]}_{\text{underestimation penalty}}+\mathbb{E}\sum_{t=1}^{T}\underbrace{\mathbb{E}\left[D_{\tilde{\Phi}}(\hat{G}_{t},\hat{G}_{t-1})\mid i_{1:t-1}\right]}_{\text{divergence penalty}},$$

$$\mathrm{Regret}_{T}=\Phi(G_{T})-\sum_{t=1}^{T}\langle\mathbf{e}_{i_{t}},g_{t}\rangle.$$

$$\mathbb{E}\,\mathrm{Regret}_{T}=\Phi(G_{T})-\mathbb{E}\left[\sum_{t=1}^{T}\langle p_{t},\hat{g}_{t}\rangle\right].$$

$$\Phi(G_{T})\leq\Phi(\mathbb{E}[\hat{G}_{T}])\leq\mathbb{E}[\Phi(\hat{G}_{T})],$$

$$\mathbb{E}\,\mathrm{Regret}_{T}\leq\mathbb{E}\left[\Phi(\hat{G}_{T})-\sum_{t=1}^{T}\langle p_{t},\hat{g}_{t}\rangle\right].$$

$$D_{\tilde{\Phi}}(\hat{G}_{t},\hat{G}_{t-1})=\tilde{\Phi}(\hat{G}_{t})-\tilde{\Phi}(\hat{G}_{t-1})-\langle\nabla\tilde{\Phi}(\hat{G}_{t-1}),\hat{G}_{t}-\hat{G}_{t-1}\rangle,$$

$$\begin{aligned}-\sum_{t=1}^{T}\langle p_{t},\hat{g}_{t}\rangle&=-\sum_{t=1}^{T}\langle\nabla\tilde{\Phi}(\hat{G}_{t-1}),\hat{G}_{t}-\hat{G}_{t-1}\rangle\\&=\sum_{t=1}^{T}\left(D_{\tilde{\Phi}}(\hat{G}_{t},\hat{G}_{t-1})+\tilde{\Phi}(\hat{G}_{t-1})-\tilde{\Phi}(\hat{G}_{t})\right)\\&=\tilde{\Phi}(\hat{G}_{0})-\tilde{\Phi}(\hat{G}_{T})+\sum_{t=1}^{T}D_{\tilde{\Phi}}(\hat{G}_{t},\hat{G}_{t-1}).\end{aligned}$$

$$\tilde{\Phi}(G;\mathcal{D})=\mathbb{E}_{Z_{1},\dots,Z_{N}\overset{\text{i.i.d.}}{\sim}\mathcal{D}}\,\Phi(G+Z),$$

$$\Phi(G)+\mathbb{E}[Z_{1}]\leq\tilde{\Phi}(G)\leq\Phi(G)+EMAX(N)$$

$$\mathbb{E}_{Z_{1},\dots,Z_{N}}\left[\max_{i}Z_{i}\right]\leq EMAX(N).$$

$$\begin{aligned}\Phi(G)+\mathbb{E}[Z_{1}]&\leq\mathbb{E}\left[\max_{i}(G_{i}+Z_{i})\right]=\tilde{\Phi}(G)\\&\leq\mathbb{E}\left[\max_{i}G_{i}+\max_{i}Z_{i}\right]=\max_{i}G_{i}+\mathbb{E}\left[\max_{i}Z_{i}\right]=\Phi(G)+\mathbb{E}\left[\max_{i}Z_{i}\right].\end{aligned}$$

$$\nabla\tilde{\Phi}(G;\mathcal{D})=\mathbb{E}_{Z_{1},\dots,Z_{N}\overset{\text{i.i.d.}}{\sim}\mathcal{D}}\,\mathbf{e}_{i^{*}},\quad\text{where }i^{*}=\arg\max_{i=1,\dots,N}\{G_{i}+Z_{i}\}.$$

$$\nabla_{i}\tilde{\Phi}(G)=\frac{\partial\tilde{\Phi}}{\partial G_{i}}=\mathbb{E}_{\tilde{G}_{-i}}\left[\mathbb{P}_{Z_{i}}\left[Z_{i}>\tilde{G}_{-i}-G_{i}\right]\right]=\mathbb{E}_{\tilde{G}_{-i}}\left[1-F(\tilde{G}_{-i}-G_{i})\right]$$

$$\nabla_{ii}^{2}\tilde{\Phi}(G)=\mathbb{E}_{\tilde{G}_{-i}}\left[\frac{\partial}{\partial G_{i}}\left(1-F(\tilde{G}_{-i}-G_{i})\right)\right]=\mathbb{E}_{\tilde{G}_{-i}}f(\tilde{G}_{-i}-G_{i}).$$

$$\sup_{z}\frac{f(z)}{1-F(z)}<\infty,$$

$$\mathbb{E}\,\mathrm{Regret}_{T}=O\Big(\sqrt{NT\times EMAX(N)}\Big).$$

$$\mathbb{E}\left[D_{\tilde{\Phi}}(\hat{G}_{t},\hat{G}_{t-1})\mid i_{1:t-1}\right]=\sum_{i\in\mathrm{supp}(p_{t})}p_{t,i}\int_{0}^{|g_{t,i}/p_{t,i}|}\mathbb{E}_{\hat{G}_{-i}}\left[\int_{0}^{s}f(\hat{G}_{-i}-\hat{G}_{t-1,i}+r)\,dr\right]ds$$

$$h_{i}(r)=D_{\tilde{\Phi}}(\hat{G}-r\mathbf{e}_{i},\hat{G}),$$

$$\begin{aligned}\mathbb{E}\left[D_{\tilde{\Phi}}(\hat{G}+\hat{g},\hat{G})\mid i_{1:t-1}\right]&=\sum_{i\in\mathrm{supp}(p)}p_{i}\,h_{i}(|g_{i}/p_{i}|)=\sum_{i\in\mathrm{supp}(p)}p_{i}\int_{0}^{|g_{i}/p_{i}|}\int_{0}^{s}h_{i}^{\prime\prime}(r)\,dr\,ds\\&=\sum_{i\in\mathrm{supp}(p)}p_{i}\int_{0}^{|g_{i}/p_{i}|}\int_{0}^{s}\nabla_{ii}^{2}\tilde{\Phi}(\hat{G}-r\mathbf{e}_{i})\,dr\,ds\\&=\sum_{i\in\mathrm{supp}(p)}p_{i}\int_{0}^{|g_{i}/p_{i}|}\int_{0}^{s}\mathbb{E}_{\hat{G}_{-i}}f(\hat{G}_{-i}-\hat{G}_{i}+r)\,dr\,ds\\&=\sum_{i\in\mathrm{supp}(p)}p_{i}\int_{0}^{|g_{i}/p_{i}|}\mathbb{E}_{\hat{G}_{-i}}\left[\int_{0}^{s}f(\hat{G}_{-i}-\hat{G}_{i}+r)\,dr\right]ds.\end{aligned}$$

$$F_{\eta}(z)=F\left(\frac{z}{\eta}\right),\quad f_{\eta}(z)=\frac{1}{\eta}f\left(\frac{z}{\eta}\right),\quad F_{\eta}^{-1}(y)=\eta F^{-1}(y).$$

Taxonomy

Topics: Advanced Bandit Algorithms Research · Optimization and Search Problems · Reinforcement Learning in Robotics

Full text

Beyond the Hazard Rate: More Perturbation Algorithms for Adversarial Multi-armed Bandits

Zifan Li

University of Michigan

Ambuj Tewari

University of Michigan

Abstract

Recent work on follow the perturbed leader (FTPL) algorithms for the adversarial multi-armed bandit problem has highlighted the role of the hazard rate of the distribution generating the perturbations. Assuming that the hazard rate is bounded, it is possible to provide regret analyses for a variety of FTPL algorithms for the multi-armed bandit problem. This paper pushes the inquiry into regret bounds for FTPL algorithms beyond the bounded hazard rate condition. There are good reasons to do so: natural distributions such as the uniform and Gaussian violate the condition. We give regret bounds for both bounded support and unbounded support distributions without assuming the hazard rate condition. We also disprove a conjecture that the Gaussian distribution cannot lead to a low-regret algorithm. In fact, it turns out that it leads to near optimal regret, up to logarithmic factors. A key ingredient in our approach is the introduction of a new notion called the generalized hazard rate.

Keywords: online learning, regret, multi-armed bandits, follow the perturbed leader, gradient based algorithms

1 Introduction

Starting from the seminal work of Hannan (1957) and later developments due to Kalai and Vempala (2005), perturbation based algorithms (called “Follow the Perturbed Leader (FTPL)”) have occupied a central place in online learning. Another major family of online learning algorithms, called “Follow the Regularized Leader (FTRL)”, is based on the idea of regularization. In special cases, such as the exponential weights algorithm for the experts problem, it has been folk knowledge that regularization and perturbation ideas are connected. That is, the exponential weights algorithm can be understood as either using negative entropy regularization or Gumbel distributed perturbations (for example, see the discussion in Abernethy et al. (2014)).

Recent work has begun to further uncover the connections between perturbation and regularization. For example, in online linear optimization, one can understand regularization and perturbation as simply two different ways to smooth a non-smooth potential function. The former corresponds to infimal convolution smoothing and the latter corresponds to stochastic (or integral convolution) smoothing (Abernethy et al., 2014). Having a generic framework for understanding perturbations allows one to study a wide variety of online linear optimization games and a number of interesting perturbations.

FTRL and FTPL algorithms have also been used beyond “full information” settings. “Full information” refers to the fact that the learner observes the entire move of the adversary. The multi-armed bandit problem is one of the most fundamental examples of “partial information” settings. Regret analysis of the multi-armed bandit problem goes back to the work of Robbins (1952) who formulated the stochastic version of the problem. The non-stochastic, or adversarial, version was formulated by Auer et al. (2002), who provided the EXP3 algorithm achieving $O(\sqrt{NT\log N})$ regret in $T$ rounds with $N$ arms. They also showed a lower bound of $\Omega(\sqrt{NT})$ , which was later matched by the Poly-INF algorithm (Audibert and Bubeck, 2009; Audibert et al., 2011). The Poly-INF algorithm can be interpreted as an FTRL algorithm with negative Tsallis entropy regularization (Audibert et al., 2011; Abernethy et al., 2015). For a recent survey of both stochastic and non-stochastic bandit problems, see Bubeck and Cesa-Bianchi (2012).

For the non-stochastic multi-armed bandit problem, Kujala and Elomaa (2005) and Poland (2005) both showed that using the exponential (actually double exponential/Laplace) distribution in an FTPL algorithm coupled with a standard unbiased estimation technique yields near-optimal $O(\sqrt{NT\log N})$ regret. Unbiased estimation needs access to arm probabilities that are not explicitly available when using an FTPL algorithm. Neu and Bartók (2013) introduced the geometric resampling scheme to approximate these probabilities while still guaranteeing low regret. Recently, Abernethy et al. (2015) analyzed FTPL for adversarial multi-armed bandits and provided regret bounds under the condition that the hazard rate of the perturbation distribution is bounded. This condition allowed them to consider a variety of perturbation distributions beyond the exponential, such as Gamma, Gumbel, Frechet, Pareto, and Weibull.

Unfortunately, the bounded hazard rate condition is violated by two of the most widely known distributions: namely the uniform and the Gaussian distributions. (The uniform distribution is also historically significant, as it was used in the original FTPL algorithm of Hannan (1957).) Therefore, the results of Abernethy et al. (2015) say nothing about the regret incurred in an adversarial multi-armed bandit problem when we use these distributions (without forced exploration) to generate perturbations. Contrast this with the full information experts setting, where using these distributions as perturbations yields optimal $\sqrt{T}$ regret and even yields the optimal $\sqrt{\log N}$ dependence on the dimension in the Gaussian case (Abernethy et al., 2014).

The Gaussian distribution has lighter tails than the exponential. The hazard rate of a Gaussian increases linearly on the real line (and is hence unbounded) whereas the exponential has a constant hazard rate. Does having too light a tail make a perturbation inherently bad? The uniform is even worse from a light tail point of view: it has bounded support! In fact, Kujala and Elomaa (2005) had trouble dealing with the uniform distribution and remarked, “we failed to analyze the expert setting when the perturbation distribution was uniform.” Does having a bounded support make a perturbation even worse? Or is it that the hazard rate condition is just a sufficient condition, without being anywhere close to necessary for a good regret bound to exist? The analysis of Abernethy et al. (2015) suggests that perhaps a bounded hazard rate is critical. They even made the following conjecture.

Conjecture 1.

If a distribution $\mathcal{D}$ has a monotonically increasing hazard rate $h_{\mathcal{D}}(x)$ that does not converge as $x\to+\infty$ (e.g., Gaussian), then there is a sequence of gains that causes the corresponding FTPL algorithm to incur at least a linear regret.

The main contribution of this paper is to provide answers to the questions raised above. First, we show that boundedness of the hazard rate is certainly not a requirement for achieving sublinear (in $T$ ) regret. Bounded support distributions, like the uniform, violate the boundedness condition on the hazard rate in the most extreme way. Their hazard rate blows up not just asymptotically at infinity, as in the Gaussian case, but as one approaches the right edge of the support. Yet, we can show (Corollary 3.3) that using the uniform distribution results in a regret bound of $O((NT)^{2/3})$ . This bound is clearly not optimal. But optimality is not the point here. What is surprising, especially if one regards Conjecture 1 as plausible, is that a non-trivial sublinear bound holds at all. In fact, we show (Corollary 3.4) that using any continuous distribution with bounded support and bounded density results in a sublinear regret bound.

Second, moving beyond bounded support distributions to ones with unbounded support, we settle Conjecture 1 in the negative. In Theorem 4.6 we show that, instead of suffering linear regret as predicted by Conjecture 1, a perturbation algorithm using the Gaussian distribution enjoys a near optimal regret bound of $O(\sqrt{NT\log N}\log T)$ . A key ingredient in our approach is a new quantity that we call the generalized hazard rate of a distribution. We show that bounded generalized hazard rate is enough to guarantee sublinear regret in $T$ (Theorem 4.2).

Finally, we investigate the relationship between tail behavior of random perturbations and the regret they induce. We show that heavy tails, along with some fairly mild assumptions, guarantee a bounded hazard rate (Theorem 4.9) and hence previous results can yield regret bounds for these perturbations. However, light-tailed distributions can fail to have a bounded hazard rate. Nevertheless, we show that under reasonable conditions, light-tailed distributions do have a bounded generalized hazard rate (Theorem 4.10). This result allows us to show that reasonably behaved light-tailed distributions lead to near optimal regret (Corollary 4.11). In particular, the exponential power (or generalized normal) family of distributions yields near optimal regret (Theorem 4.13).

2 Follow the Perturbed Leader Algorithm for Bandits

Recall the setting of the adversarial multi-armed bandit problem (Auer et al., 2002). An adversary (or Nature) chooses gain vectors $g_{t}\in[-1,0]^{N}$ for $1\leq t\leq T$ ahead of the game. Such an adversary is called oblivious. At round $t=1,\ldots,T$ in a repeated game, the learner must choose a distribution $p_{t}\in\Delta^{N}$ over the set of $N$ available arms (or actions). The learner plays action $i_{t}$ sampled according to $p_{t}$ and accumulates the gain $g_{t,i_{t}}\in[-1,0]$. The learner observes only $g_{t,i_{t}}$ and receives no information about the values $g_{t,j}$ for $j\neq i_{t}$.

The learner’s goal is to minimize the regret. Regret is defined to be the difference in the realized gains and the gains of the best fixed action in hindsight:

$$\mathrm{Regret}_{T}:=\max_{i\in[N]}\sum_{t=1}^{T}\left(g_{t,i}-g_{t,i_{t}}\right).$$

To be precise, we consider the expected regret, where the expectation is taken with respect to the learner’s randomization. Note that, under an oblivious adversary, the only random variables in the above expression are the actions $i_{t}$ of the learner. For convenience, define the cumulative gain vectors $G_{t},t=1,2,\dots,T$ by

$$G_{t}:=\sum_{s=1}^{t}g_{s}.$$
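The definitions of regret and cumulative gain can be made concrete with a small sketch. The helper names `cumulative_gains` and `regret`, and the toy numbers, are illustrative assumptions and not part of the paper; the code simply evaluates the best fixed arm in hindsight against a given action sequence.

```python
from typing import List, Sequence


def cumulative_gains(gains: Sequence[Sequence[float]]) -> List[float]:
    """Return G_T, the coordinate-wise sum of the gain vectors g_1, ..., g_T."""
    n_arms = len(gains[0])
    return [sum(g_t[i] for g_t in gains) for i in range(n_arms)]


def regret(gains: Sequence[Sequence[float]], actions: Sequence[int]) -> float:
    """Regret_T = max_i sum_t (g_{t,i} - g_{t,i_t})."""
    best_fixed = max(cumulative_gains(gains))  # gain of the best fixed arm in hindsight
    realized = sum(g_t[i_t] for g_t, i_t in zip(gains, actions))
    return best_fixed - realized


# Illustrative instance: T = 3 rounds, N = 2 arms, gains in [-1, 0].
toy_gains = [[-1.0, 0.0], [0.0, -1.0], [-0.5, 0.0]]
toy_actions = [0, 1, 0]  # the learner happens to pick the worse arm every round
print(cumulative_gains(toy_gains), regret(toy_gains, toy_actions))
```

Because the adversary is oblivious, `toy_gains` is fixed before the game starts; only `toy_actions` would be random in an actual run.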

2.1 The Gradient-Based Algorithmic Template

We will consider the algorithmic template described in Framework 1, which is the Gradient Based Prediction Algorithm (GBPA) (see, for example, Abernethy et al. (2015)). Let $\Delta^{N}$ be the $(N-1)$ -dimensional probability simplex in $\mathbb{R}^{N}$ . Denote the standard basis vector along the $i$ th dimension by $\mathbf{e}_{i}$ . At any round $t$ , the action choice $i_{t}$ is made by sampling from the distribution $p_{t}$ which is obtained by applying the gradient of a convex function ${\tilde{\Phi}}$ to the estimate $\hat{G}_{t-1}$ of the cumulative gain vector so far. The choice of ${\tilde{\Phi}}$ is flexible but it must be a differentiable convex function such that its gradient is always in $\Delta^{N}$ .

Note that we do not require the range of $\nabla{\tilde{\Phi}}$ to be contained in the interior of the probability simplex. If we required the gradient to lie in the interior, we would not be able to deal with bounded support distributions such as the uniform distribution. Even though some entries of the probability vector $p_{t}$ might be $0$, the estimation step is always well defined since $p_{t,i_{t}}>0$. But allowing $p_{t,i}$ to be zero means that $\hat{g}_{t}$ is not exactly an unbiased estimator of $g_{t}$. Instead, it is an unbiased estimator on the support of $p_{t}$. That is, $\mathbb{E}[\hat{g}_{t,i}|i_{1:t-1}]=g_{t,i}$ for any $i$ such that $p_{t,i}>0$. Here, $i_{1:t-1}$ is shorthand for $i_{1},\ldots,i_{t-1}$. Therefore, irrespective of whether $p_{t,i}=0$ or not, we always have

$$\mathbb{E}[p_{t,i}\hat{g}_{t,i}\mid i_{1:t-1}]=p_{t,i}\,g_{t,i}.\tag{2}$$

When $p_{t,i}=0$ , we have $\hat{g}_{t,i}=0$ but $g_{t,i}\leq 0$ , which means that $\hat{g}_{t}$ overestimates $g_{t}$ outside the support of $p_{t}$ . Hence, we also have

$$\mathbb{E}[\hat{g}_{t}\mid i_{1:t-1}]\succeq g_{t},\tag{3}$$

where $\succeq$ means element-wise greater than or equal to.
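The template described in prose above (play $p_{t}=\nabla{\tilde{\Phi}}(\hat{G}_{t-1})$, sample $i_{t}$, observe one gain, form the importance-weighted estimate) can be sketched as follows. This is a minimal illustration under stated assumptions rather than a reproduction of Framework 1: the function names are hypothetical, and the log-sum-exp potential behind `softmax_gradient` is chosen only because its gradient has a closed form, not because it is one of the stochastically smoothed potentials studied here.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple


def softmax_gradient(G: Sequence[float], eta: float = 1.0) -> List[float]:
    """Gradient of the illustrative potential eta * log(sum_i exp(G_i / eta))."""
    m = max(G)  # subtract the max for numerical stability
    weights = [math.exp((x - m) / eta) for x in G]
    total = sum(weights)
    return [w / total for w in weights]


def gbpa(
    gains: Sequence[Sequence[float]],
    grad_potential: Callable[[Sequence[float]], List[float]],
    rng: random.Random,
) -> Tuple[List[int], List[float]]:
    """Run the gradient-based loop; return the actions and the final estimate G_hat."""
    n_arms = len(gains[0])
    G_hat = [0.0] * n_arms
    actions = []
    for g_t in gains:
        p_t = grad_potential(G_hat)                       # p_t = gradient at G_hat_{t-1}
        i_t = rng.choices(range(n_arms), weights=p_t)[0]  # sample i_t ~ p_t
        observed = g_t[i_t]                               # only this entry is revealed
        G_hat[i_t] += observed / p_t[i_t]                 # importance-weighted estimate
        actions.append(i_t)
    return actions, G_hat


rng = random.Random(0)
toy_gains = [[-1.0, 0.0, -0.5] for _ in range(200)]  # arm 1 is best in every round
actions, G_hat = gbpa(toy_gains, lambda G: softmax_gradient(G, eta=10.0), rng)
print(actions.count(1), G_hat)
```

An arm with $p_{t,i}=0$ is never sampled, so the division by `p_t[i_t]` is always well defined, which mirrors the remark about the estimation step above.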

We now present a basic result bounding the expected regret of GBPA in the multi-armed bandit setting. It is basically just a simple modification of the arguments in Abernethy et al. (2015) to deal with the possibility that $p_{t,i}=0$ . We state and prove this result here for completeness without making any claim of novelty.

Lemma 2.1.

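(Decomposition of the Expected Regret) Define the non-smooth potential $\Phi(G)=\max_{i}G_{i}$. The expected regret of GBPA $({\tilde{\Phi}})$ can be written as

$$\mathbb{E}\,\mathrm{Regret}_{T}=\Phi(G_{T})-\mathbb{E}\left[\sum_{t=1}^{T}\langle p_{t},g_{t}\rangle\right].\tag{4}$$

Furthermore, the expected regret of GBPA $({\tilde{\Phi}})$ can be bounded by the sum of an overestimation, an underestimation, and a divergence penalty:

$$\mathbb{E}\,\mathrm{Regret}_{T}\leq\underbrace{\tilde{\Phi}(0)}_{\text{overestimation penalty}}+\mathbb{E}\underbrace{\left[\Phi(\hat{G}_{T})-\tilde{\Phi}(\hat{G}_{T})\right]}_{\text{underestimation penalty}}+\mathbb{E}\sum_{t=1}^{T}\underbrace{\mathbb{E}\left[D_{\tilde{\Phi}}(\hat{G}_{t},\hat{G}_{t-1})\mid i_{1:t-1}\right]}_{\text{divergence penalty}},$$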

where the expectations are over the sampling of $i_{t}$ and $D_{{\tilde{\Phi}}}$ is the Bregman divergence induced by ${\tilde{\Phi}}$ .

Proof.

First, note that the regret, by definition, is

$$\mathrm{Regret}_{T}=\Phi(G_{T})-\sum_{t=1}^{T}\langle\mathbf{e}_{i_{t}},g_{t}\rangle.$$

Under an oblivious adversary, only the summation on the right hand side is random. Moreover $\mathbb{E}[\left\langle\mathbf{e}_{i_{t}},g_{t}\right\rangle|i_{1:t-1}]=\left\langle p_{t},g_{t}\right\rangle$ . This proves the claim in (4).

From (2), we know that $\mathbb{E}[\left\langle p_{t},\hat{g}_{t}\right\rangle|i_{1:t-1}]=\left\langle p_{t},g_{t}\right\rangle$ even if some entries in $p_{t}$ might be zero. Therefore, we have

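$$\mathbb{E}\,\mathrm{Regret}_{T}=\Phi(G_{T})-\mathbb{E}\left[\sum_{t=1}^{T}\langle p_{t},\hat{g}_{t}\rangle\right].\tag{6}$$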

From (3), we know that $G_{T}\preceq\mathbb{E}[\hat{G}_{T}]$ element-wise. This implies

$$\Phi(G_{T})\leq\Phi(\mathbb{E}[\hat{G}_{T}])\leq\mathbb{E}[\Phi(\hat{G}_{T})],\tag{7}$$

where the first inequality is because $G\succeq G^{\prime}\Rightarrow\Phi(G)\geq\Phi(G^{\prime})$ , and the second inequality is due to the convexity of $\Phi$ . Plugging (7) into (6) yields

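$$\mathbb{E}\,\mathrm{Regret}_{T}\leq\mathbb{E}\left[\Phi(\hat{G}_{T})-\sum_{t=1}^{T}\langle p_{t},\hat{g}_{t}\rangle\right].\tag{8}$$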

Now, recalling the definition of Bregman divergence

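$$D_{\tilde{\Phi}}(\hat{G}_{t},\hat{G}_{t-1})=\tilde{\Phi}(\hat{G}_{t})-\tilde{\Phi}(\hat{G}_{t-1})-\langle\nabla\tilde{\Phi}(\hat{G}_{t-1}),\hat{G}_{t}-\hat{G}_{t-1}\rangle,\tag{9}$$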

we can write,

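$$\begin{aligned}-\sum_{t=1}^{T}\langle p_{t},\hat{g}_{t}\rangle&=-\sum_{t=1}^{T}\langle\nabla\tilde{\Phi}(\hat{G}_{t-1}),\hat{G}_{t}-\hat{G}_{t-1}\rangle\\&=\sum_{t=1}^{T}\left(D_{\tilde{\Phi}}(\hat{G}_{t},\hat{G}_{t-1})+\tilde{\Phi}(\hat{G}_{t-1})-\tilde{\Phi}(\hat{G}_{t})\right)\\&=\tilde{\Phi}(\hat{G}_{0})-\tilde{\Phi}(\hat{G}_{T})+\sum_{t=1}^{T}D_{\tilde{\Phi}}(\hat{G}_{t},\hat{G}_{t-1}).\end{aligned}\tag{10}$$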

The proof ends by plugging (10) into (8) and noting that ${\tilde{\Phi}}(\hat{G}_{0})={\tilde{\Phi}}(0)$ is not random. ∎

2.2 Stochastic Smoothing of Potential Function

Let $\mathcal{D}$ be a continuous distribution with finite expectation, probability density function $f$ , and cumulative distribution function $F$ . Consider GBPA with potential function of the form:

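$$\tilde{\Phi}(G;\mathcal{D})=\mathbb{E}_{Z_{1},\dots,Z_{N}\overset{\text{i.i.d.}}{\sim}\mathcal{D}}\,\Phi(G+Z),\tag{11}$$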

which is a stochastic smoothing of the non-smooth function $\Phi(G)=\max_{i}G_{i}$ . Note that $Z=(Z_{1},\ldots,Z_{N})\in\mathbb{R}^{N}$ . We will often hide the dependence on the distribution $\mathcal{D}$ if the distribution is obvious from the context or when the dependence on $\mathcal{D}$ is not of importance in the argument. Since $\Phi$ is convex, ${\tilde{\Phi}}$ is also convex. For stochastic smoothing, we have the following result to control the underestimation and overestimation penalty.

Lemma 2.2.

For any $G$ , we have

$$\Phi(G)+\mathbb{E}[Z_{1}]\leq\tilde{\Phi}(G)\leq\Phi(G)+EMAX(N)$$

where $EMAX(N)$ is any function such that

$$\mathbb{E}_{Z_{1},\dots,Z_{N}}\left[\max_{i}Z_{i}\right]\leq EMAX(N).$$

In particular, this implies that the overestimation penalty ${\tilde{\Phi}}(0)$ is upper bounded by $\Phi(0)+EMAX(N)=EMAX(N)$ and the underestimation penalty $\Phi(\hat{G}_{T})-{\tilde{\Phi}}(\hat{G}_{T})$ is upper bounded by $-\mathbb{E}[Z_{1}]$ .

Proof.

We have,

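$$\begin{aligned}\Phi(G)+\mathbb{E}[Z_{1}]&\leq\mathbb{E}\left[\max_{i}(G_{i}+Z_{i})\right]=\tilde{\Phi}(G)\\&\leq\mathbb{E}\left[\max_{i}G_{i}+\max_{i}Z_{i}\right]=\max_{i}G_{i}+\mathbb{E}\left[\max_{i}Z_{i}\right]=\Phi(G)+\mathbb{E}\left[\max_{i}Z_{i}\right].\end{aligned}$$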

Noting that $\mathbb{E}[\max_{i}Z_{i}]\leq EMAX(N)$ finishes the proof. ∎
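The sandwich in Lemma 2.2 can be checked numerically for a concrete case. The sketch below assumes Uniform$[0,1]$ perturbations, for which $\mathbb{E}[Z_{1}]=1/2$ and $\mathbb{E}[\max_{i}Z_{i}]=N/(N+1)$, so $EMAX(N)=N/(N+1)$ is a valid choice; the function name `smoothed_potential_mc` and the vector `G` are illustrative assumptions.

```python
import random
from typing import Sequence


def smoothed_potential_mc(G: Sequence[float], n_samples: int, rng: random.Random) -> float:
    """Monte Carlo estimate of E[max_i (G_i + Z_i)] with Z_i ~ Uniform[0, 1] i.i.d."""
    total = 0.0
    for _ in range(n_samples):
        total += max(g + rng.random() for g in G)
    return total / n_samples


G = [0.0, -0.3, -0.6]         # illustrative cumulative gain vector
N = len(G)
lower = max(G) + 0.5          # Phi(G) + E[Z_1]
upper = max(G) + N / (N + 1)  # Phi(G) + E[max_i Z_i] for N i.i.d. uniforms
estimate = smoothed_potential_mc(G, 20000, random.Random(0))
print(lower, estimate, upper)
```

The estimate should land strictly between the two bounds, since the arms are neither tied nor so far apart that only one arm can ever be the perturbed leader.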

Observe that $\Phi(G+Z)$ as a function of $G$ is differentiable with probability $1$ (under the randomness of the $Z_{i}$ ’s) due to the fact that $Z_{i}$ ’s are random variables with a density. By Proposition 2.3 of Bertsekas (1973), we can swap the order of differentiation and expectation:

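$$\nabla\tilde{\Phi}(G;\mathcal{D})=\mathbb{E}_{Z_{1},\dots,Z_{N}\overset{\text{i.i.d.}}{\sim}\mathcal{D}}\,\mathbf{e}_{i^{*}},\quad\text{where }i^{*}=\arg\max_{i=1,\dots,N}\{G_{i}+Z_{i}\}.\tag{13}$$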

Note that, for any $G$ , the random index $i^{*}$ is unique with probability $1$ . Hence, ties between arms can be resolved arbitrarily. It is clear from above that $\nabla{\tilde{\Phi}}$ , being an expectation of vectors in the probability simplex, is in the probability simplex. Thus, it is a valid potential to be used in Framework 1. Now we derive an identity to write the gradient of the smoothed potential function in terms of the expectation of the cumulative distribution function,

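$$\nabla_{i}\tilde{\Phi}(G)=\frac{\partial\tilde{\Phi}}{\partial G_{i}}=\mathbb{E}_{\tilde{G}_{-i}}\left[\mathbb{P}_{Z_{i}}\left[Z_{i}>\tilde{G}_{-i}-G_{i}\right]\right]=\mathbb{E}_{\tilde{G}_{-i}}\left[1-F(\tilde{G}_{-i}-G_{i})\right],\tag{14}$$

where $\tilde{G}_{-i}=\max_{j\neq i}(G_{j}+Z_{j})$. If $\mathcal{D}$ has unbounded support then this partial derivative is non-zero for all $i$ given any $G$. However, it can be zero if $\mathcal{D}$ has bounded support. Similarly, we have the following useful identity that writes the diagonal of the Hessian of the smoothed potential function in terms of the expectation of the probability density function:

$$\nabla_{ii}^{2}\tilde{\Phi}(G)=\mathbb{E}_{\tilde{G}_{-i}}\left[\frac{\partial}{\partial G_{i}}\left(1-F(\tilde{G}_{-i}-G_{i})\right)\right]=\mathbb{E}_{\tilde{G}_{-i}}f(\tilde{G}_{-i}-G_{i}).\tag{15}$$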
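The gradient identity can be sanity-checked by estimating each coordinate in two ways: as the frequency with which arm $i$ is the perturbed leader, and as the average of the survival function $1-F$ evaluated at the gap to the best rival. The sketch below assumes rate-1 exponential perturbations purely for illustration; the function names are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple


def survival_exponential(z: float) -> float:
    """1 - F(z) for a rate-1 exponential perturbation."""
    return math.exp(-z) if z > 0 else 1.0


def gradient_two_ways(
    G: Sequence[float], n_samples: int, rng: random.Random
) -> Tuple[List[float], List[float]]:
    """Estimate the gradient as argmax frequencies and as averaged survival values."""
    n = len(G)
    argmax_freq = [0.0] * n
    survival_avg = [0.0] * n
    for _ in range(n_samples):
        perturbed = [g + rng.expovariate(1.0) for g in G]
        leader = max(range(n), key=perturbed.__getitem__)
        argmax_freq[leader] += 1.0 / n_samples
        for i in range(n):
            rival = max(perturbed[j] for j in range(n) if j != i)  # best perturbed rival
            survival_avg[i] += survival_exponential(rival - G[i]) / n_samples
    return argmax_freq, survival_avg


freq, surv = gradient_two_ways([0.0, -0.5, -1.0], 20000, random.Random(0))
print(freq)
print(surv)
```

Both vectors estimate the same point in the probability simplex; the second averages out the randomness of $Z_{i}$ analytically, which is exactly the step that rewrites the probability as $1-F$.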

2.3 Connection to Follow the Perturbed Leader

The sampling step of Framework 1 with a stochastically smoothed $\Phi$ as the potential ${\tilde{\Phi}}$ (Equation 11) can be done efficiently. Instead of evaluating the expectation (Equation 13), we just take a random sample. Doing so gives us an equivalent of Follow the Perturbed Leader Algorithm (FTPL) (Kalai and Vempala, 2005) applied to the bandit setting. On the other hand, the estimation step is hard because generally there is no closed-form expression for $\nabla{\tilde{\Phi}}$ .

To address this issue, Neu and Bartók (2013) proposed Geometric Resampling (GR), an iterative resampling process to estimate $\nabla{\tilde{\Phi}}$ (with bias). They showed that stopping GR after $M$ iterations introduces an estimation bias that adds at most $\frac{NT}{eM}$ to the regret. That is, all GBPA regret bounds that we prove will hold for the corresponding FTPL algorithm that does $M$ iterations of GR at every time step, with an extra additive $\frac{NT}{eM}$ term. This extra term does not affect the regret rate as long as $M=\sqrt{NT}$, because the lower bound for any adversarial multi-armed bandit algorithm is of the order $\sqrt{NT}$.
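A minimal sketch of the resampling idea, under our reading of the scheme: redraw fresh perturbations until the perturbed leader coincides with the arm $i_{t}$ that was played, stop after at most $M$ draws, and use the number of draws $K$ as a stand-in for $1/p_{t,i_{t}}$. The function names and the exponential perturbation are illustrative assumptions, not the authors' implementation.

```python
import random
from typing import Callable, Sequence


def perturbed_leader(
    G_hat: Sequence[float],
    sample_perturbation: Callable[[random.Random], float],
    rng: random.Random,
) -> int:
    """Index of the arm maximizing G_hat_i + Z_i for one fresh draw of Z."""
    scores = [g + sample_perturbation(rng) for g in G_hat]
    return max(range(len(scores)), key=scores.__getitem__)


def geometric_resampling(
    G_hat: Sequence[float],
    i_t: int,
    M: int,
    sample_perturbation: Callable[[random.Random], float],
    rng: random.Random,
) -> int:
    """Number of fresh draws until the perturbed leader equals i_t, capped at M."""
    for k in range(1, M + 1):
        if perturbed_leader(G_hat, sample_perturbation, rng) == i_t:
            return k
    return M


def exponential_perturbation(rng: random.Random) -> float:
    return rng.expovariate(1.0)


rng = random.Random(0)
draws = [
    geometric_resampling([0.0, 0.0], 0, 50, exponential_perturbation, rng)
    for _ in range(5000)
]
print(sum(draws) / len(draws))  # estimates 1 / p_{t,0}; here p_{t,0} = 1/2 by symmetry
```

Without the cap, $K$ would be geometric with mean exactly $1/p_{t,i_{t}}$; the cap at $M$ is what introduces the bias accounted for by the additive $\frac{NT}{eM}$ term.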

2.4 The Role of the Hazard Rate and Its Limitation

In previous work, Abernethy et al. (2015) proved that for a continuous random variable $Z$ with finite and nonnegative expectation and support on the whole real line $\mathbb{R}$, if the hazard rate of the random variable is bounded, i.e.,

$$\sup_{z}\frac{f(z)}{1-F(z)}<\infty,$$

then the expected regret of GBPA can be upper bounded as

$$\mathbb{E}\,\mathrm{Regret}_{T}=O\Big(\sqrt{NT\times EMAX(N)}\Big).$$

Common families of distributions whose regret can be controlled in this way include Gumbel, Frechet, Weibull, Pareto, and Gamma (see Abernethy et al. (2015) for details). However, there are many other families of distributions where the hazard rate condition fails. For example, if the random variable has a bounded support, then the hazard rate would certainly explode at the end of the support. This is, in some sense, an extreme case of violation because the random variable does not even have a tail. There are also some random variables that do have support on $\mathbb{R}$ but have unbounded hazard rate, e.g. Gaussian, where the hazard rate monotonically increases to infinity. How can we perform analyses of the expected regret of GBPA using those random variables as perturbations? To address these issues, we need to go beyond the hazard rate.
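The contrast between these cases can be seen by evaluating the hazard rate $f(z)/(1-F(z))$ directly. The sketch below uses the standard closed forms for the rate-1 exponential, the standard Gaussian, and the Uniform$[0,1]$ distribution; the function names are illustrative.

```python
import math


def hazard_exponential(z: float, rate: float = 1.0) -> float:
    """Hazard rate of Exponential(rate) at z >= 0: constant in z."""
    return rate


def hazard_gaussian(z: float) -> float:
    """Hazard rate f(z) / (1 - F(z)) of the standard Gaussian."""
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    survival = 0.5 * math.erfc(z / math.sqrt(2.0))
    return pdf / survival


def hazard_uniform(z: float) -> float:
    """Hazard rate of Uniform[0, 1] at 0 <= z < 1: 1 / (1 - z)."""
    return 1.0 / (1.0 - z)


for z in (0.5, 2.0, 4.0, 8.0):
    print(z, hazard_exponential(z), hazard_gaussian(z))
print([hazard_uniform(z) for z in (0.5, 0.9, 0.99)])
```

The exponential column stays at 1, the Gaussian column keeps growing with $z$, and the uniform values blow up as $z$ approaches the right edge of the support.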

3 Perturbations with Bounded Support

In this section, we prove that GBPA with any continuous distribution that has bounded support and bounded density enjoys sublinear expected regret. From Lemma 2.1 we see that the expected regret can be upper bounded by the sum of three terms. The overestimation penalty can be bounded very easily via Lemma 2.2 for a distribution with bounded support. The underestimation penalty is non-positive as long as the distribution has non-negative expectation. The only term that needs to be controlled with some effort is the divergence penalty.

We first present a general lemma that allows us to write the divergence penalty for a stochastically smoothed potential ${\tilde{\Phi}}$ as a sum involving certain double integrals.

Lemma 3.1.

When using a stochastically smoothed potential as in (11), the divergence penalty can be written as

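$$\mathbb{E}\left[D_{\tilde{\Phi}}(\hat{G}_{t},\hat{G}_{t-1})\mid i_{1:t-1}\right]=\sum_{i\in\mathrm{supp}(p_{t})}p_{t,i}\int_{0}^{|g_{t,i}/p_{t,i}|}\mathbb{E}_{\hat{G}_{-i}}\left[\int_{0}^{s}f(\hat{G}_{-i}-\hat{G}_{t-1,i}+r)\,dr\right]ds,$$

where $p_{t}=\nabla{\tilde{\Phi}}(\hat{G}_{t-1})$, $\hat{G}_{-i}=\max_{j\neq i}(\hat{G}_{t-1,j}+Z_{j})$ and $\text{supp}(p_{t})=\{i\>:\>p_{t,i}>0\}$.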

Proof.

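To reduce clutter, we drop the time subscripts: we use $\hat{G}$ to denote the cumulative estimate $\hat{G}_{t-1}$, $\hat{g}$ to denote the marginal estimate $\hat{g}_{t}=\hat{G}_{t}-\hat{G}_{t-1}$, $p$ to denote $p_{t}$, and $g$ to denote the true gain $g_{t}$. Note that by definition of Framework 1, $\hat{g}$ is a sparse vector with one non-zero and non-positive coordinate $\hat{g}_{i_{t}}=g_{i_{t}}/p_{i_{t}}=-\left|g_{i_{t}}/p_{i_{t}}\right|$. Moreover, conditioned on $i_{1:t-1}$, $i_{t}$ takes value $i$ with probability $p_{i}$. For any $i\in\text{supp}(p)$, let

$$h_{i}(r)=D_{\tilde{\Phi}}(\hat{G}-r\mathbf{e}_{i},\hat{G}),$$

so that $h_{i}^{\prime}(r)=-\nabla_{i}{\tilde{\Phi}}\left(\hat{G}-r\mathbf{e}_{i}\right)+\nabla_{i}{\tilde{\Phi}}\left(\hat{G}\right)$ and $h_{i}^{\prime\prime}(r)=\nabla^{2}_{ii}{\tilde{\Phi}}\left(\hat{G}-r\mathbf{e}_{i}\right)$. Now we write:

$$\begin{aligned}\mathbb{E}\left[D_{\tilde{\Phi}}(\hat{G}+\hat{g},\hat{G})\mid i_{1:t-1}\right]&=\sum_{i\in\mathrm{supp}(p)}p_{i}\,D_{\tilde{\Phi}}\left(\hat{G}+\tfrac{g_{i}}{p_{i}}\mathbf{e}_{i},\hat{G}\right)=\sum_{i\in\mathrm{supp}(p)}p_{i}\,D_{\tilde{\Phi}}\left(\hat{G}-\left|\tfrac{g_{i}}{p_{i}}\right|\mathbf{e}_{i},\hat{G}\right)\\&=\sum_{i\in\mathrm{supp}(p)}p_{i}\,h_{i}(|g_{i}/p_{i}|)=\sum_{i\in\mathrm{supp}(p)}p_{i}\int_{0}^{|g_{i}/p_{i}|}\int_{0}^{s}h_{i}^{\prime\prime}(r)\,dr\,ds\\&=\sum_{i\in\mathrm{supp}(p)}p_{i}\int_{0}^{|g_{i}/p_{i}|}\int_{0}^{s}\nabla_{ii}^{2}\tilde{\Phi}(\hat{G}-r\mathbf{e}_{i})\,dr\,ds\\&=\sum_{i\in\mathrm{supp}(p)}p_{i}\int_{0}^{|g_{i}/p_{i}|}\int_{0}^{s}\mathbb{E}_{\hat{G}_{-i}}f(\hat{G}_{-i}-\hat{G}_{i}+r)\,dr\,ds\\&=\sum_{i\in\mathrm{supp}(p)}p_{i}\int_{0}^{|g_{i}/p_{i}|}\mathbb{E}_{\hat{G}_{-i}}\left[\int_{0}^{s}f(\hat{G}_{-i}-\hat{G}_{i}+r)\,dr\right]ds.\end{aligned}$$

The second equality on the first line implicitly used the assumption that $g_{i}\leq 0$, i.e., the “gains” are non-positive. The second equality on the second line used that $h_{i}(0)=0$ and $h_{i}^{\prime}(0)=0$, and the equality on the fourth line used Equation (15). ∎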

Note that each summand in the divergence penalty expression above involves an integral of the density function of the distribution $\mathcal{D}$ over an interval. The main idea to control the divergence penalty for a bounded support distribution is to truncate the interval at the end of the support. For points that are close to the end of the support, we bound the integral by the product of the bound on the density and the interval length. For points that are far from the end of the support, we bound the integral through the hazard rate as was done by Abernethy et al. (2015).

For a general continuous random variable $Z$ with bounded density and bounded support, we first shift it (which obviously does not change the distribution of the random action choice $i_{t}$ and hence the expected regret) and scale it so that the support is a subset of $[0,1]$ with $\sup\{z:F(z)=0\}=0$ and $\inf\{z:F(z)=1\}=1$, where $F$ denotes the CDF of $Z$. A benefit of this normalization is that the expectation of the random variable becomes non-negative, so the underestimation penalty is guaranteed to be non-positive. After scaling, we assume that the bound on the density is $L$. We consider the perturbation $\eta Z$ where $\eta>0$ is a tuning parameter. Write $F_{\eta}(x)$ and $f_{\eta}(x)$ to denote the CDF and PDF of the scaled random variable $\eta Z$ respectively. If $F$ is strictly increasing, we know that $F^{-1}$ exists. If not, define $F^{-1}(y)=\inf\{z:F(z)=y\}$. Elementary calculation gives the following useful facts:

$$F_{\eta}(z)=F\left(\frac{z}{\eta}\right),\quad f_{\eta}(z)=\frac{1}{\eta}f\left(\frac{z}{\eta}\right),\quad F_{\eta}^{-1}(y)=\eta F^{-1}(y).$$

Theorem 3.2.

(Divergence Penalty Control, Bounded Support) The divergence penalty in the GBPA regret bound using the scaled perturbation $\eta Z$, where $Z$ is drawn from a bounded support distribution satisfying the conditions above, can be upper bounded, for any $\epsilon>0$, by

[TABLE]

Proof.

From Lemma 3.1, we have, with $\hat{G}_{-i}=\max_{j\neq i}(\hat{G}_{t-1,j}+\eta Z_{j})$,

[TABLE]

We bound the two integrals above differently. For the first integral, we add the restriction $f_{\eta}(z)>0$ by intersecting the integral interval with the support of the function $f_{\eta}(z)$, denoted as $I_{f_{\eta}(z)}$, so that $1-F_{\eta}(z)$ is not $0$ on the interval to be integrated. Thus, we get,

[TABLE]

The first inequality holds because $f_{\eta}(z)\leq L/\eta$ and $(1-F_{\eta}(z))\geq\epsilon$ on the set of $z$ ’s over which we are integrating. The second inequality holds because on the set under consideration $1-F_{\eta}(z)\leq 1-F_{\eta}(\hat{G}_{-i}-\hat{G}_{t-1,i})$ and the measure of the set is at most $s$ .

For the second integral, we use the bound $f_{\eta}(z)\leq L/\eta$ again to get,

[TABLE]

Plugging (18) and (19) into (17), we can bound the divergence penalty by,

[TABLE]

The second to last inequality holds because $|g_{t,i}|\leq 1$ and the last inequality holds because the sum over $i$ is at most over all $N$ arms. ∎

The regret bound for the uniform distribution is now an easy corollary.

Corollary 3.3.

(Regret Bound for Uniform) For GBPA run with a stochastically smoothed potential using an appropriately scaled $[0,1]$ uniform perturbation where $\eta=(NT)^{2/3}$, the expected regret can be upper bounded by $3(NT)^{2/3}$.

Proof.

For $[0,1]$ uniform distribution, we have $L=1$ , $F^{-1}(1-\epsilon)=1-\epsilon$ so the divergence penalty is upper bounded by

[TABLE]

If we let $\epsilon=\frac{1}{\sqrt{2\eta}}$ , we can see that the divergence penalty is upper bounded by $NT\sqrt{\frac{2}{\eta}}$ . Together with the overestimation penalty which is trivially bounded by $\eta$ and a non-positive underestimation penalty, we see that the final regret bound is

[TABLE]

Setting $\eta=(NT)^{2/3}$ gives the desired result. ∎
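As a quick check of this last step, here is a sketch that assumes the final bound is the sum of the two terms stated in the proof, namely the overestimation bound $\eta$ and the summed divergence bound $NT\sqrt{2/\eta}$. Substituting $\eta=(NT)^{2/3}$ gives

$$\eta+NT\sqrt{\frac{2}{\eta}}=(NT)^{2/3}+\sqrt{2}\,NT\,(NT)^{-1/3}=(1+\sqrt{2})\,(NT)^{2/3}\leq 3\,(NT)^{2/3}.$$

This choice of $\eta$ balances the two terms up to a constant factor, which is why both end up of order $(NT)^{2/3}$.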

For a general perturbation with bounded support and bounded density, the rate at which $1-F^{-1}(1-\epsilon)$ goes to $0$ as $\epsilon\to 0$ can vary, but we can always guarantee sublinear expected regret.

Corollary 3.4.

(Asymptotic Regret Bound for Bounded Support) For stochastically smoothed GBPA using a general continuous random variable $\eta Z$, where $Z$ has bounded density and bounded support contained in $[0,1]$ and $\eta=(NT)^{2/3}$, the expected regret grows sublinearly, i.e.,

[TABLE]

Proof.

For a general distribution, let $\epsilon=\frac{1}{\sqrt{\eta}}$ . Since the overestimation penalty is trivially bounded by $\eta$ and the underestimation penalty is non-positive, the expected regret can be upper bounded by

[TABLE]

Setting $\eta=(NT)^{2/3}$ we see that the expected regret can be upper bounded by

[TABLE]

Since

[TABLE]

we conclude that

[TABLE]

∎

4 Perturbations with Unbounded Support

Unlike perturbations with bounded support, perturbations with unbounded support (on the right) do have non-zero right tail probabilities, ensuring that $p_{t,i}>0$ always. However, the tail behavior may be such that the hazard rate is unbounded. Still, under mild assumptions, perturbations with unbounded support (on the right) can also be shown to have near optimal expected regret in $T$ , using the notion of generalized hazard rate that we now introduce.
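The contrast between bounded and unbounded support can be illustrated with a small Monte Carlo sketch. It estimates the probability that each arm attains the maximum of the perturbed cumulative estimates, $\arg\max_{i}(\hat{G}_{i}+\eta Z_{i})$, for a uniform and a Gaussian perturbation. The helper `ftpl_probabilities`, the two-arm vector `G_hat`, and the sample size are hypothetical choices made for illustration; they are not taken from the paper.

```python
import random

def ftpl_probabilities(G_hat, eta, sample, n_draws=20_000, seed=0):
    """Monte Carlo estimate of p_i = P(argmax_j (G_hat[j] + eta * Z_j) = i)."""
    rng = random.Random(seed)
    counts = [0] * len(G_hat)
    for _ in range(n_draws):
        perturbed = [g + eta * sample(rng) for g in G_hat]
        counts[perturbed.index(max(perturbed))] += 1
    return [c / n_draws for c in counts]

# Hypothetical cumulative estimates: arm 1 trails arm 0 by 2 * eta.
G_hat = [0.0, -2.0]
eta = 1.0

p_uniform = ftpl_probabilities(G_hat, eta, lambda rng: rng.random())
p_gauss = ftpl_probabilities(G_hat, eta, lambda rng: rng.gauss(0.0, 1.0))

print("uniform :", p_uniform)  # arm 1 can never win: the gap exceeds the support width
print("gaussian:", p_gauss)    # arm 1 keeps a small positive probability
```

With the $[0,1]$ uniform perturbation, the trailing arm is selected with probability exactly zero once the gap exceeds $\eta$, whereas the Gaussian tail always leaves it some probability mass.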

4.1 Generalized Hazard Rate

We already know how to control the underestimation and overestimation penalties via Lemma 2.2. So our main focus will be to control the divergence penalty. Towards this end, we define the generalized hazard rate for a continuous random variable $Z$ with support unbounded on the right, parameterized by $\alpha\in[0,1)$ , as

$$h_{\alpha}(z):=\frac{f(z)\,|z|^{\alpha}}{(1-F(z))^{1-\alpha}},$$

where $f(z)$ and $F(z)$ denote the PDF and CDF of $Z$, respectively. Note that by setting $\alpha=0$ we recover the standard hazard rate.
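A short worked example may help fix ideas. It assumes that the generalized hazard rate has the form $h_{\alpha}(z)=f(z)|z|^{\alpha}/(1-F(z))^{1-\alpha}$, which is consistent with the remark that $\alpha=0$ recovers the standard hazard rate and with the quantity $f(z)/\overline{F}(z)^{1-\delta}$ used in Section 4.3. The rate-one exponential distribution is our choice of illustration, not an example from the paper.

```latex
% Exp(1): f(z) = e^{-z}, \; 1 - F(z) = e^{-z} \text{ for } z \ge 0
h_{\alpha}(z)
  = \frac{e^{-z}\, z^{\alpha}}{\left(e^{-z}\right)^{1-\alpha}}
  = z^{\alpha} e^{-\alpha z},
\qquad
\frac{d}{dz}\, z^{\alpha} e^{-\alpha z}
  = \alpha\, z^{\alpha-1} e^{-\alpha z}\,(1 - z),
\qquad
\sup_{z \ge 0} h_{\alpha}(z) = h_{\alpha}(1) = e^{-\alpha} \le 1 .
```

For $\alpha=0$ the expression reduces to the constant hazard rate $h_{0}(z)=1$ of the exponential distribution.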

One of the main results of this paper is the following. Note that it includes the result (Lemma 4.3) of Abernethy et al. (2015) as a special case.

Theorem 4.1.

(Divergence Penalty Control via Generalized Hazard Rate) Let $\alpha\in[0,1)$. Suppose we have $h_{\alpha}(z)\leq C$ for all $z\in\mathbb{R}$. Then,

[TABLE]

Proof.

Because of the unbounded support of $Z$ , $\text{supp}(p_{t})=\{1,\ldots,N\}$ . Lemma 3.1 gives us:

[TABLE]

Since the function $|z|^{-\alpha}$ is symmetric in $z$ and monotonically decreasing in $|z|$, we have

[TABLE]

Also, note that $z^{1-\alpha}$ is a concave function in $z$ . Hence, by Jensen’s inequality,

[TABLE]

Therefore,

[TABLE]

∎

A regret bound now easily follows.

Theorem 4.2.

(Regret Bound via Generalized Hazard Rate) Suppose we use a stochastically smoothed GBPA with perturbation $\eta Z$, where the generalized hazard rate of $Z$ is bounded, $h_{\alpha}(x)\leq C$ for all $x\in\mathbb{R}$ and some $\alpha\in[0,1)$, and

$$\mathbb{E}_{Z_{1},\dots,Z_{N}}\Big[\max_{i}Z_{i}\Big]\leq Q(N),$$

where $Q(N)$ is some function of $N$. Then, if we set $\eta=(\frac{2CNT}{(1-\alpha)Q(N)})^{1/(2-\alpha)}$, the expected regret of GBPA is no greater than

[TABLE]

In particular, this implies that the algorithm has sublinear expected regret.

Proof.

The divergence penalty can be controlled through Theorem 4.1 once the generalized hazard rate is bounded. It remains to control the overestimation and underestimation penalties. By Lemma 2.2, they are at most $\mathbb{E}_{Z_{1},\dots,Z_{n}}[\max\limits_{i}Z_{i}]$ and $-\mathbb{E}[Z_{1}]$, respectively. Suppose we scale the perturbation $Z$ by $\eta>0$, i.e., we add $\eta Z_{i}$ to each coordinate. It is easy to see that $\mathbb{E}[\max_{i=1,\ldots,n}\eta Z_{i}]=\eta\mathbb{E}[\max_{i=1,\ldots,n}Z_{i}]$ and $\mathbb{E}[\eta Z_{1}]=\eta\mathbb{E}[Z_{1}]$. For the divergence penalty, observe that $F_{\eta}(t)=F(t/\eta)$ and thus $f_{\eta}(t)=\frac{1}{\eta}f(t/\eta)$. Since $|t|^{\alpha}=\eta^{\alpha}|t/\eta|^{\alpha}$, the bound on the generalized hazard rate for the perturbation $\eta Z$ is $\eta^{\alpha-1}C$. Plugging the new bounds for the scaled perturbations into Lemma 2.1 gives us

[TABLE]

Setting $\eta=(\frac{2CNT}{(1-\alpha)Q(N)})^{1/(2-\alpha)}$ finishes the proof. ∎
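The tuning step can be made explicit. The sketch below assumes, consistently with the stated choice of $\eta$, that the bound obtained from Lemma 2.1 has the form $A\eta^{\alpha-1}+\eta Q(N)$ with $A=\frac{2CNT}{1-\alpha}$; the shorthand $A$ and the name $B(\eta)$ are ours.

```latex
\begin{aligned}
B(\eta) &= A\,\eta^{\alpha-1} + \eta\,Q(N), \qquad A = \frac{2CNT}{1-\alpha},\\
\eta^{\star} &= \Big(\frac{A}{Q(N)}\Big)^{1/(2-\alpha)}
\;\Longrightarrow\;
A\,(\eta^{\star})^{\alpha-1} = \eta^{\star}\,Q(N)
  = A^{1/(2-\alpha)}\,Q(N)^{(1-\alpha)/(2-\alpha)},\\
B(\eta^{\star}) &= 2\,A^{1/(2-\alpha)}\,Q(N)^{(1-\alpha)/(2-\alpha)} .
\end{aligned}
```

Under this assumption the stated $\eta$ equalizes the two terms rather than minimizing their sum exactly, and the resulting dependence on $T$ is $T^{1/(2-\alpha)}$, which is sublinear because $\alpha<1$.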

4.2 Gaussian Perturbation

In this section we prove that GBPA with the standard Gaussian perturbation incurs a near optimal expected regret in both $N$ and $T$. Let $F(z)$ and $f(z)$ denote the CDF and PDF of the standard Gaussian distribution.

Lemma 4.3 (Baricz (2008)).

For a standard Gaussian random variable, we have

[TABLE]

This lemma, together with Example 2.6 in Thomas (1971), shows that the hazard rate of a standard Gaussian random variable increases monotonically to infinity. However, we can still bound the generalized hazard rate for strictly positive $\alpha$.

Lemma 4.4.

(Generalized Hazard Bound for Gaussian) For any $\alpha\in(0,1)$, we have

[TABLE]

The proof of this lemma is deferred to the appendix.
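The behavior described above can be checked numerically. The sketch below evaluates the Gaussian generalized hazard rate on a grid, assuming the form $h_{\alpha}(z)=f(z)|z|^{\alpha}/(1-F(z))^{1-\alpha}$, and compares it with the constant $2/\alpha$. That constant is inferred from $C_{1}=\frac{2}{\alpha}$ in Corollary 4.5 rather than read off the lemma, so it should be treated as an assumption; the function name and the grid are ours.

```python
import math

def gaussian_generalized_hazard(z: float, alpha: float) -> float:
    """h_alpha(z) = f(z) |z|^alpha / (1 - F(z))^(1 - alpha) for a standard Gaussian."""
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    tail = 0.5 * math.erfc(z / math.sqrt(2.0))  # 1 - F(z), accurate in the right tail
    return pdf * abs(z) ** alpha / tail ** (1.0 - alpha)

grid = [i / 100.0 for i in range(-1000, 1001)]  # z in [-10, 10]

for alpha in (0.0, 0.1, 0.5, 0.9):
    max_on_grid = max(gaussian_generalized_hazard(z, alpha) for z in grid)
    # For alpha = 0 this is the ordinary hazard rate, which keeps growing with z.
    print(f"alpha={alpha}: max on grid = {max_on_grid:.3f}")
```

For $\alpha=0$ the maximum sits at the right end of the grid and grows as the grid is extended, while for each $\alpha>0$ it stays well below $2/\alpha$ on this grid. A grid check is of course only an illustration, not a proof.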

The bounded generalized hazard rate shown in the above lemma can be used to control the divergence penalty. Combined with other knowledge of the standard Gaussian random variable we are able to give a bound on the expected regret.

Corollary 4.5.

GBPA with an appropriately scaled standard Gaussian random variable as perturbation, where $\eta=\left(\frac{4NT}{\alpha(1-\alpha)\sqrt{2\log N}}\right)^{1/(2-\alpha)}$, has an expected regret of at most

[TABLE]

where $C_{1}=\frac{2}{\alpha}$ , $C_{2}=\frac{2}{1-\alpha}$ , for any $\alpha\in(0,1)$ .

Proof.

It is known that for standard Gaussian random variables we have $\mathbb{E}[Z_{1}]=0$ and

$$\mathbb{E}\Big[\max_{i=1,\ldots,N}Z_{i}\Big]\leq\sqrt{2\log N}.$$

Plugging these into Theorem 4.2 gives the result. ∎

It remains to optimally tune $\alpha$ in the above bound.

Theorem 4.6.

(Regret Bound for Gaussian) GBPA with an appropriately scaled standard Gaussian random variable as perturbation, where $\eta=\left(\frac{4NT}{\alpha(1-\alpha)\sqrt{2\log N}}\right)^{1/(2-\alpha)}$ and $\alpha=\frac{1}{\log T}$, has an expected regret of at most

[TABLE]

for $T>4$ . If we assume that $T>N$ , the expected regret can be upper bounded by

[TABLE]

The proof of this theorem is also deferred to the appendix.
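To get a feel for the effect of the choice $\alpha=\frac{1}{\log T}$, the sketch below computes $\alpha$ and $\eta$ as prescribed in the theorem and evaluates a bound of the form $A\eta^{\alpha-1}+\eta\sqrt{2\log N}$ with $A=\frac{4NT}{\alpha(1-\alpha)}$. This form is an assumption on our part, inferred from the stated $\eta$ and the constants $C_{1},C_{2}$ of Corollary 4.5; the function names are hypothetical.

```python
import math

def gaussian_ftpl_tuning(N: int, T: int):
    """Return (alpha, eta) as prescribed in Theorem 4.6 (requires N >= 2, T > 4)."""
    alpha = 1.0 / math.log(T)
    q = math.sqrt(2.0 * math.log(N))
    eta = (4.0 * N * T / (alpha * (1.0 - alpha) * q)) ** (1.0 / (2.0 - alpha))
    return alpha, eta

def assumed_bound(N: int, T: int, alpha: float, eta: float) -> float:
    """Assumed bound A * eta^(alpha - 1) + eta * sqrt(2 log N), A = 4NT / (alpha (1 - alpha))."""
    A = 4.0 * N * T / (alpha * (1.0 - alpha))
    return A * eta ** (alpha - 1.0) + eta * math.sqrt(2.0 * math.log(N))

N = 10
for T in (10**3, 10**5, 10**7):
    alpha, eta = gaussian_ftpl_tuning(N, T)
    ratio = assumed_bound(N, T, alpha, eta) / math.sqrt(N * T)
    print(f"T={T}: alpha={alpha:.4f}, T**alpha={T ** alpha:.4f}, bound/sqrt(NT)={ratio:.1f}")
```

The printed value of `T ** alpha` is always $e$, which is the identity used in the appendix proof, and the ratio to $\sqrt{NT}$ grows only slowly with $T$, in line with the claim of near optimality up to logarithmic factors.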

4.3 Sufficient Condition for Near Optimal Regret

In Section 4.1 we showed that if the generalized hazard rate of a distribution is bounded, the expected regret of the GBPA can be controlled. In this section, we prove that under reasonable assumptions on the distribution of the perturbation, FTPL enjoys near optimal expected regret. Note that most proofs in this section are deferred to the appendix.

Assumptions (a)-(c). Before we proceed, let us formally state our assumptions on the distributions we will consider. The distribution needs to (a) be continuous and have bounded density, (b) have finite expectation, and (c) have support unbounded in the $+\infty$ direction.

Note that if the expectation of the random perturbation is negative, we shift it so that the expectation is zero. Hence the underestimation penalty is non-positive. In addition to the assumptions we have made above, we make another assumption on the eventual monotonicity of the hazard rate.

Assumption (d). The hazard rate $\frac{f(z)}{1-F(z)}$ is eventually monotone.

“Eventually monotone” means that $\exists z_{0}\geq 0$ such that if $z>z_{0}$ , $\frac{f(z)}{1-F(z)}$ is non-decreasing or non-increasing. This assumption might appear hard to check, but numerous theorems are available to establish the monotonicity of hazard rate, which is much stronger than what we are assuming here. For example, see Theorem 2.4 in Thomas (1971), Theorem 2 and Theorem 4 in Chechile (2003), Chechile (2009). In fact, most natural distributions do satisfy this assumption (Bagnoli and Bergstrom, 2005).

Before we proceed, we mention a standard classification of random variables into two classes based on their tail property.

Definition 4.7 (see, for example, Foss et al. (2009)).

A function $f(z)\geq 0$ is said to be heavy-tailed if and only if

$$\limsup_{z\to\infty}f(z)e^{\lambda z}=\infty\quad\text{for all }\lambda>0.$$

A distribution with CDF $F(z)$ and $\overline{F}(z)=1-F(z)$ is said to be heavy-tailed if and only if $\overline{F}(z)$ is heavy-tailed. If the distribution is not heavy-tailed, we say that it is light-tailed.
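The definition can be illustrated numerically. The sketch below assumes the standard definition from Foss et al., namely that $\overline{F}$ is heavy-tailed when $\limsup_{z\to\infty}\overline{F}(z)e^{\lambda z}=\infty$ for every $\lambda>0$, and works in log space to avoid overflow. The two tails, a Pareto-type tail $(1+z)^{-2}$ and the exponential tail $e^{-z}$, are examples chosen by us.

```python
import math

def log_pareto_tail(z: float) -> float:
    """log of the Pareto-type tail (1 + z)^(-2), a hypothetical heavy-tailed example."""
    return -2.0 * math.log1p(z)

def log_exp_tail(z: float) -> float:
    """log of the Exp(1) tail exp(-z), a hypothetical light-tailed example."""
    return -z

def log_scaled_tail(log_tail, lam: float, z: float) -> float:
    """log( F_bar(z) * exp(lam * z) ), computed in log space."""
    return log_tail(z) + lam * z

lam = 0.01
for z in (1e2, 1e4, 1e6):
    print(z, log_scaled_tail(log_pareto_tail, lam, z), log_scaled_tail(log_exp_tail, lam, z))
```

Even for a small $\lambda$ the Pareto-type column eventually increases without bound, while the exponential column tends to $-\infty$, so a single $\lambda$ with a finite limit superior is enough to classify the exponential tail as light.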

It turns out that under assumptions (a)-(d), if the distribution is also heavy-tailed, then the hazard rate itself is bounded. If the distribution is light-tailed, we need an additional assumption on the eventual monotonicity of a function similar to the generalized hazard rate to ensure the boundedness of the generalized hazard rate. But before we state and prove the main results, we introduce some functions and prove an intermediate lemma that will be useful to prove the main results.

Define $R(z)=-\log\overline{F}(z)$ so that we have $\overline{F}(z)=e^{-R(z)}$ and $R^{\prime}(z)=\frac{f(z)}{\overline{F}(z)}=h_{0}(z)$.

Lemma 4.8.

Under assumptions (a)-(d), we have

[TABLE]

Proof.

Let $g(z)=\overline{F}(z)e^{\lambda z}$; then $g^{\prime}(z)=e^{\lambda z}\overline{F}(z)\big(\lambda-\frac{f(z)}{\overline{F}(z)}\big)$. Since $\frac{f(z)}{\overline{F}(z)}$ is eventually monotone by assumption (d), $g^{\prime}(z)$ is eventually positive, eventually negative, or eventually zero, so $g$ is eventually monotone. The lemma immediately follows. ∎

We are finally ready to present the main results in this section.

Theorem 4.9.

(Heavy Tail Implies Bounded Hazard) Under assumptions (a)-(d), if the distribution is also heavy-tailed, then the hazard rate is bounded, i.e.,

[TABLE]

Unlike heavy-tailed distributions, the hazard rate of light-tailed distributions might be unbounded. However, it turns out that if we make an additional assumption on the eventual monotonicity of a function similar to the generalized hazard rate, we can still guarantee the boundedness of the generalized hazard rate.

[TABLE]

Theorem 4.10.

(Light Tail Implies Bounded Generalized Hazard) Under assumptions (a)-(e), if the distribution is also light-tailed, then for any $\alpha\in(\delta,1)$, the generalized hazard rate $h_{\alpha}(z)$ is bounded, i.e.,

[TABLE]

Combining the above result with control of the divergence penalty gives us the following corollary.

Corollary 4.11.

Under assumptions (a)-(e), if the distribution is also light-tailed, the expected regret of GBPA with appropriately scaled perturbations drawn from that distribution is, for all $\alpha\in(\delta,1)$ and $\xi>0$ ,

[TABLE]

In particular, if assumption (e) holds for all $\delta\in(0,1)$, then the expected regret of GBPA is $O\Big((TN)^{1/2+\epsilon}\Big)$ for all $\epsilon>0$, i.e., it is near optimal in both $N$ and $T$.

Next we consider a family of light-tailed distributions that do not have a bounded hazard rate.

Definition 4.12.

The exponential power (or generalized normal) family of distributions, denoted as $\mathcal{D}_{\beta}$ where $\beta>1$ , is defined via the cdf

[TABLE]

The next theorem shows that GBPA with perturbations from this family of distributions enjoys near optimal expected regret in both $N$ and $T$ .

Theorem 4.13.

(Regret Bound for Exponential Power Family) For all $\beta>1$, the expected regret of GBPA with appropriately scaled perturbations drawn from $\mathcal{D}_{\beta}$ is, for all $\epsilon>0$, $O\Big((TN)^{1/2+\epsilon}\Big)$.

5 Conclusion and Future Work

Previous work on providing regret guarantees for FTPL algorithms in the adversarial multi-armed bandit setting required a bounded hazard rate condition. We have shown how to go beyond the hazard rate condition, but a number of questions remain open. For example, what if we use FTPL with perturbations from discrete distributions such as the Bernoulli distribution? In the full information setting, Devroye et al. (2013) and Van Erven et al. (2014) have considered random walk perturbation and dropout perturbation, both leading to minimax optimal regret. But to the best of our knowledge those distributions have not been analyzed in the adversarial multi-armed bandit problem.

An unsatisfactory aspect of even the tightest bounds for FTPL algorithms from existing work, including ours, is that they never reach the minimax optimal $O(\sqrt{NT})$ bound. They come very close to it: up to logarithmic factors. It is known that FTRL algorithms, using the negative Tsallis entropy as the regularizer, can achieve the optimal bound (Audibert and Bubeck, 2009; Audibert et al., 2011; Abernethy et al., 2015). Is there a perturbation that can achieve the optimal bound?

We only considered multi-armed bandits in this work. There has been some interest in using FTPL algorithms for combinatorial bandit problems (see, for example, Neu and Bartók (2013)). In future work, it will be interesting to extend our analysis to combinatorial bandit problems.

Acknowledgments.

We thank Jacob Abernethy and Chansoo Lee for helpful discussions. We acknowledge the support of NSF CAREER grant IIS-1452099 and a Sloan Research Fellowship.

Appendix A Proofs

A.1 Proof of Lemma 4.4

Proof.

Since the numerator of the left hand side is an even function of $z$ , and the denominator is a decreasing function, and the inequality is trivially true when $z=0$ , it suffices to prove for $z>0$ , which we assume for the rest of the proof. From Lemma 4.3 we can derive that

[TABLE]

Therefore,

[TABLE]

Let $g(z)=ze^{-\alpha z^{2}/2}$, so that $g^{\prime}(z)=(1-\alpha z^{2})e^{-\alpha z^{2}/2}$. Hence, on $z>0$, $g(z)$ is maximized at $z^{*}=\sqrt{\frac{1}{\alpha}}$. Therefore,

[TABLE]

∎
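For reference, the maximization of $g(z)=ze^{-\alpha z^{2}/2}$ used in the proof above can be written out in full. The following is a worked calculation that relies only on the stated $g$ and $g^{\prime}$:

```latex
g'(z) = (1-\alpha z^{2})\,e^{-\alpha z^{2}/2}
\begin{cases}
> 0, & 0 < z < 1/\sqrt{\alpha},\\
= 0, & z = 1/\sqrt{\alpha},\\
< 0, & z > 1/\sqrt{\alpha},
\end{cases}
\qquad
\sup_{z>0} g(z) = g\big(1/\sqrt{\alpha}\big)
  = \frac{1}{\sqrt{\alpha}}\,e^{-1/2}
  = \frac{1}{\sqrt{\alpha e}} .
```

Since $\sqrt{\alpha}\geq\alpha$ for $\alpha\in(0,1)$, this maximum is at most $\frac{1}{\alpha}$.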

A.2 Proof of Theorem 4.6

Proof.

From Corollary 4.5 we see that the expected regret can be upper bounded by

[TABLE]

where $C_{1}=\frac{2}{\alpha}$ and $C_{2}=\frac{2}{1-\alpha}$. Note that

[TABLE]

If we let $\alpha=\frac{1}{\log T}$, then $T^{\alpha}=T^{1/\log T}=e<3$. Then, we have, for $T>4$,

[TABLE]

Putting things together finishes the proof. ∎

A.3 Proof of Theorem 4.9

Proof.

If the distribution is heavy-tailed, we have

[TABLE]

By Lemma 4.8, we can erase the supremum operator and just write

[TABLE]

Hence,

[TABLE]

Note that $R^{\prime}(z)=\frac{f(z)}{\overline{F}(z)}$ , which is eventually monotone by assumption. Therefore, we can conclude that

[TABLE]

∎

A.4 Proof of Theorem 4.10

Proof.

If the distribution is light-tailed, we have

[TABLE]

This immediately implies that

[TABLE]

Consider $\lim_{z\rightarrow\infty}\frac{f(z)}{\overline{F}(z)}=\lim_{z\rightarrow\infty}R^{\prime}(z)$ . If $\lim_{z\rightarrow\infty}R^{\prime}(z)<\infty$ we can immediately conclude that $\sup_{z}\frac{f(z)}{1-F(z)}<\infty$ . If $\lim_{z\rightarrow\infty}R^{\prime}(z)=\infty$ instead, note that

[TABLE]

Moreover, since $\lim_{z\rightarrow\infty}R^{\prime}(z)=\infty$, $R^{\prime}(z)e^{-\delta R(z)}$ is strictly positive for all $z>z_{0}$ for some $z_{0}$. Furthermore, $R^{\prime}(z)e^{-\delta R(z)}=\frac{f(z)}{(\overline{F}(z))^{1-\delta}}$ is eventually monotone by assumption (e).

Therefore, we can conclude that

[TABLE]

$\forall\alpha\in(\delta,1)$ , from Equation (22) we have $\lim_{z\rightarrow+\infty}z^{\alpha}\overline{F}(z)^{\alpha-\delta}=0$ , so

[TABLE]

and hence

[TABLE]

∎

A.5 Proof of Corollary 4.11

Proof.

For a light-tailed distribution $\mathcal{D}$ , we have

[TABLE]

This implies that

[TABLE]

Let the random variable $Z$ follow the distribution $\mathcal{D}$. Since $Z$ might take negative values, we define a new distribution $\mathcal{D}^{\prime}$ that only takes non-negative values by

[TABLE]

where $p_{\mathcal{D}+}=\mathbb{P}(Z\geq 0)>0$ by the right-unbounded support assumption. Clearly, with this definition of $\mathcal{D}^{\prime}$ we see that $\mathbb{E}_{Z_{1},\dots,Z_{N}\sim\mathcal{D}}[\max\limits_{i}Z_{i}]\leq\mathbb{E}_{Z_{1},\dots,Z_{N}\sim\mathcal{D}^{\prime}}[\max\limits_{i}Z_{i}]$ and, for $z>z_{0}$, we have $\overline{F}_{\mathcal{D}^{\prime}}(z)=\frac{\overline{F}_{\mathcal{D}}(z)}{p_{\mathcal{D}+}}\leq C^{\prime}e^{-\lambda^{*}z}$ where $C^{\prime}=\frac{C}{p_{\mathcal{D}+}}$. Note that

[TABLE]

If we let $u=\frac{\log(N)}{\lambda^{*}}$ , obviously $u>z_{0}$ if $N$ is sufficiently large. Thus, we see that

[TABLE]

From Theorem 4.10 we see that $\forall\alpha\in(\delta,1)$ ,

[TABLE]

Plugging (23) and (24) into Theorem 4.2 gives the desired result.

∎
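The logarithmic growth in $N$ suggested by the choice $u=\frac{\log(N)}{\lambda^{*}}$ can be seen exactly in one concrete light-tailed case. For $N$ i.i.d. exponential random variables with rate $\lambda$, the expected maximum equals $H_{N}/\lambda$, where $H_{N}$ is the $N$-th harmonic number, and $H_{N}\leq 1+\log N$. The sketch below, with a function name of our choosing, compares the two quantities; the exponential distribution is used only as an illustration and is not the general distribution $\mathcal{D}$ of the proof.

```python
import math

def expected_max_exponential(N: int, rate: float) -> float:
    """Exact E[max of N i.i.d. Exp(rate)] = H_N / rate, with H_N the N-th harmonic number."""
    harmonic = sum(1.0 / k for k in range(1, N + 1))
    return harmonic / rate

rate = 1.0
for N in (2, 10, 100, 1000):
    exact = expected_max_exponential(N, rate)
    log_bound = (1.0 + math.log(N)) / rate  # elementary bound H_N <= 1 + log N
    print(N, round(exact, 4), round(log_bound, 4))
```

In this example the role of $Q(N)$ in Theorem 4.2 is played by a quantity of order $\log N$.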

A.6 Proof of Theorem 4.13

Proof.

By Corollary 4.11 we only need to check that assumptions (a)-(d) hold for the distribution $\mathcal{D}_{\beta}$, that the exponential power family is light-tailed, and that assumption (e) also holds for any $\delta\in(0,1)$. By observing the density function $f_{\beta}$ we can trivially see that assumptions (a)-(c) hold and that the exponential power family is light-tailed. Therefore, defining

[TABLE]

it suffices to show that $\forall\delta\in[0,1),g_{\delta,\beta}(z)$ is eventually monotone. Note that

[TABLE]

It further suffices to show that

[TABLE]

is eventually non-negative or non-positive $\forall\beta>1,\delta\in[0,1)$ . Note that since $\beta>1$ ,

[TABLE]

Therefore, $m_{0,\beta}(z)>0$ for all $z\geq 0$, i.e., the hazard rate is always increasing and assumption (d) is satisfied. Now, we are left to show that $m_{\delta,\beta}(z)$ is eventually non-negative or non-positive for any $\delta\in(0,1)$. Note that

[TABLE]

Therefore,

[TABLE]

From Equation (25) we know that

[TABLE]

Hence, we conclude that

[TABLE]

which implies that $m_{\delta,\beta}(z)$ is eventually non-positive for any $\delta\in(0,1)$, i.e., assumption (e) holds for any $\delta\in(0,1)$.

∎

Bibliography

Abernethy et al. [2014] Jacob Abernethy, Chansoo Lee, Abhinav Sinha, and Ambuj Tewari. Online linear optimization via smoothing. In COLT, pages 807–823, 2014.
Abernethy et al. [2015] Jacob Abernethy, Chansoo Lee, and Ambuj Tewari. Fighting bandits with a new kind of smoothness. In Advances in Neural Information Processing Systems 28, pages 2188–2196, 2015.
Audibert and Bubeck [2009] Jean-Yves Audibert and Sébastien Bubeck. Minimax policies for adversarial and stochastic bandits. In COLT, pages 217–226, 2009.
Audibert et al. [2011] Jean-Yves Audibert, Sébastien Bubeck, and Gábor Lugosi. Minimax policies for combinatorial prediction games. In COLT, 2011.
Auer et al. [2002] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The non-stochastic multi-armed bandit problem. SIAM J. Comput., 32:48–77, 2002.
Bagnoli and Bergstrom [2005] Mark Bagnoli and Ted Bergstrom. Log-concave probability and its applications. Economic Theory, 26(2):445–469, 2005.
Baricz [2008] Árpád Baricz. Mills’ ratio: Monotonicity patterns and functional inequalities. J. Math. Anal. Appl., 340(2):1362–1370, 2008.
Bertsekas [1973] Dimitri P. Bertsekas. Stochastic optimization problems with nondifferentiable cost functionals. Journal of Optimization Theory and Applications, 12(2):218–231, 1973. ISSN 0022-3239.
