Beyond the Hazard Rate: More Perturbation Algorithms for Adversarial Multi-armed Bandits
Zifan Li, Ambuj Tewari

TL;DR
This paper extends the analysis of follow the perturbed leader algorithms for adversarial multi-armed bandits by introducing the generalized hazard rate, allowing for regret bounds with distributions like Gaussian and uniform.
Contribution
It introduces the generalized hazard rate concept and provides regret bounds for FTPL algorithms without the bounded hazard rate assumption, including for Gaussian and uniform distributions.
Findings
Gaussian distribution can achieve near-optimal regret.
Regret bounds are established for distributions with unbounded support.
Disproves the conjecture that Gaussian cannot be used for low-regret algorithms.
Beyond the Hazard Rate: More Perturbation Algorithms for Adversarial Multi-armed Bandits
Zifan Li
University of Michigan
Ambuj Tewari
University of Michigan
Abstract
Recent work on follow the perturbed leader (FTPL) algorithms for the adversarial multi-armed bandit problem has highlighted the role of the hazard rate of the distribution generating the perturbations. Assuming that the hazard rate is bounded, it is possible to provide regret analyses for a variety of FTPL algorithms for the multi-armed bandit problem. This paper pushes the inquiry into regret bounds for FTPL algorithms beyond the bounded hazard rate condition. There are good reasons to do so: natural distributions such as the uniform and Gaussian violate the condition. We give regret bounds for both bounded support and unbounded support distributions without assuming the hazard rate condition. We also disprove a conjecture that the Gaussian distribution cannot lead to a low-regret algorithm. In fact, it turns out that it leads to near optimal regret, up to logarithmic factors. A key ingredient in our approach is the introduction of a new notion called the generalized hazard rate.
Keywords: online learning, regret, multi-armed bandits, follow the perturbed leader, gradient based algorithms
1 Introduction
Starting from the seminal work of Hannan (1957) and later developments due to Kalai and Vempala (2005), perturbation based algorithms (called “Follow the Perturbed Leader (FTPL)”) have occupied a central place in online learning. Another major family of online learning algorithms, called “Follow the Regularized Leader (FTRL)”, is based on the idea of regularization. In special cases, such as the exponential weights algorithm for the experts problem, it has been folk knowledge that regularization and perturbation ideas are connected. That is, the exponential weights algorithm can be understood as either using negative entropy regularization or Gumbel distributed perturbations (for example, see the discussion in Abernethy et al. (2014)).
Recent work has begun to further uncover the connections between perturbation and regularization. For example, in online linear optimization, one can understand regularization and perturbation as simply two different ways to smooth a non-smooth potential function. The former corresponds to infimal convolution smoothing and the latter corresponds to stochastic (or integral convolution) smoothing (Abernethy et al., 2014). Having a generic framework for understanding perturbations allows one to study a wide variety of online linear optimization games and a number of interesting perturbations.
FTRL and FTPL algorithms have also been used beyond “full information” settings. “Full information” refers to the fact that the learner observes the entire move of the adversary. The multi-armed bandit problem is one of the most fundamental examples of “partial information” settings. Regret analysis of the multi-armed bandit problem goes back to the work of Robbins (1952) who formulated the stochastic version of the problem. The non-stochastic, or adversarial, version was formulated by Auer et al. (2002), who provided the EXP3 algorithm achieving $O(\sqrt{TN \log N})$ regret in $T$ rounds with $N$ arms. They also showed a lower bound of $\Omega(\sqrt{TN})$, which was later matched by the Poly-INF algorithm (Audibert and Bubeck, 2009; Audibert et al., 2011). The Poly-INF algorithm can be interpreted as an FTRL algorithm with negative Tsallis entropy regularization (Audibert et al., 2011; Abernethy et al., 2015). For a recent survey of both stochastic and non-stochastic bandit problems, see Bubeck and Cesa-Bianchi (2012).
For the non-stochastic multi-armed bandit problem, Kujala and Elomaa (2005) and Poland (2005) both showed that using the exponential (actually double exponential/Laplace) distribution in an FTPL algorithm coupled with standard unbiased estimation technique yields near-optimal regret. Unbiased estimation needs access to arm probabilities that are not explicitly available when using an FTPL algorithm. Neu and Bartók (2013) introduced the geometric resampling scheme to approximate these probabilities while still guaranteeing low regret. Recently, Abernethy et al. (2015) analyzed FTPL for adversarial multi-armed bandits and provided regret bounds under the condition that the hazard rate of the perturbation distribution is bounded. This condition allowed them to consider a variety of perturbation distributions beyond the exponential, such as Gamma, Gumbel, Frechet, Pareto, and Weibull.
Unfortunately, the bounded hazard rate condition is violated by two of the most widely known distributions, namely the uniform and the Gaussian. (The uniform distribution is also historically significant, as it was used in the original FTPL algorithm of Hannan (1957).) Therefore, the results of Abernethy et al. (2015) say nothing about the regret incurred in an adversarial multi-armed bandit problem when we use these distributions (without forced exploration) to generate perturbations. Contrast this to the full information experts setting where using these distributions as perturbations yields optimal regret and even yields the optimal dependence on the dimension in the Gaussian case (Abernethy et al., 2014).
The Gaussian distribution has lighter tails than the exponential. The hazard rate of a Gaussian increases linearly on the real line (and is hence unbounded) whereas the exponential has a constant hazard rate. Does having too light a tail make a perturbation inherently bad? The uniform is even worse from a light tail point of view: it has bounded support! In fact, Kujala and Elomaa (2005) had trouble dealing with the uniform distribution and remarked, “we failed to analyze the expert setting when the perturbation distribution was uniform.” Does having a bounded support make a perturbation even worse? Or is it that the hazard rate condition is just a sufficient condition, without being anywhere close to necessary for a good regret bound to exist? The analysis of Abernethy et al. (2015) suggests that perhaps a bounded hazard rate is critical. They even made the following conjecture.
Conjecture 1.
If a distribution has a monotonically increasing hazard rate $h(x)$ that does not converge as $x \to +\infty$ (e.g., Gaussian), then there is a sequence of gains that causes the corresponding FTPL algorithm to incur at least a linear regret.
The main contribution of this paper is to provide answers to the questions raised above. First, we show that boundedness of the hazard rate is certainly not a requirement for achieving sublinear (in $T$) regret. Bounded support distributions, like the uniform, violate the boundedness condition on the hazard rate in the most extreme way. Their hazard rate blows up not just asymptotically at infinity, as in the Gaussian case, but as one approaches the right edge of the support. Yet, we can show (Corollary 3.3) that using the uniform distribution results in a regret bound of $O((NT)^{2/3})$. This bound is clearly not optimal. But optimality is not the point here. What is surprising, especially if one regards Conjecture 1 as plausible, is that a non-trivial sublinear bound holds at all. In fact, we show (Corollary 3.4) that using any continuous distribution with bounded support and bounded density results in a sublinear regret bound.
Second, moving beyond bounded support distributions to ones with unbounded support, we settle Conjecture 1 in the negative. In Theorem 4.6 we show that, instead of suffering linear regret as predicted by Conjecture 1, a perturbation algorithm using the Gaussian distribution enjoys a regret bound that is near optimal up to logarithmic factors. A key ingredient in our approach is a new quantity that we call the generalized hazard rate of a distribution. We show that a bounded generalized hazard rate is enough to guarantee sublinear regret in $T$ (Theorem 4.2).
Finally, we investigate the relationship between tail behavior of random perturbations and the regret they induce. We show that heavy tails, along with some fairly mild assumptions, guarantee a bounded hazard rate (Theorem 4.9) and hence previous results can yield regret bounds for these perturbations. However, light tails can fail to have a bounded hazard rate. Nevertheless, we show that under reasonable conditions, light-tailed distributions do have a bounded generalized hazard rate (Theorem 4.10). This result allows us to show that reasonably behaved light-tailed distributions lead to near optimal regret (Corollary 4.11). In particular, the exponential power (or generalized normal) family of distributions yields near optimal regret (Theorem 4.13).
2 Follow the Perturbed Leader Algorithm for Bandits
Recall the setting of the adversarial multi-armed bandit problem (Auer et al., 2002). An adversary (or Nature) chooses gain vectors $g_t \in [-1, 0]^N$ for $1 \le t \le T$ ahead of the game. Such an adversary is called oblivious. At round $t$ in a repeated game, the learner must choose a distribution $p_t \in \Delta_N$ over the set of $N$ available arms (or actions). The learner plays action $i_t$ sampled according to $p_t$ and accumulates the gain $g_{t,i_t} \in [-1, 0]$. The learner observes only $g_{t,i_t}$ and receives no information about the values $g_{t,j}$ for $j \neq i_t$.
The learner’s goal is to minimize the regret. Regret is defined to be the difference in the realized gains and the gains of the best fixed action in hindsight:
$$\mathrm{Regret} := \max_{i \in [N]} \sum_{t=1}^{T} g_{t,i} - \sum_{t=1}^{T} g_{t,i_t}.$$
To be precise, we consider the expected regret, where the expectation is taken with respect to the learner’s randomization. Note that, under an oblivious adversary, the only random variables in the above expression are the actions $i_t$ of the learner. For convenience, define the cumulative gain vectors by
$$G_t := \sum_{s=1}^{t} g_s. \tag{1}$$
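As a small illustration of the regret definition above, the following sketch computes the realized regret for a fixed gain sequence and a fixed sequence of played arms. The function name `realized_regret` and the toy data are assumptions introduced here for illustration; they are not part of the paper.

```python
from typing import Sequence


def realized_regret(gains: Sequence[Sequence[float]], actions: Sequence[int]) -> float:
    """Gain of the best fixed arm in hindsight minus the learner's realized gain.

    gains[t][i] is the gain of arm i at round t (assumed to lie in [-1, 0]);
    actions[t] is the arm played at round t.
    """
    n_arms = len(gains[0])
    # Cumulative gain vector G_T, one entry per arm.
    cumulative = [sum(g[i] for g in gains) for i in range(n_arms)]
    # Gain actually collected by the learner.
    learner = sum(g[a] for g, a in zip(gains, actions))
    return max(cumulative) - learner


# Toy example with N = 2 arms and T = 3 rounds (hypothetical data).
toy_gains = [[-1.0, 0.0], [-0.5, -1.0], [0.0, -0.5]]
toy_actions = [0, 1, 0]
print(realized_regret(toy_gains, toy_actions))
```

The expected regret studied in the paper averages this quantity over the learner's random choice of actions.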
2.1 The Gradient-Based Algorithmic Template
We will consider the algorithmic template described in Framework 1, which is the Gradient Based Prediction Algorithm (GBPA) (see, for example, Abernethy et al. (2015)). Let $\Delta_N$ be the $(N-1)$-dimensional probability simplex in $\mathbb{R}^N$. Denote the standard basis vector along the $i$th dimension by $e_i$. At any round $t$, the action choice $i_t$ is made by sampling from the distribution $p_t$, which is obtained by applying the gradient of a convex function $\tilde\Phi$ to the estimate $\hat G_{t-1}$ of the cumulative gain vector so far. The choice of $\tilde\Phi$ is flexible but it must be a differentiable convex function such that its gradient is always in $\Delta_N$.
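Note that we do not require the range of $\nabla\tilde\Phi$ to be contained in the interior of the probability simplex. If we required the gradient to lie in the interior, we would not be able to deal with bounded support distributions such as the uniform distribution. Even though some entries of the probability vector $p_t$ might be $0$, the estimation step is always well defined since $p_{t,i_t} > 0$. But allowing $p_{t,i}$ to be zero means that $\hat g_t$ is not exactly an unbiased estimator of $g_t$. Instead, it is an unbiased estimator on the support of $p_t$. That is, $\mathbb{E}[\hat g_{t,i} \mid i_{1:t-1}] = g_{t,i}$ for any $i$ such that $p_{t,i} > 0$. Here, $i_{1:t-1}$ is shorthand for $i_1, \ldots, i_{t-1}$. Therefore, irrespective of whether $p_{t,i} = 0$ or not, we always have
$$\mathbb{E}[p_{t,i}\, \hat g_{t,i} \mid i_{1:t-1}] = p_{t,i}\, g_{t,i}. \tag{2}$$
When $p_{t,i} = 0$, we have $\hat g_{t,i} = 0$ but $g_{t,i} \le 0$, which means that $\hat g_t$ overestimates $g_t$ outside the support of $p_t$. Hence, we also have
$$\mathbb{E}[\hat g_t \mid i_{1:t-1}] \succeq g_t, \tag{3}$$
where $\succeq$ means element-wise greater than or equal to.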
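The template just described (sample an arm from the gradient of a potential, observe one gain, form an importance-weighted estimate, update the cumulative estimate) can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's Framework 1 verbatim: the names `gbpa` and `softmax_gradient` are hypothetical, and the softmax gradient (the exponential weights potential mentioned in the introduction) is used only as a convenient stand-in whose gradient has a closed form.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple


def gbpa(
    gains: Sequence[Sequence[float]],
    grad_potential: Callable[[List[float]], List[float]],
    rng: random.Random,
) -> Tuple[float, List[float]]:
    """Run the gradient-based template on a fixed (oblivious) gain sequence.

    Returns the learner's total gain and the final cumulative gain estimate.
    """
    n_arms = len(gains[0])
    g_hat_cum = [0.0] * n_arms  # estimated cumulative gains, initialised to 0
    total = 0.0
    for g in gains:
        p = grad_potential(g_hat_cum)                 # sampling distribution p_t
        arm = rng.choices(range(n_arms), weights=p)[0]
        total += g[arm]                               # only g[arm] is observed
        g_hat_cum[arm] += g[arm] / p[arm]             # importance-weighted estimate
    return total, g_hat_cum


def softmax_gradient(eta: float) -> Callable[[List[float]], List[float]]:
    """Gradient of eta * log(sum_i exp(G_i / eta)); an assumed example potential."""
    def grad(G: List[float]) -> List[float]:
        m = max(G)  # subtract the max for numerical stability
        w = [math.exp((x - m) / eta) for x in G]
        s = sum(w)
        return [x / s for x in w]
    return grad


demo_gains = [[-1.0, 0.0, -0.5] for _ in range(50)]
demo_total, demo_estimate = gbpa(demo_gains, softmax_gradient(5.0), random.Random(0))
print(demo_total, demo_estimate)
```

Because the chosen arm always has positive probability, the division in the estimation step is well defined, which mirrors the remark above about zero entries of the probability vector.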
We now present a basic result bounding the expected regret of GBPA in the multi-armed bandit setting. It is a simple modification of the arguments in Abernethy et al. (2015) to deal with the possibility that $p_{t,i} = 0$. We state and prove this result here for completeness without making any claim of novelty.
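Lemma 2.1.
(Decomposition of the Expected Regret) Define the non-smooth potential $\Phi(G) = \max_i G_i$. The expected regret of GBPA can be written as
$$\mathbb{E}\,\mathrm{Regret} = \Phi(G_T) - \mathbb{E}\Big[\sum_{t=1}^{T} \langle p_t, g_t \rangle\Big]. \tag{4}$$
Furthermore, the expected regret of GBPA can be bounded by the sum of an overestimation, an underestimation, and a divergence penalty:
$$\mathbb{E}\,\mathrm{Regret} \le \underbrace{\tilde\Phi(0)}_{\text{overestimation penalty}} + \underbrace{\mathbb{E}\big[\Phi(\hat G_T) - \tilde\Phi(\hat G_T)\big]}_{\text{underestimation penalty}} + \underbrace{\mathbb{E}\Big[\sum_{t=1}^{T} \mathbb{E}\big[D_{\tilde\Phi}(\hat G_t, \hat G_{t-1}) \mid i_{1:t-1}\big]\Big]}_{\text{divergence penalty}}, \tag{5}$$
where the expectations are over the sampling of $i_t$ and $D_{\tilde\Phi}$ is the Bregman divergence induced by $\tilde\Phi$.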
Proof.
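First, note that the regret, by definition, is
$$\mathrm{Regret} = \Phi(G_T) - \sum_{t=1}^{T} \langle e_{i_t}, g_t \rangle.$$
Under an oblivious adversary, only the summation on the right hand side is random. Moreover $\mathbb{E}[\langle e_{i_t}, g_t \rangle \mid i_{1:t-1}] = \langle p_t, g_t \rangle$. This proves the claim in (4).
From (2), we know that $\mathbb{E}[\langle p_t, \hat g_t \rangle \mid i_{1:t-1}] = \langle p_t, g_t \rangle$ even if some entries in $p_t$ might be zero. Therefore, we have
$$\mathbb{E}\,\mathrm{Regret} = \Phi(G_T) - \mathbb{E}\Big[\sum_{t=1}^{T} \langle p_t, \hat g_t \rangle\Big]. \tag{6}$$
From (3), we know that $\mathbb{E}[\hat G_T] \succeq G_T$. This implies
$$\Phi(G_T) \le \Phi(\mathbb{E}[\hat G_T]) \le \mathbb{E}[\Phi(\hat G_T)], \tag{7}$$
where the first inequality is because $G \succeq G'$ implies $\Phi(G) \ge \Phi(G')$, and the second inequality is due to the convexity of $\Phi$. Plugging (7) into (6) yields
$$\mathbb{E}\,\mathrm{Regret} \le \mathbb{E}[\Phi(\hat G_T)] - \mathbb{E}\Big[\sum_{t=1}^{T} \langle p_t, \hat g_t \rangle\Big]. \tag{8}$$
Now, recalling the definition of Bregman divergence
$$D_{\tilde\Phi}(\hat G_t, \hat G_{t-1}) = \tilde\Phi(\hat G_t) - \tilde\Phi(\hat G_{t-1}) - \langle \nabla\tilde\Phi(\hat G_{t-1}), \hat G_t - \hat G_{t-1} \rangle, \tag{9}$$
we can write,
$$-\sum_{t=1}^{T} \langle p_t, \hat g_t \rangle = -\sum_{t=1}^{T} \langle \nabla\tilde\Phi(\hat G_{t-1}), \hat G_t - \hat G_{t-1} \rangle = \sum_{t=1}^{T} \Big( D_{\tilde\Phi}(\hat G_t, \hat G_{t-1}) + \tilde\Phi(\hat G_{t-1}) - \tilde\Phi(\hat G_t) \Big) = \tilde\Phi(\hat G_0) - \tilde\Phi(\hat G_T) + \sum_{t=1}^{T} D_{\tilde\Phi}(\hat G_t, \hat G_{t-1}). \tag{10}$$
The proof ends by plugging (10) into (8) and noting that $\tilde\Phi(\hat G_0) = \tilde\Phi(0)$ is not random. ∎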
2.2 Stochastic Smoothing of Potential Function
Let $\mathcal{D}$ be a continuous distribution with finite expectation, probability density function $f$, and cumulative distribution function $F$. Consider GBPA with potential function of the form:
$$\tilde\Phi(G; \mathcal{D}) = \mathbb{E}_{Z_1, \ldots, Z_N \overset{\text{i.i.d.}}{\sim} \mathcal{D}}\, \Phi(G + Z), \tag{11}$$
which is a stochastic smoothing of the non-smooth function $\Phi$. Here $Z = (Z_1, \ldots, Z_N)$ denotes the vector of perturbations. We will often hide the dependence on the distribution $\mathcal{D}$ if the distribution is obvious from the context or when the dependence on $\mathcal{D}$ is not of importance in the argument. Since $\Phi$ is convex, $\tilde\Phi$ is also convex. For stochastic smoothing, we have the following result to control the underestimation and overestimation penalty.
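Lemma 2.2.
For any $G$, we have
$$\Phi(G) + \mathbb{E}[Z_1] \le \tilde\Phi(G) \le \Phi(G) + \mathrm{EMAX}(N), \tag{12}$$
where $\mathrm{EMAX}(N)$ is any function such that
$$\mathbb{E}_{Z_1, \ldots, Z_N}\big[\max_i Z_i\big] \le \mathrm{EMAX}(N).$$
In particular, this implies that the overestimation penalty $\tilde\Phi(0)$ is upper bounded by $\mathrm{EMAX}(N)$ and the underestimation penalty $\mathbb{E}[\Phi(\hat G_T) - \tilde\Phi(\hat G_T)]$ is upper bounded by $-\mathbb{E}[Z_1]$.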
Proof.
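We have,
$$\Phi(G) + \mathbb{E}[Z_1] = \max_i \mathbb{E}[G_i + Z_i] \le \mathbb{E}\big[\max_i (G_i + Z_i)\big] \le \max_i G_i + \mathbb{E}\big[\max_i Z_i\big] \le \Phi(G) + \mathrm{EMAX}(N).$$
Noting that $\tilde\Phi(G) = \mathbb{E}[\max_i (G_i + Z_i)]$ finishes the proof. ∎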
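The sandwich inequality of Lemma 2.2 can be checked numerically. The sketch below is only an illustration under assumptions: it uses standard Gaussian perturbations, a Monte Carlo average in place of the exact expectation, and a hypothetical helper name `smoothing_sandwich`. Since the perturbations are i.i.d., the perturbation of the best unperturbed arm is used as the sample of the single perturbation appearing in the lower bound.

```python
import random
from typing import List, Tuple


def smoothing_sandwich(G: List[float], n_samples: int, rng: random.Random) -> Tuple[float, float, float]:
    """Monte Carlo estimates of (lower bound, smoothed potential, upper bound).

    lower    ~ max_i G_i + E[Z_1]
    smoothed ~ E[max_i (G_i + Z_i)]
    upper    ~ max_i G_i + E[max_i Z_i]
    All three use the same perturbation samples.
    """
    phi = max(G)
    best = G.index(phi)  # arm achieving the unperturbed maximum
    sum_single = sum_smoothed = sum_max = 0.0
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in G]
        sum_single += z[best]
        sum_smoothed += max(g + zi for g, zi in zip(G, z))
        sum_max += max(z)
    return (
        phi + sum_single / n_samples,
        sum_smoothed / n_samples,
        phi + sum_max / n_samples,
    )


lower, smoothed, upper = smoothing_sandwich([0.0, -0.5, -1.0], 5000, random.Random(0))
print(lower, smoothed, upper)
```

Both inequalities hold sample by sample, so the empirical averages respect the same ordering up to floating-point rounding.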
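Observe that $\Phi(G + Z)$ as a function of $G$ is differentiable with probability $1$ (under the randomness of the $Z_i$’s) due to the fact that the $Z_i$’s are random variables with a density. By Proposition 2.3 of Bertsekas (1973), we can swap the order of differentiation and expectation:
$$\nabla\tilde\Phi(G) = \mathbb{E}\big[\nabla\Phi(G + Z)\big] = \mathbb{E}\big[e_{i^*}\big], \quad \text{where } i^* = \arg\max_{i} \{G_i + Z_i\}. \tag{13}$$
Note that, for any $G$, the random index $i^*$ is unique with probability $1$. Hence, ties between arms can be resolved arbitrarily. It is clear from above that $\nabla\tilde\Phi$, being an expectation of vectors in the probability simplex, is in the probability simplex. Thus, it is a valid potential to be used in Framework 1. Now we derive an identity to write the gradient of the smoothed potential function in terms of the expectation of the cumulative distribution function,
$$\frac{\partial \tilde\Phi}{\partial G_i}(G) = \mathbb{P}\big(G_i + Z_i > \tilde G_{-i}\big) = \mathbb{E}_{\tilde G_{-i}}\big[1 - F(\tilde G_{-i} - G_i)\big], \tag{14}$$
where $\tilde G_{-i} = \max_{j \neq i} (G_j + Z_j)$. If $\mathcal{D}$ has unbounded support then this partial derivative is non-zero for all $i$ given any $G$. However, it can be zero if $\mathcal{D}$ has bounded support. Similarly, we have the following useful identity that writes the diagonal of the Hessian of the smoothed potential function in terms of the expectation of the probability density function.
$$\frac{\partial^2 \tilde\Phi}{\partial G_i^2}(G) = \mathbb{E}_{\tilde G_{-i}}\big[f(\tilde G_{-i} - G_i)\big]. \tag{15}$$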
2.3 Connection to Follow the Perturbed Leader
The sampling step of Framework 1 with a stochastically smoothed $\tilde\Phi$ as the potential (Equation 11) can be done efficiently. Instead of evaluating the expectation (Equation 13), we just take a random sample. Doing so gives us an equivalent of the Follow the Perturbed Leader (FTPL) algorithm (Kalai and Vempala, 2005) applied to the bandit setting. On the other hand, the estimation step is hard because generally there is no closed-form expression for $\nabla\tilde\Phi$.
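The sampling step described above amounts to adding fresh perturbations to the current cumulative estimates and playing the arm with the largest perturbed value. The following sketch illustrates this with Gaussian perturbations scaled by a parameter `eta`; the function names, the choice of Gaussian noise, and the Monte Carlo estimate of the arm probabilities are assumptions made for illustration only.

```python
import random
from typing import List


def ftpl_choose_arm(G_hat: List[float], eta: float, rng: random.Random) -> int:
    """One FTPL draw: perturb each cumulative estimate and pick the leader."""
    perturbed = [g + eta * rng.gauss(0.0, 1.0) for g in G_hat]
    return max(range(len(G_hat)), key=perturbed.__getitem__)


def estimate_arm_probabilities(
    G_hat: List[float], eta: float, n_samples: int, rng: random.Random
) -> List[float]:
    """Monte Carlo approximation of the gradient of the smoothed potential.

    The true arm probabilities generally have no closed form, which is why
    the estimation step of the template is the hard part.
    """
    counts = [0] * len(G_hat)
    for _ in range(n_samples):
        counts[ftpl_choose_arm(G_hat, eta, rng)] += 1
    return [c / n_samples for c in counts]


print(estimate_arm_probabilities([0.0, -5.0], 1.0, 2000, random.Random(1)))
```

A single call to `ftpl_choose_arm` is all the learner needs in order to act; the averaging in `estimate_arm_probabilities` is shown only to make the link with the gradient visible.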
To address this issue, Neu and Bartók (2013) proposed Geometric Resampling (GR), an iterative resampling process to estimate $1/p_{t,i_t}$ (with bias). They showed that stopping GR after $M$ iterations introduces an estimation bias that adds at most $NT/(eM)$ to the regret. That is, all GBPA regret bounds that we prove will hold for the corresponding FTPL algorithm that does $M$ iterations of GR at every time step, with an extra additive $NT/(eM)$ term. This extra term does not affect the regret rate as long as $M = \sqrt{NT}$, because the lower bound for any adversarial multi-armed bandit algorithm is of the order $\sqrt{NT}$.
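A minimal sketch of the geometric resampling idea follows: after an arm has been played, the learner keeps redrawing perturbed leaders until the played arm shows up again or a cap on the number of draws is reached, and the number of draws serves as a (biased) estimate of the inverse probability of that arm. The function names and the callback-based interface are assumptions for illustration, not the interface used by Neu and Bartók (2013).

```python
from typing import Callable


def geometric_resampling(chosen_arm: int, sample_arm: Callable[[], int], max_iters: int) -> int:
    """Return the number of draws until `chosen_arm` reappears, capped at max_iters."""
    for k in range(1, max_iters + 1):
        if sample_arm() == chosen_arm:
            return k
    return max_iters


def expected_capped_estimate(p: float, max_iters: int) -> float:
    """Expected value of the capped draw count when the arm has probability p.

    Equals (1 - (1 - p)^M) / p, which is at most 1 / p; the gap is the bias
    introduced by stopping after M draws.
    """
    return (1.0 - (1.0 - p) ** max_iters) / p


demo_draws = iter([1, 2, 0, 0])
print(geometric_resampling(0, lambda: next(demo_draws), 10))
print(expected_capped_estimate(0.5, 2), 1.0 / 0.5)
```

In an FTPL implementation, `sample_arm` would redraw the perturbations and return the perturbed leader for the current cumulative estimates.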
2.4 The Role of the Hazard Rate and Its Limitation
In previous work, Abernethy et al. (2015) proved that for a continuous random variable $Z$ with finite and nonnegative expectation and support on the whole real line $\mathbb{R}$, if the hazard rate of the random variable is bounded, i.e.,
$$\sup_{x \in \mathbb{R}} h(x) = \sup_{x \in \mathbb{R}} \frac{f(x)}{1 - F(x)} < \infty,$$
then the expected regret of GBPA can be upper bounded as
[TABLE]
Common families of distributions whose regret can be controlled in this way include Gumbel, Frechet, Weibull, Pareto, and Gamma (see Abernethy et al. (2015) for details). However, there are many other families of distributions where the hazard rate condition fails. For example, if the random variable has a bounded support, then the hazard rate would certainly explode at the end of the support. This is, in some sense, an extreme case of violation because the random variable does not even have a tail. There are also some random variables that do have support on $\mathbb{R}$ but have unbounded hazard rate, e.g. Gaussian, where the hazard rate monotonically increases to infinity. How can we perform analyses of the expected regret of GBPA using those random variables as perturbations? To address these issues, we need to go beyond the hazard rate.
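The contrast between a bounded and an unbounded hazard rate can be seen numerically. The sketch below evaluates the hazard rate (density divided by survival function) of the standard Gaussian and of a rate-one exponential distribution; the function names are hypothetical, and the evaluation points are kept moderate so that the Gaussian survival function does not underflow.

```python
import math


def gaussian_hazard(x: float) -> float:
    """Hazard rate f(x) / (1 - F(x)) of the standard Gaussian."""
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    survival = 0.5 * math.erfc(x / math.sqrt(2.0))  # 1 - F(x)
    return pdf / survival


def exponential_hazard(x: float, rate: float = 1.0) -> float:
    """Hazard rate of an exponential distribution, evaluated for x >= 0."""
    pdf = rate * math.exp(-rate * x)
    survival = math.exp(-rate * x)
    return pdf / survival


for x in [0.0, 2.0, 4.0, 6.0, 8.0]:
    print(x, gaussian_hazard(x), exponential_hazard(x))
```

The Gaussian column keeps growing with the evaluation point while the exponential column stays constant, which is the behaviour the bounded hazard rate condition rules out.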
3 Perturbations with Bounded Support
In this section, we prove that GBPA with any continuous distribution that has bounded support and bounded density enjoys sublinear expected regret. From Lemma 2.1 we see that the expected regret can be upper bounded by the sum of three terms. The overestimation penalty can be bounded very easily via Lemma 2.2 for a distribution with bounded support. The underestimation penalty is non-positive as long as the distribution has non-negative expectation. The only term that needs to be controlled with some effort is the divergence penalty.
We first present a general lemma that allows us to write the divergence penalty for a stochastically smoothed potential as a sum involving certain double integrals.
Lemma 3.1.
When using a stochastically smoothed potential as in (11), the divergence penalty can be written as
[TABLE]
where , and .
Proof.
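To reduce clutter, we drop the time subscripts: we use $\hat G$ to denote the cumulative estimate $\hat G_{t-1}$, $\hat g$ to denote the marginal estimate $\hat g_t = \hat G_t - \hat G_{t-1}$, $p$ to denote $p_t$, and $g$ to denote the true gain $g_t$. Note that by definition of Framework 1, $\hat g$ is a sparse vector with one non-zero and non-positive coordinate $\hat g_{i_t} = g_{i_t}/p_{i_t}$. Moreover, conditioned on $i_{1:t-1}$, $\hat g$ takes value $(g_i/p_i)\, e_i$ with probability $p_i$. For any $i \in [N]$, let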
[TABLE]
so that and . Now we write:
[TABLE]
The second equality on the first line implicitly used the assumption that , i.e, the “gains” are non-positive. The second equality on the second line used that , and the equality on the fourth line used Equation (15). ∎
Note that each summand in the divergence penalty expression above involves an integral of the density function of the distribution over an interval. The main idea to control the divergence penalty for a bounded support distribution is to truncate the interval at the end of the support. For points that are close to the end of the support, we bound the integral by the product of the bound on the density and the interval length. For points that are far from the end of the support, we bound the integral through the hazard rate as was done by Abernethy et al. (2015).
For a general continuous random variable with bounded density, bounded support, we first shift it (which obviously does not change the distribution of the random action choice and hence the expected regret) and scale it so that the support is a subset of with and where denotes the CDF of . A benefit of this normalization is that the expectation of the random variable becomes non-negative so the underestimation penalty is guaranteed to be non-positive. After scaling, we assume that the bound on the density is . We consider the perturbation where is a tuning parameter. Write and to denote the CDF and PDF of the scaled random variable respectively. If is strictly increasing, we know that exists. If not, define . Elementary calculation gives the following useful facts:
[TABLE]
Theorem 3.2.
(Divergence Penalty Control, Bounded Support) The divergence penalty in the GBPA regret bound using the scaled perturbation $\eta Z$, where $Z$ is drawn from a bounded support distribution satisfying the conditions above, can be upper bounded, for any $\epsilon > 0$, by
[TABLE]
Proof.
From Lemma 3.1, we have, with ,
[TABLE]
We bound the two integrals above differently. For the first integral, we add the restriction by intersecting the integral interval with the support of the function , denoted as so that is not [math] on the interval to be integrated. Thus, we get,
[TABLE]
The first inequality holds because and on the set of ’s over which we are integrating. The second inequality holds because on the set under consideration and the measure of the set is at most .
For the second integral, we use the bound again to get,
[TABLE]
Plugging (18) and (19) into (17), we can bound the divergence penalty by,
[TABLE]
The second to last inequality holds because and the last inequality holds because the sum over is at most over all arms. ∎
The regret bound for the uniform distribution is now an easy corollary.
Corollary 3.3.
(Regret Bound for Uniform)* For GBPA run with a stochastically smoothed potential using an appropriately scaled uniform perturbation where , the expected regret can be upper bounded by .*
Proof.
For uniform distribution, we have , so the divergence penalty is upper bounded by
[TABLE]
If we let , we can see that the divergence penalty is upper bounded by . Together with the overestimation penalty which is trivially bounded by and a non-positive underestimation penalty, we see that the final regret bound is
[TABLE]
Setting gives the desired result. ∎
For a general perturbation with bounded support and bounded density, the rate at which goes to [math] as can vary but we can always guarantee sublinear expected regret.
Corollary 3.4.
(Asymptotic Regret Bound for Bounded Support)* For stochastically smoothed GBPA using general continuous random variable where has bounded density and bounded support contained in and , the expected regret grows sublinearly, i.e.,*
[TABLE]
Proof.
For a general distribution, let . Since the overestimation penalty is trivially bounded by and the underestimation penalty is non-positive, the expected regret can be upper bounded by
[TABLE]
Setting we see that the expected regret can be upper bounded by
[TABLE]
Since
[TABLE]
we conclude that
[TABLE]
∎
4 Perturbations with Unbounded Support
Unlike perturbations with bounded support, perturbations with unbounded support (on the right) do have non-zero right tail probabilities, ensuring that always. However, the tail behavior may be such that the hazard rate is unbounded. Still, under mild assumptions, perturbations with unbounded support (on the right) can also be shown to have near optimal expected regret in , using the notion of generalized hazard rate that we now introduce.
4.1 Generalized Hazard Rate
We already know how to control the underestimation and overestimation penalties via Lemma 2.2. So our main focus will be to control the divergence penalty. Towards this end, we define the generalized hazard rate for a continuous random variable $Z$ with support unbounded on the right, parameterized by $\alpha \in [0, 1)$, as
$$h_\alpha(x) := \frac{f(x)\, |x|^{\alpha}}{(1 - F(x))^{1 - \alpha}},$$
where $f$ and $F$ denote the PDF and CDF of $Z$ respectively. Note that by setting $\alpha = 0$ we recover the standard hazard rate.
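To make the definition concrete, the sketch below evaluates a generalized hazard rate of the form density times $|x|^\alpha$ divided by the survival function raised to the power $1 - \alpha$, for the standard Gaussian. The formula coded here is an assumption consistent with the remark that $\alpha = 0$ recovers the standard hazard rate; the function name and the evaluation grid are hypothetical, and a finite grid can only suggest boundedness, not prove it.

```python
import math


def generalized_hazard(x: float, alpha: float) -> float:
    """Assumed form: f(x) * |x|**alpha / (1 - F(x))**(1 - alpha), standard Gaussian."""
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    survival = 0.5 * math.erfc(x / math.sqrt(2.0))  # 1 - F(x)
    return pdf * abs(x) ** alpha / survival ** (1.0 - alpha)


grid = [i / 10.0 for i in range(-100, 101)]
for alpha in (0.0, 0.25, 0.5):
    print(alpha, max(generalized_hazard(x, alpha) for x in grid))
```

With `alpha = 0.0` the maximum over the grid is attained at the right end and keeps growing if the grid is extended, whereas for strictly positive `alpha` the values decay again for large arguments.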
One of the main results of this paper is the following. Note that it includes the result (Lemma 4.3) of Abernethy et al. (2015) as a special case.
Theorem 4.1.
(Divergence Penalty Control via Generalized Hazard Rate) Let $\alpha \in [0, 1)$. Suppose we have $h_\alpha(x) \le C$ for all $x \in \mathbb{R}$. Then,
[TABLE]
Proof.
Because of the unbounded support of , . Lemma 3.1 gives us:
[TABLE]
Since the function is symmetric in , monotonically decreasing as , we have
[TABLE]
Also, note that is a concave function in . Hence, by Jensen’s inequality,
[TABLE]
Therefore,
[TABLE]
∎
A regret bound now easily follows.
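Theorem 4.2.
(Regret Bound via Generalized Hazard Rate) Suppose we use a stochastically smoothed GBPA with perturbation $\eta Z$, with $Z$’s generalized hazard rate being bounded: $h_\alpha(x) \le C$ for all $x \in \mathbb{R}$ for some $\alpha \in [0, 1)$, and
$$\mathbb{E}_{Z_1, \ldots, Z_N}\big[\max_i Z_i\big] \le Q(N),$$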
where is some function of . Then, if we set , the expected regret of GBPA is no greater than
[TABLE]
In particular, this implies that the algorithm has sublinear expected regret.
Proof.
The divergence penalty can be controlled through Theorem 4.1 once we have bounded generalized hazard rate. It remains to control the overestimation and underestimation penalty. By Lemma 2.2, they are at most and respectively. Suppose we scale the perturbation by , i.e., we add to each coordinate. It is easy to see that and . For the divergence penalty, observe that and thus . Hence, the bound on the generalized hazard rate for perturbation is . Plugging new bounds for the scaled perturbations into Lemma 2.1 gives us
[TABLE]
Setting finishes the proof. ∎
4.2 Gaussian Perturbation
In this section we prove that GBPA with the standard Gaussian perturbation incurs a near optimal expected regret in both and . Let and denote the CDF and PDF of standard Gaussian distribution.
Lemma 4.3 (Baricz (2008)).
For standard Gaussian random variable, we have
[TABLE]
This lemma together with example 2.6 in Thomas (1971) show that the hazard rate of a standard Gaussian random variable increases monotonically to infinity. However, we can still bound the generalized hazard rate for strictly positive .
Lemma 4.4.
(Generalized Hazard Bound for Gaussian) For any $\alpha \in (0, 1)$, we have
[TABLE]
The proof of this lemma is deferred to the appendix.
The bounded generalized hazard rate shown in the above lemma can be used to control the divergence penalty. Combined with other knowledge of the standard Gaussian random variable we are able to give a bound on the expected regret.
Corollary 4.5.
The expected regret of GBPA with an appropriately scaled standard Gaussian random variable as perturbation where has an expected regret at most
[TABLE]
where , , for any .
Proof.
It is known that for standard Gaussian random variable, we have and
[TABLE]
Plugging these into Theorem 4.2 gives the result. ∎
It remains to optimally tune in the above bound.
Theorem 4.6.
(Regret Bound for Gaussian)* The expected regret of GBPA with an appropriately scaled standard Gaussian random variable as perturbation where and has an expected regret at most*
[TABLE]
for . If we assume that , the expected regret can be upper bounded by
[TABLE]
The proof of this theorem is also deferred to the appendix.
4.3 Sufficient Condition for Near Optimal Regret
In Section 4.1 we showed that if the generalized hazard rate of a distribution is bounded, the expected regret of the GBPA can be controlled. In this section, we are going to prove that under reasonable assumptions on the distribution of the perturbation, the FTPL enjoys near optimal expected regret. Note that most proofs in this section are deferred to the appendix.
Assumptions (a)-(c). Before we proceed, let us formally state our assumptions on the distributions we will consider. The distribution needs to (a) be continuous and have bounded density, (b) have finite expectation, and (c) have support unbounded in the $+\infty$ direction.
Note that if the expectation of the random perturbation is negative, we shift it so that the expectation is zero. Hence the underestimation penalty is non-positive. In addition to the assumptions we have made above, we make another assumption on the eventual monotonicity of the hazard rate.
Assumption (d). The hazard rate $h(x) = f(x)/(1 - F(x))$ is eventually monotone.
“Eventually monotone” means that there exists $x_0$ such that, for $x > x_0$, $h(x)$ is either non-decreasing or non-increasing. This assumption might appear hard to check, but numerous theorems are available to establish the monotonicity of the hazard rate, which is much stronger than what we are assuming here. For example, see Theorem 2.4 in Thomas (1971), Theorem 2 and Theorem 4 in Chechile (2003), Chechile (2009). In fact, most natural distributions do satisfy this assumption (Bagnoli and Bergstrom, 2005).
Before we proceed, we mention a standard classification of random variables into two classes based on their tail property.
Definition 4.7 (see, for example, Foss et al. (2009)).
A function $u \ge 0$ is said to be heavy-tailed if and only if
$$\limsup_{x \to \infty} u(x)\, e^{\lambda x} = \infty \quad \text{for all } \lambda > 0.$$
A distribution with CDF $F$ and $\bar F = 1 - F$ is said to be heavy-tailed if and only if $\bar F$ is heavy-tailed. If the distribution is not heavy-tailed, we say that it is light-tailed.
It turns out that under assumptions (a)-(d), if the distribution is also heavy-tailed, then the hazard rate itself is bounded. If the distribution is light-tailed, we need an additional assumption on the eventual monotonicity of a function similar to the generalized hazard rate to ensure the boundedness of the generalized hazard rate. But before we state and prove the main results, we introduce some functions and prove an intermediate lemma that will be useful to prove the main results.
Define so that we have and .
Lemma 4.8.
Under assumptions (a)-(d), we have
[TABLE]
Proof.
Let , then . Since by assumption (d), is eventually positive, negative or zero. The lemma immediately follows. ∎
We are finally ready to present the main results in this section.
Theorem 4.9.
(Heavy Tail Implies Bounded Hazard) Under assumptions (a)-(d), if the distribution is also heavy-tailed, then the hazard rate is bounded, i.e.,
$$\sup_{x \in \mathbb{R}} \frac{f(x)}{1 - F(x)} < \infty.$$
Unlike heavy-tailed distributions, the hazard rate of light-tailed distributions might be unbounded. However, it turns out that if we make an additional assumption on the eventual monotonicity of a function similar to the generalized hazard rate, we can still guarantee the boundedness of the generalized hazard rate.
[TABLE]
Theorem 4.10.
(Light Tail Implies Bounded Generalized Hazard)* Under assumptions (a) - (e), if the distribution is also light-tailed, then for any , the generalized hazard rate is bounded, i.e,*
[TABLE]
Combining the above result with control of the divergence penalty gives us the following corollary.
Corollary 4.11.
Under assumptions (a)-(e), if the distribution is also light-tailed, the expected regret of GBPA with appropriately scaled perturbations drawn from that distribution is, for all and ,
[TABLE]
In particular, if assumption (e) holds for all , then the expected regret of GBPA is $O\big((TN)^{1/2+\epsilon}\big)$ for all $\epsilon > 0$, i.e., it is near optimal in both $N$ and $T$.
Next we consider a family of light-tailed distributions that do not have a bounded hazard rate.
Definition 4.12.
The exponential power (or generalized normal) family of distributions, denoted as $D_\beta$ where $\beta > 1$, is defined via the cdf
[TABLE]
The next theorem shows that GBPA with perturbations from this family of distributions enjoys near optimal expected regret in both $N$ and $T$.
Theorem 4.13.
(Regret Bound for Exponential Power Family) , the expected regret of GBPA with appropriately scaled perturbations drawn from is, for all $\epsilon > 0$, $O\big((TN)^{1/2+\epsilon}\big)$.
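The unbounded hazard rate of this family can be seen numerically. The sketch below assumes the common parametrization of the generalized normal density, proportional to exp(-|x|^beta) with unit scale, which may differ from the normalization used in Definition 4.12; the value beta = 1.5 and the truncation point of the integral are likewise arbitrary choices for illustration.

```python
import math

def exp_power_density(x, beta):
    """Density proportional to exp(-|x|**beta), normalized via the gamma function."""
    return beta / (2.0 * math.gamma(1.0 / beta)) * math.exp(-abs(x) ** beta)

def exp_power_survival(x, beta, upper=15.0, steps=20000):
    """Approximate 1 - F(x) with the trapezoidal rule on [x, upper].

    The tail beyond `upper` is ignored, which is negligible for the
    values of x and beta used in this sketch.
    """
    width = (upper - x) / steps
    total = 0.5 * (exp_power_density(x, beta) + exp_power_density(upper, beta))
    for k in range(1, steps):
        total += exp_power_density(x + k * width, beta)
    return total * width

def exp_power_hazard(x, beta):
    """Ordinary hazard rate f(x) / (1 - F(x))."""
    return exp_power_density(x, beta) / exp_power_survival(x, beta)

beta = 1.5
grid = [0.5 * k for k in range(0, 9)]  # x in [0, 4]
hazards = [exp_power_hazard(x, beta) for x in grid]
for x, h in zip(grid, hazards):
    print(x, h)
```

Under these assumptions the printed hazard values increase along the grid, consistent with a light-tailed family whose hazard rate is not bounded.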
5 Conclusion and Future Work
Previous work on providing regret guarantees for FTPL algorithms in the adversarial multi-armed bandit setting required a bounded hazard rate condition. We have shown how to go beyond the hazard rate condition, but a number of questions remain open. For example, what if we use FTPL with perturbations from discrete distributions such as the Bernoulli distribution? In the full information setting, Devroye et al. (2013) and Van Erven et al. (2014) have considered random walk perturbation and dropout perturbation, both leading to minimax optimal regret. But to the best of our knowledge, those distributions have not been analyzed in the adversarial multi-armed bandit problem.
An unsatisfactory aspect of even the tightest bounds for FTPL algorithms from existing work, including ours, is that they never reach the minimax optimal bound. They come very close to it: up to logarithmic factors. It is known that FTRL algorithms, using the negative Tsallis entropy as the regularizer, can achieve the optimal bound (Audibert and Bubeck, 2009; Audibert et al., 2011; Abernethy et al., 2015). Is there a perturbation that can achieve the optimal bound?
We only considered multi-armed bandits in this work. There has been some interest in using FTPL algorithms for combinatorial bandit problems (see, for example, Neu and Bartók (2013)). In future work, it will be interesting to extend our analysis to combinatorial bandit problems.
Acknowledgments.
We thank Jacob Abernethy and Chansoo Lee for helpful discussions. We acknowledge the support of NSF CAREER grant IIS-1452099 and a Sloan Research Fellowship.
Appendix A Proofs
A.1 Proof of Lemma 4.4
Proof.
Since the numerator of the left hand side is an even function of , and the denominator is a decreasing function, and the inequality is trivially true when , it suffices to prove for , which we assume for the rest of the proof. From Lemma 4.3 we can derive that
[TABLE]
Therefore,
[TABLE]
Let , . Therefore is maximized at . Therefore,
[TABLE]
∎
A.2 Proof of Theorem 4.6
Proof.
From Corollary 4.5 we see that the expected regret can be upper bounded by
[TABLE]
where and . Note that
[TABLE]
If we let , then . Then, we have, for ,
[TABLE]
Putting things together finishes the proof.
∎
A.3 Proof of Theorem 4.9
Proof.
If the distribution is heavy-tailed, we have
[TABLE]
By Lemma 4.8, we can erase the supremum operator and just write
[TABLE]
Hence,
[TABLE]
Note that , which is eventually monotone by assumption. Therefore, we can conclude that
[TABLE]
∎
A.4 Proof of Theorem 4.10
Proof.
If the distribution is light-tailed, we have
[TABLE]
This immediately implies that
[TABLE]
Consider . If we can immediately conclude that . If instead, note that
[TABLE]
Moreover, since , is strictly positive for all for some . Furthermore, is eventually monotone by assumption (e).
Therefore, we can conclude that
[TABLE]
, from Equation (22) we have , so
[TABLE]
and hence
[TABLE]
∎
A.5 Proof of Corollary 4.11
Proof.
For a light-tailed distribution , we have
[TABLE]
This implies that
[TABLE]
Let the random variable follow distribution . Since might take negative values, we define a new distribution that takes only non-negative values by
[TABLE]
where by the right-unbounded support assumption. Clearly, with this definition of we see that and for , we have where . Note that
[TABLE]
If we let , obviously if is sufficiently large. Thus, we see that
[TABLE]
From Theorem 4.10 we see that ,
[TABLE]
Plugging Equations (23) and (24) into Theorem 4.2 gives the desired result.
∎
A.6 Proof of Theorem 4.13
Proof.
By Corollary 4.11, we only need to check that assumptions (a)-(d) hold for distribution , that the exponential power family is light-tailed, and that assumption (e) also holds for any . By inspecting the density function, we can see that assumptions (a)-(c) hold and that the exponential power family is light-tailed. Defining
[TABLE]
it suffices to show that is eventually monotone. Note that
[TABLE]
It further suffices to show that
[TABLE]
is eventually non-negative or non-positive . Note that since ,
[TABLE]
Therefore, for all , i.e., the hazard rate is always increasing and assumption (d) is satisfied. It remains to show that is eventually non-negative or non-positive for any . Note that
[TABLE]
Therefore,
[TABLE]
From Equation (25) we know that
[TABLE]
Hence, we conclude that
[TABLE]
which implies that is eventually non-positive for any , i.e., assumption (e) holds for any .
∎
- Abernethy et al. [2014] Jacob Abernethy, Chansoo Lee, Abhinav Sinha, and Ambuj Tewari. Online linear optimization via smoothing. In COLT, pages 807–823, 2014.
- Abernethy et al. [2015] Jacob Abernethy, Chansoo Lee, and Ambuj Tewari. Fighting bandits with a new kind of smoothness. In Advances in Neural Information Processing Systems 28, pages 2188–2196, 2015.
- Audibert and Bubeck [2009] Jean-Yves Audibert and Sébastien Bubeck. Minimax policies for adversarial and stochastic bandits. In COLT, pages 217–226, 2009.
- Audibert et al. [2011] Jean-Yves Audibert, Sébastien Bubeck, and Gábor Lugosi. Minimax policies for combinatorial prediction games. In COLT, 2011.
- Auer et al. [2002] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The non-stochastic multi-armed bandit problem. SIAM J. Comput., 32:48–77, 2002.
- Bagnoli and Bergstrom [2005] Mark Bagnoli and Ted Bergstrom. Log-concave probability and its applications. Economic Theory, 26(2):445–469, 2005.
- Baricz [2008] Árpád Baricz. Mills' ratio: Monotonicity patterns and functional inequalities. J. Math. Anal. Appl., 340(2):1362–1370, 2008.
- Bertsekas [1973] Dimitri P. Bertsekas. Stochastic optimization problems with nondifferentiable cost functionals. Journal of Optimization Theory and Applications, 12(2):218–231, 1973. ISSN 0022-3239.
