Equipping Experts/Bandits with Long-term Memory
Kai Zheng, Haipeng Luo, Ilias Diakonikolas, Liwei Wang

TL;DR
This paper introduces a reduction-based method for achieving long-term memory guarantees in online learning, providing new algorithms with improved regret bounds and extending results to sparse bandit settings.
Contribution
It presents the first reduction-based approach for long-term memory guarantees in online learning, achieving optimal regret bounds and extending to sparse multi-armed bandits.
Findings
Developed algorithms with regret of order $O(\sqrt{T(S \ln T + n \ln K)})$.
Achieved simultaneous adaptation to stochastic and adversarial environments.
Provided lower bounds showing sparse losses do not improve worst-case regret.
Kai Zheng (1,2), Haipeng Luo (3), Ilias Diakonikolas (4), Liwei Wang (1,2)
(1) Key Laboratory of Machine Perception, MOE, School of EECS, Peking University
(2) Center for Data Science, Peking University
(3) University of Southern California
(4) University of Wisconsin-Madison
Abstract
We propose the first reduction-based approach to obtaining long-term memory guarantees for online learning in the sense of Bousquet and Warmuth [8], by reducing the problem to achieving typical switching regret. Specifically, for the classical expert problem with $K$ actions and $T$ rounds, using our framework we develop various algorithms with a regret bound of order $O(\sqrt{T(S \ln T + n \ln K)})$ compared to any sequence of experts with $S-1$ switches among $n \le \min\{S, K\}$ distinct experts. In addition, by plugging specific adaptive algorithms into our framework we also achieve the best of both stochastic and adversarial environments simultaneously. This resolves an open problem of Warmuth and Koolen [35]. Furthermore, we extend our results to the sparse multi-armed bandit setting and show both negative and positive results for long-term memory guarantees. As a side result, our lower bound also implies that sparse losses do not help improve the worst-case regret for contextual bandits, a sharp contrast with the non-contextual case.
1 Introduction
In this work, we propose a black-box reduction for obtaining long-term memory guarantees for two fundamental problems in online learning: the expert problem [17] and the multi-armed bandit (MAB) problem [6]. In both problems, a learner interacts with the environment for $T$ rounds, with $K$ fixed available actions. At each round, the environment decides the loss for each action while simultaneously the learner selects one of the actions and suffers the loss of this action. In the expert problem, the learner observes the loss of every action at the end of each round (a.k.a. full-information feedback), while in MAB, the learner only observes the loss of the selected action (a.k.a. bandit feedback).
For both problems, the classical performance measure is the learner’s (static) regret, defined as the difference between the learner’s total loss and the loss of the best fixed action. It is well-known that the minimax optimal regret is $\Theta(\sqrt{T \ln K})$ [17] and $\Theta(\sqrt{TK})$ [6, 4] for the expert problem and MAB respectively. Comparing against a fixed action, however, does not always lead to meaningful guarantees, especially when the environment is non-stationary and no single fixed action performs well. To address this issue, prior work has considered a stronger measure called switching/tracking/shifting regret, which is the difference between the learner’s total loss and the loss of a sequence of actions with at most $S-1$ switches. Various existing algorithms (including some black-box approaches) achieve the following switching regret:
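\[ O\big(\sqrt{TS \ln(TK)}\big) \quad \text{for the expert problem,} \tag{1} \]
\[ O\big(\sqrt{TKS \ln(TK)}\big) \quad \text{for multi-armed bandits.} \tag{2} \]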
We call these typical switching regret bounds. Such bounds essentially imply that the learner pays the worst-case static regret for each switch in the benchmark sequence. While this makes sense in the worst case, intuitively one would hope to perform better if the benchmark sequence frequently switches back to previous actions, as long as the algorithm remembers which actions have performed well previously.
Indeed, for the expert problem, algorithms with long-term memory were developed that guarantee switching regret of order $O(\sqrt{T(S \ln T + n \ln K)})$, where $n \le \min\{S, K\}$ is the number of distinct actions in the benchmark sequence [8, 2, 13]. (Footnote 1: The setting considered in [8, 2] is in fact slightly different from, yet closely related to, the expert problem. One can easily translate their regret bounds into the bounds we present here.) Although there is no known lower bound, this regret bound essentially matches the one achieved by a computationally inefficient approach of running Hedge over all benchmark sequences with $S$ switches among $n$ experts, an approach that usually leads to the information-theoretically optimal regret guarantee. Compared to the typical switching regret bound of form (1) (which can be written as $O(\sqrt{T(S \ln T + S \ln K)})$), this long-term memory guarantee implies that the learner pays the worst-case static regret only for each distinct action encountered in the benchmark sequence, and pays less for each switch, especially when $n$ is very small. Algorithms with long-term memory guarantees have been found to have better empirical performance [8], and have been applied in practice to TCP round-trip time estimation [30], intrusion detection systems [29], and multi-agent systems [31]. We are not aware of any similar studies for the bandit setting.
Overview of our contributions.
The main contribution of this work is to propose a simple black-box approach to equip expert or MAB algorithms with long-term memory and to achieve switching regret guarantees of similar flavor to those of [8, 2, 13]. The key idea of our approach is to utilize a variant of the confidence-rated expert framework of [7], and to use a sub-routine to learn the confidence/importance of each action for each time. Importantly this sub-routine itself is an expert/bandit algorithm over only two actions and needs to enjoy some typical switching regret guarantee (for example of form (1) for the expert problem). In other words, our approach reduces the problem of obtaining long-term memory to the well-studied problem of achieving typical switching regret. Compared to existing methods [8, 2, 13], the advantages of our approach are the following:
- While existing methods are all restricted to variants of the classical Hedge algorithm [17], our approach allows one to plug in a variety of existing algorithms and to obtain a range of different algorithms with switching regret $O(\sqrt{T(S \ln T + n \ln K)})$. (Section 3.1)
- Due to this flexibility, by plugging in specific adaptive algorithms, we develop a parameter-free algorithm whose switching regret is simultaneously in the worst-case and if the losses are piece-wise stochastic (see Section 2 for the formal definition). This is a generalization of previous best-of-both-worlds results for static or switching regret [19, 27], and resolves an open problem of Warmuth and Koolen [35]. The best previous bound for the stochastic case is [27]. (Section 3.2)
- Our framework allows us to derive the first nontrivial long-term memory guarantees for the bandit setting, while existing approaches fail to do so (more discussion to follow). For example, when $n$ is a constant and the losses are sparse, our algorithm achieves switching regret $O(S^{1/3} T^{2/3} + K^3 \ln T)$ for MAB, which is better than the typical bound (2) when $S$ and $K$ are large. For instance, when $S = \Theta(T^{7/10})$ and $K = \Theta(T^{3/10})$, our bound is of order $O(T^{9/10} \ln T)$ while bound (2) becomes vacuous (linear in $T$), demonstrating a strict separation in learnability. (Section 4)
To motivate our results on long-term memory guarantees for MAB, a few remarks are in order. It is not hard to verify that existing approaches achieve switching regret for MAB. However, the polynomial dependence on the number of actions makes the improvement of this bound over the typical bound (2) negligible. It is well-known that such polynomial dependence on is unavoidable in the worst-case due to the bandit feedback. This motivates us to consider situations where the necessary dependence on is much smaller. In particular, Bubeck et al. [10] recently showed that if the loss vectors are -sparse, then a static regret bound of order is achievable, exhibiting a much more favorable dependence on . We therefore focus on this sparse MAB problem and study what nontrivial switching regret bounds are achievable.
We first show that a bound of order , a natural generalization of the typical switching regret bound of (2) to the sparse setting, is impossible. In fact, we show that for any the worst-case switching regret is at least , even when . Since achieving switching regret for MAB can be seen as a special case of contextual bandits [6, 26], this negative result also implies that, surprisingly, sparse losses do not help improve the worst-case regret for contextual bandits, which is a sharp contrast with the non-contextual case studied in [10] (see Theorem 6 and Corollary 7). Despite this negative result, however, as mentioned we are able to utilize our general framework to still obtain improvements over bound (2) when is small. Our construction is fairly sophisticated, requiring a special sub-routine that uses a novel one-sided log-barrier regularizer and admits a new kind of “local-norm” guarantee, which may be of independent interest.
2 Preliminaries
Throughout the paper, we use $[m]$ to denote the set $\{1, \ldots, m\}$ for some integer $m$. The learning protocol for the expert problem and MAB with $K$ actions and $T$ rounds is as follows: For each time $t = 1, \ldots, T$, (1) the learner first randomly selects an action $I_t \in [K]$ according to a distribution $p_t \in \Delta_K$ (the $(K-1)$-dimensional simplex); (2) simultaneously the environment decides the loss vector $\ell_t$; (3) the learner suffers loss $\ell_t(I_t)$ and observes either $\ell_t$ in the expert problem (full-information feedback) or only $\ell_t(I_t)$ in MAB (bandit feedback). For any sequence of $T$ actions $i_1, \ldots, i_T \in [K]$, the expected regret of the learner against this sequence is defined as
\[ R(i_{1:T}) = \mathbb{E}\Big[\sum_{t=1}^{T} \ell_t(I_t) - \sum_{t=1}^{T} \ell_t(i_t)\Big] = \mathbb{E}\Big[\sum_{t=1}^{T} r_t(i_t)\Big], \]
where the expectation is with respect to both the learner and the environment and $r_t(i)$, the instantaneous regret (against action $i$), is defined as $\langle p_t, \ell_t \rangle - \ell_t(i)$. When $i_1 = \cdots = i_T$, this becomes the traditional static regret against a fixed action. Most existing works on switching regret impose a constraint on the number of switches for the benchmark sequence: $\sum_{t=2}^{T} \mathbb{1}\{i_t \neq i_{t-1}\} \le S - 1$. In other words, the sequence can be decomposed into $S$ disjoint intervals, each with a fixed comparator as in static regret. Typical switching regret bounds hold for any sequence with this constraint and are in terms of $T$, $K$, and $S$, such as Eq. (1) and Eq. (2).
The number of switches, however, does not fully characterize the difficulty of the problem. Intuitively, a sequence that frequently switches back to previous actions should be an easier benchmark for an algorithm with long-term memory that remembers which actions performed well in the past. To encode this intuition, prior works [8, 2, 13] introduced another parameter $n$, the number of distinct actions in the sequence, to quantify the difficulty of the problem, and developed switching regret bounds in terms of $T$, $K$, $S$, and $n$. Clearly one has $n \le \min\{S, K\}$, and we are especially interested in the case when $n \ll \min\{S, K\}$, which is natural if the data exhibits some periodic pattern. Our goal is to understand what improvements are achievable in this case and how to design algorithms that can leverage this property via a unified framework.
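To make the two complexity parameters concrete, the following minimal sketch computes the number of intervals and the number of distinct actions of a benchmark sequence. The function name `benchmark_complexity` and the example sequence are illustrative assumptions, not part of the paper.

```python
from typing import Sequence, Tuple


def benchmark_complexity(benchmark: Sequence[int]) -> Tuple[int, int]:
    """Return (S, n) for a benchmark action sequence i_1, ..., i_T.

    S is the number of intervals (number of switches plus one) and
    n is the number of distinct actions appearing in the sequence.
    """
    if not benchmark:
        raise ValueError("benchmark sequence must be non-empty")
    switches = sum(1 for prev, cur in zip(benchmark, benchmark[1:]) if prev != cur)
    return switches + 1, len(set(benchmark))


# A periodic benchmark alternating between two actions:
# the number of intervals grows with the horizon, the number of distinct actions does not.
periodic = [0] * 5 + [3] * 5 + [0] * 5 + [3] * 5
S, n = benchmark_complexity(periodic)
print(S, n)  # prints "4 2"
```

Doubling the length of this periodic sequence doubles the number of intervals while the number of distinct actions stays at 2, which is the regime the long-term memory bounds target.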
Stochastic setting.
In general, we do not make any assumptions on how the losses are generated by the environment, which is known as the adversarial setting in the literature. We do, however, develop an algorithm (for the expert problem) that enjoys the best of both worlds: it not only enjoys some robust worst-case guarantee in the adversarial setting, but also achieves much smaller logarithmic regret in a stochastic setting. Specifically, in this stochastic setting, without loss of generality, we assume the $n$ distinct actions in $i_{1:T}$ are $1, \ldots, n$. It is further assumed that for each $i \in [n]$, there exists a constant gap $\Delta_i > 0$ such that $\mathbb{E}_t[\ell_t(j) - \ell_t(i)] \ge \Delta_i$ for all $j \neq i$ and all $t$ such that $i_t = i$, where the expectation $\mathbb{E}_t$ is with respect to the randomness of the environment conditioned on the history up to the beginning of round $t$. In other words, for every time step the algorithm is compared to the best action whose expected value is constant away from those of other actions. This is a natural generalization of the stochastic setting studied for static regret or typical switching regret [19, 27].
Confidence-rated actions.
Our approach makes use of the confidence-rated expert setting of Blum and Mansour [7], a generalization of the expert problem (and the sleeping expert problem [18]). The protocol of this setting is the same as the expert problem, except that at the beginning of each round, the learner first receives a confidence score $z_t(i) \in [0, 1]$ for each action $i$. The regret against a fixed action $i$ is also scaled by its confidence and is now defined as $\mathbb{E}\big[\sum_{t=1}^{T} z_t(i)\, r_t(i)\big]$. The expert problem is clearly a special case with $z_t(i) = 1$ for all $t$ and $i$. There are a number of known examples showing why this formulation is useful, and our work will add one more to this list.
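To obtain a bound on this new regret measure, one can in fact simply reduce it to the regular expert problem [7, 19, 27]. Specifically, let $\mathcal{A}$ be some expert algorithm over the same $K$ actions producing sampling distributions $w_1, \ldots, w_T \in \Delta_K$. The reduction works by sampling $I_t$ according to $p_t$ such that $p_t(i) \propto z_t(i)\, w_t(i)$ and then feeding $c_t$ to $\mathcal{A}$ where $c_t(i) = -z_t(i)\, r_t(i)$. Note that by the definition of $p_t$ one has $\langle w_t, c_t \rangle = -\sum_{i} w_t(i)\, z_t(i)\, r_t(i) = 0$. Therefore, one can directly equalize the confidence-rated regret and the regular static regret of the reduced problem: $\sum_{t=1}^{T} z_t(i)\, r_t(i) = \sum_{t=1}^{T} \big(\langle w_t, c_t \rangle - c_t(i)\big)$.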
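The reduction can be checked numerically on a single round. The sketch below assumes the standard form of the reduction from [7, 19, 27]: sample in proportion to confidence times base weight, and feed back the negated confidence-scaled instantaneous regret. The function name `confidence_rated_step` and all variable names are illustrative, not taken from the paper.

```python
import random


def confidence_rated_step(w, z, losses):
    """One round of the reduction from confidence-rated regret to static regret.

    w: distribution proposed by the base expert algorithm (sums to 1)
    z: confidence scores in [0, 1], one per action
    losses: loss vector for this round
    Returns (p, r, c): sampling distribution, instantaneous regrets,
    and the loss vector fed to the base algorithm.
    """
    weights = [zi * wi for zi, wi in zip(z, w)]
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one action needs positive confidence and weight")
    p = [x / total for x in weights]
    learner_loss = sum(pi * li for pi, li in zip(p, losses))
    r = [learner_loss - li for li in losses]  # instantaneous regret per action
    c = [-zi * ri for zi, ri in zip(z, r)]    # loss fed to the base algorithm
    return p, r, c


random.seed(0)
K = 5
raw = [random.random() for _ in range(K)]
w = [x / sum(raw) for x in raw]
z = [random.random() for _ in range(K)]
losses = [random.uniform(-1, 1) for _ in range(K)]

p, r, c = confidence_rated_step(w, z, losses)
inner = sum(wi * ci for wi, ci in zip(w, c))
# The base algorithm's expected loss vanishes, so its per-round static regret
# against action i equals the confidence-scaled regret z(i) * r(i).
print(abs(inner) < 1e-12)
```

Because the inner product of the base distribution with the fed-back losses is zero, the base algorithm's regret against action i in this round is just the negated fed-back loss of i, which is the confidence-scaled regret.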
3 General Framework for the Expert Problem
In this section, we introduce our general framework to obtain long-term memory regret bounds and demonstrate how it leads to various new algorithms for the expert problem. We start with a simpler version and then move on to a more elaborate construction that is essential to obtain best-of-both-worlds results.
3.1 A simple approach for adversarial losses
A simple version of our approach is described in Algorithm 1. At a high level, it simply makes use of the confidence-rated action framework described in Section 2. The reduction to the standard expert problem is executed in Lines 1 and 1, with a black-box expert algorithm .
It remains to specify how to come up with the confidence score $z_t(i)$. We propose to learn these scores via a separate black-box expert algorithm $\mathcal{A}_i$ for each $i \in [K]$. More specifically, each $\mathcal{A}_i$ is learning over two actions 0 and 1, where action 0 corresponds to confidence score 0 and action 1 corresponds to score 1. Therefore, the probability of picking action 1 at time $t$ naturally represents a confidence score between 0 and 1, which we denote by $z_t(i)$, overloading the notation (Line 1).
As for the losses fed to , we fix the loss of action 0 to be 0 (since shifting losses by the same amount has no real effect), and set the loss of action 1 to be (Line 1). The role of the term is intuitively clear — the larger the loss of action compared to the algorithm, the less confident we should be about it; the role of the constant bias term will become clear in the analysis (in fact, it can even be removed at the cost of a worse bound — see Appendix B.2).
Finally we specify what properties we require from the black-box algorithms . In short, needs to ensure a static regret bound, while need to ensure a switching regret bound. See Figure 1 for an illustration of our reduction. The trick is that since are learning over only two actions, this construction helps us to separate the dependence on and the number of switches . These (static or switching) regret bounds could be the standard worst-case -dependent bounds mentioned in Section 1, in which case we would obtain looser long-term memory guarantees (specifically, times worse — see Appendix B.2). Instead, we require these bounds to be data-dependent and in particular of the form specified below:
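The two-level construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's Algorithm 1: plain Hedge stands in for the static-regret algorithm, Fixed-share over two actions stands in for each switching-regret sub-routine, and the bias constant `5 * eta`, the class names, and the function `long_term_memory_experts` are hypothetical choices. Losses are assumed to lie in [0, 1] for the example.

```python
import math


class Hedge:
    """Exponential weights over k actions (stand-in for the static-regret algorithm)."""

    def __init__(self, k, eta):
        self.eta = eta
        self.log_w = [0.0] * k

    def distribution(self):
        m = max(self.log_w)
        w = [math.exp(x - m) for x in self.log_w]
        s = sum(w)
        return [x / s for x in w]

    def update(self, losses):
        self.log_w = [lw - self.eta * l for lw, l in zip(self.log_w, losses)]


class FixedShare(Hedge):
    """Hedge plus uniform mixing (stand-in for a switching-regret sub-routine)."""

    def __init__(self, k, eta, alpha):
        super().__init__(k, eta)
        self.alpha = alpha  # must be positive so that every weight stays positive

    def update(self, losses):
        super().update(losses)
        k = len(self.log_w)
        mixed = [(1 - self.alpha) * q + self.alpha / k for q in self.distribution()]
        self.log_w = [math.log(q) for q in mixed]


def long_term_memory_experts(loss_rounds, eta, alpha):
    """Run the two-level scheme; returns the learner's total expected loss."""
    k = len(loss_rounds[0])
    meta = Hedge(k, eta)
    subs = [FixedShare(2, eta, alpha) for _ in range(k)]
    bias = 5 * eta  # illustrative bias constant (assumption, not from the source)
    total = 0.0
    for losses in loss_rounds:
        w = meta.distribution()
        z = [sub.distribution()[1] for sub in subs]  # confidence = P(action 1)
        weights = [zi * wi for zi, wi in zip(z, w)]
        norm = sum(weights)
        p = [x / norm for x in weights]
        learner_loss = sum(pi * li for pi, li in zip(p, losses))
        total += learner_loss
        r = [learner_loss - li for li in losses]      # instantaneous regrets
        meta.update([-zi * ri for zi, ri in zip(z, r)])
        for sub, ri in zip(subs, r):
            sub.update([0.0, -ri + bias])             # action 0 always has loss 0
    return total


# Periodic environment: the good action alternates between 0 and 3 every 100 rounds.
T, K = 2000, 8
rounds = [[0.0 if i == (0 if (t // 100) % 2 == 0 else 3) else 1.0 for i in range(K)]
          for t in range(T)]
print(long_term_memory_experts(rounds, eta=0.1, alpha=1.0 / T))
```

The printed total depends heavily on how `eta` and `alpha` are tuned, so no particular value should be read as representative of the guarantees in the text.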
Condition 1.
There exists a constant such that for any and any loss sequence , algorithm (possibly with knowledge of ) produces sampling distributions and ensures one of the following static regret bounds:
[TABLE]
Condition 2.
There exists a constant such that for any , any loss sequence , and any , algorithm (possibly with knowledge of ) produces sampling distributions and ensures one of the following switching regret bounds against any sequence with :222In terms of notation in Algorithm 1, .
[TABLE]
We emphasize that these data-dependent bounds are all standard in the online learning literature, and provide a few examples below (see Appendix A for brief proofs). (Footnote 3: In fact, most standard bounds replace the absolute value we present here with a square, leading to even smaller bounds (up to a constant). We choose to use the looser ones with absolute values since this makes the conditions weaker while still being sufficient for all of our analysis.)
Proposition 1.
The following algorithms all satisfy Condition 1: Variants of Hedge [20, 34], Prod [12], Adapt-ML-Prod [19], AdaNormalHedge [27], and iProd/Squint [25].
Proposition 2.
The following algorithms all satisfy Condition 2: Fixed-share [23], a variant of Fixed-share (Algorithm 5 in Appendix A), and AdaNormalHedge.TV [27].
We are now ready to state the main result for Algorithm 1 (see Appendix B.1 for the proof).
Theorem 3.
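Suppose Conditions 1 and 2 both hold. With a suitably tuned learning rate $\eta$ (of order $\sqrt{(S \ln T + n \ln K)/T}$), Algorithm 1 ensures $R(i_{1:T}) = O(\sqrt{T(S \ln T + n \ln K)})$ for any loss sequence and benchmark sequence $i_1, \ldots, i_T$ such that $\sum_{t=2}^{T} \mathbb{1}\{i_t \neq i_{t-1}\} \le S - 1$ and $|\{i_1, \ldots, i_T\}| \le n$.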
Our bound in Theorem 3 is slightly worse than the existing bound of [8, 2], but still improves over the typical switching regret (Eq. (1)), especially when $n$ is small and $S$ and $K$ are large. (Footnote 4: In fact, using the adaptive guarantees of AdaNormalHedge [27] or iProd/Squint [25] that replace the $\ln K$ dependence in Eq. (4) by a KL divergence term, one can further improve the $n \ln K$ term in our bound to $n \ln(K/n)$, matching previous bounds. Since this improvement is small, we omit the details.) To better understand the implication of our bounds, consider the following thought experiment. If the learner knew about the switch points (that is, the rounds $t$ with $i_t \neq i_{t-1}$) that naturally divide the whole game into $S$ intervals, she could simply pick any algorithm with optimal static regret ($O(\sqrt{T \ln K})$) and apply $S$ instances of this algorithm, one for each interval, which, via a direct application of the Cauchy-Schwarz inequality, leads to switching regret $O(\sqrt{TS \ln K})$. Compared to bound (1), this implies that the price of not knowing the switch points is $O(\sqrt{TS \ln T})$. Similarly, if the learner knew not only the switch points, but also the information on which intervals share the same competitor, then she could naturally apply $n$ instances of the static algorithm, one for each set of intervals with the same competitor. Again by the Cauchy-Schwarz inequality, this leads to switching regret $O(\sqrt{Tn \ln K})$. Therefore, our bound implies that the price of not having any prior information of the benchmark sequence is still $O(\sqrt{TS \ln T})$.
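The two Cauchy-Schwarz steps in this thought experiment can be written out as follows. The interval lengths $T_j$ and the per-competitor round counts $T^{(i)}$ are notation introduced here only for illustration.

```latex
% T_j: length of the j-th interval, with \sum_{j=1}^{S} T_j = T.
% One static-regret instance per interval:
\sum_{j=1}^{S} \sqrt{T_j \ln K}
  \;\le\; \sqrt{S \sum_{j=1}^{S} T_j \ln K}
  \;=\; \sqrt{T S \ln K}.

% T^{(i)}: total number of rounds whose competitor is the i-th distinct action,
% with \sum_{i=1}^{n} T^{(i)} = T. One static-regret instance per distinct competitor:
\sum_{i=1}^{n} \sqrt{T^{(i)} \ln K}
  \;\le\; \sqrt{n \sum_{i=1}^{n} T^{(i)} \ln K}
  \;=\; \sqrt{T n \ln K}.
```

Both lines use $\sum_{k=1}^{m} \sqrt{a_k} \le \sqrt{m \sum_{k=1}^{m} a_k}$, with equality when all $a_k$ are equal, so the idealized bounds are attained when the intervals (or competitor groups) have equal length.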
Compared to existing methods, our framework is more flexible and allows one to plug in any combination of the algorithms listed in Propositions 1 and 2. This flexibility is crucial and allows us to solve the problems discussed in the following sections. The approach of [2] makes use of a sleeping expert framework, a special case of the confidence-rated expert framework. However, their approach is not a general reduction and does not allow plugging in different algorithms. Finally, we note that our construction also shares some similarity with the black-box approach of [14] for a multi-task learning problem.
3.2 Best of both worlds
To further demonstrate the power of our approach, we now show how to use our framework to construct a parameter-free algorithm that enjoys the best of both adversarial and stochastic environments, resolving the open problem of [35] (see Algorithm 2). The key is to derive an adaptive switching regret bound that replaces the dependence on by the sum of the magnitudes of the instantaneous regret , which previous works [19, 27] show is sufficient for adapting to the stochastic setting and achieving logarithmic regret.
To achieve this goal, the first modification we need is to change the bias term for the loss of action “1” for from to . Following the proof of Theorem 3, one can show that the dependence on now becomes for the regret against . If we could tune optimally in terms of this data-dependent quality, then this would imply logarithmic regret in the stochastic setting by the same reasoning as in [19, 27].
However, the difficulty is that the optimal tuning of is unknown beforehand, and more importantly, different actions require tuning differently. To address this issue, at a high level we discretize the learning rate and pick exponentially increasing values (Line 2), then we make copies of each action , one for each learning rate . More specifically, this means that the number of actions for increases from to , and so does the number of sub-routines with switching regret, now denoted as for and . Different copies of an action share the same loss for , while action “1” for now suffers loss (Line 2). The rest of the construction remains the same. Note that selecting a copy of an action is the same as selecting the corresponding action, which explains the update rule of the sampling probability in Line 2 that marginalizes over . Also note that for a vector in (e.g., ), we use to index its coordinates for and .
Finally, with this new construction, we need algorithm to exhibit a more adaptive static regret bound and in some sense be aware of the fact that different actions now correspond to different learning rates. More precisely, we replace Condition 1 with the following condition:
Condition 3.
There exists a constant such that for any and any loss sequence , algorithm (possibly with knowledge of ) produces sampling distributions and ensures the following static regret bounds: for all and :555In fact an analogue of Eq. (3) with individual learning rates would also suffice, but we are not aware of any algorithms that achieve such guarantee.
[TABLE]
Once again, this requirement is achievable by many existing algorithms and we provide some examples below (see Appendix A for proofs).
Proposition 4.
The following algorithms all satisfy Condition 3: A variant of Hedge (Algorithm 6 in Appendix A), Adapt-ML-Prod [19], AdaNormalHedge [27], and iProd/Squint [25].
We now state our main result for Algorithm 2 (see Appendix B.3 for the proof).
Theorem 5.
Suppose algorithm satisfies Condition 3 and all satisfy Condition 2. Algorithm 2 ensures that for any benchmark sequence such that and , the following hold:
- In the adversarial setting, we have
- In the stochastic setting (defined in Section 2), we have where s.t. . (Footnote 6: This definition of is the same as the one in the proof of Theorem 3.)
In other words, with a negligible price of for the adversarial setting, our algorithm achieves logarithmic regret in the stochastic setting with favorable dependence on and . The best prior result is achieved by AdaNormalHedge.TV [27], with regret for the adversarial case and for the stochastic case. We also remark that a variant of the algorithm of [8] with a doubling trick can achieve a guarantee similar to ours, but weaker in the sense that each is replaced by . To the best of our knowledge this was previously unknown and we provide the details in Appendix B.4 for completeness.
4 Long-term Memory under Bandit Feedback
In this section, we move on to the bandit setting where the learner only observes the loss of the selected action $\ell_t(I_t)$ instead of $\ell_t$. As mentioned in Section 1, one could directly generalize the approach of [8, 2, 13] to obtain a bound of order $O(\sqrt{TK(S \ln T + n \ln K)})$, a natural generalization of the full information guarantee, but such a bound is not a meaningful improvement compared to (2), due to the polynomial dependence on $K$ that is unavoidable for MAB in the worst case. Therefore, we consider a special case where the dependence on $K$ is much smaller: the sparse MAB problem [10]. Specifically, in this setting we make the additional assumption that all loss vectors are $s$-sparse for some $s \in [K]$, that is, $\|\ell_t\|_0 \le s$ for all $t$. It was shown in [10] that for sparse MAB the static regret is of order , exhibiting a much more favorable dependence on $K$.
Negative result.
To the best of our knowledge, there are no prior results on switching regret for sparse MAB. In light of bound (2), a natural conjecture would be that it would be possible to achieve switching regret of with switches. Perhaps surprisingly, we show that this is in fact impossible.
Theorem 6.
For any and any MAB algorithm, there exists a sequence of loss vectors that are -sparse, such that the switching regret of this algorithm is at least .
The high level idea of the proof is to force the algorithm to overfocus on one good action and thus miss an even better action later. This is similar to the construction of [15, Lemma 3] and [37, Theorem 4.1], and we defer the proof to Appendix C.1. This negative result implies that sparsity does not help improve the typical switching regret bound (2). In fact, since switching regret for MAB can be seen as a special case of the contextual bandits problem [6, 26], this result also immediately implies the following corollary, a sharp contrast compared to the positive result for the non-contextual case mentioned earlier (see Appendix C.1 for the definition of contextual bandit and related discussions).
Corollary 7.
Sparse losses do not help improve the worst-case regret for contextual bandits.
Long-term memory to the rescue.
Despite the above negative results, we next show how long-term memory can still help improve the switching regret for sparse MAB. Specifically, we use our general framework to develop a MAB algorithm whose switching regret is smaller than whenever and are small while and are large. Note that this is not a contradiction with Theorem 6, since in the construction of its proof, is as large as .
At a high level, our algorithm (Algorithm 3) works by constructing the standard unbiased importance-weighted loss estimator (Line 3) and plugging it into our general framework (Algorithm 1). However, we emphasize that it is highly nontrivial to control the variance of these estimators without leading to bad dependence on in this framework where two types of sub-routines interact with each other. To address this issue, we design specialized sub-algorithms and to learn and respectively. For learning , we essentially deploy the algorithm of [10] for sparse MAB, which is an instance of the standard follow-the-regularized-leader algorithm with a special hybrid regularizer, combining the entropy and the log-barrier (Lines 3 and 3). However, note that the loss we feed to this algorithm is not sparse and we cannot directly apply the guarantee from [10], but it turns out that one can still utilize the implicit exploration of this algorithm, as shown in our analysis. Compared to Algorithm 1, we also incorporate an extra bias term in the definition of (Line 3), which is important for canceling the large variance of the loss estimator.
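The standard importance-weighted estimator mentioned above can be illustrated in isolation. The sketch below shows that it is unbiased and that its second moment blows up for rarely sampled actions, which is the variance issue the text refers to. The function names are illustrative assumptions, and this is the textbook estimator rather than the paper's full Algorithm 3.

```python
def importance_weighted_estimate(losses, p, chosen):
    """Loss estimate from bandit feedback: only losses[chosen] is actually used."""
    return [losses[i] / p[i] if i == chosen else 0.0 for i in range(len(losses))]


def expected_estimate(losses, p):
    """Exact expectation of the estimator over the draw chosen ~ p."""
    k = len(losses)
    mean = [0.0] * k
    for chosen in range(k):
        est = importance_weighted_estimate(losses, p, chosen)
        mean = [m + p[chosen] * e for m, e in zip(mean, est)]
    return mean


def second_moment(losses, p):
    """E[estimate_i^2] = losses[i]^2 / p[i], which grows as p[i] shrinks."""
    return [l * l / q for l, q in zip(losses, p)]


# A 1-sparse loss vector over four actions and a skewed sampling distribution.
losses = [0.0, 0.0, 1.0, 0.0]
p = [0.7, 0.2, 0.01, 0.09]
print(expected_estimate(losses, p))
print(second_moment(losses, p))
```

With sparse losses only the nonzero coordinates contribute to the second moment, which is one intuition for why sparsity can reduce the dependence on the number of actions in the static-regret setting.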
For learning for each , we design a new algorithm that is an instance of the standard Online Mirror Descent algorithm (see e.g., [22]). Recall that this is a one-dimensional problem, as we are trying to learn the distribution over actions . We design a special one-dimensional regularizer , which can be seen as a one-sided log-barrier,777The usual log-barrier regularizer (see e.g. [16, 3, 36]) would be in this case. to bias towards action “1”. Technically, this provides a special “local-norm” guarantee that is critical for our analysis and may be of independent interest (see Lemma 14 in Appendix C.2). In addition, we remove the bias term in the loss for action “1” (so it is only now) as it does not help in the bandit case, and we also force to be at least for some parameter , which is important for achieving switching regret. Line 3 summarizes the update for .
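As a rough illustration of a one-dimensional mirror descent step with a one-sided log-barrier, the sketch below assumes the regularizer takes the form $\psi(z) = \frac{1}{\eta} \ln \frac{1}{z}$ on the probability $z$ of action "1", with $z$ constrained to $[\delta, 1]$. The exact regularizer and update of the paper are not recoverable from the text here, so this form, the parameter name `delta`, and the function name are all assumptions. Setting the derivative of $z h + D_\psi(z, z_t)$ to zero gives $1/z = 1/z_t + \eta h$; because the objective is convex and one-dimensional, the constrained minimizer is the clipped unconstrained one.

```python
def one_sided_log_barrier_step(z, h, eta, delta):
    """One mirror descent step for the probability z of action "1".

    z: current probability of action "1", in [delta, 1]
    h: (estimated) loss of action "1"; action "0" has loss 0
    eta: learning rate
    delta: lower bound enforced on z (hypothetical parameter name)
    """
    if not (0.0 < delta <= z <= 1.0):
        raise ValueError("need 0 < delta <= z <= 1")
    inv = 1.0 / z + eta * h
    if inv <= 0.0:
        # The objective is decreasing on the whole interval: move to the upper end.
        return 1.0
    return min(1.0, max(delta, 1.0 / inv))


print(one_sided_log_barrier_step(0.5, 1.0, eta=0.1, delta=0.01))     # small decrease
print(one_sided_log_barrier_step(0.5, -30.0, eta=0.1, delta=0.01))   # jumps to 1.0
print(one_sided_log_barrier_step(0.5, 1000.0, eta=0.1, delta=0.01))  # clipped at delta
```

Under this assumed regularizer, a step away from $z = 1$ changes $z$ by roughly $\eta h z^2$, so increases toward 1 can be fast while decreases slow down as $z$ becomes small, one way to read "bias towards action 1".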
Finally, we also enforce a small amount of uniform exploration by sampling from , a smoothed version of (Line 3). We present the main result of our algorithm below (proven in Appendix C.2).
Theorem 8.
With , Algorithm 3 ensures
[TABLE]
for any sequence of -sparse losses and any benchmark sequence such that and .
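In the case when $s$ and $n$ are constants, our bound (9) becomes $O(S^{1/3} T^{2/3} + K^3 \ln T)$, which improves over the existing bound (2) (ignoring logarithmic factors) when $(T/S)^{1/3} < K < (TS)^{1/5}$ (also recall the example in Section 1 where our bound is sublinear in $T$ while existing bounds become vacuous).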
As a final remark, one might wonder if similar best-of-both-worlds results are also possible for MAB in terms of switching regret, given the positive results for static regret [9, 33, 5, 32, 36, 38]. We point out that the answer is negative — the proof of [37, Theorem 4.1] implicitly implies that even with one switch, logarithmic regret is impossible for MAB in the stochastic setting.
5 Conclusion
In this work, we propose a simple reduction-based approach to obtaining long-term memory regret guarantee. By plugging various existing algorithms into this framework, we not only obtain new algorithms for this problem in the adversarial case, but also resolve the open problem of Warmuth and Koolen [35] that asks for a single algorithm achieving the best of both stochastic and adversarial environments in this setup. We also extend our results to the bandit setting and show both negative and positive results.
One clear open question is whether our bound for the bandit case (Theorem 8) can be improved, and more generally what is the best achievable bound in this case.
Acknowledgments.
The authors would like to thank Alekh Agarwal, Sébastien Bubeck, Dylan Foster, Wouter Koolen, Manfred Warmuth, and Chen-Yu Wei for helpful discussions. Kai Zheng and Liwei Wang were supported by National Key R&D Program of China (no. 2018YFB1402600), BJNSF (L172037). Haipeng Luo was supported by NSF Grant IIS-1755781. Ilias Diakonikolas was supported by NSF Award CCF-1652862 (CAREER) and a Sloan Research Fellowship.
Appendix A Examples of Sub-routines
In this section, we briefly discuss why the algorithms listed in Propositions 1, 2, and 4 satisfy Conditions 1, 2, and 3 respectively. We first note that except for AdaNormalHedge [27], all other algorithms satisfy even tighter bounds with the absolute value replaced by square (also see Footnote 3).
A.1 Condition 1
Prod [12] with learning rate satisfies Eq. (4) according to its original analysis. Adapt-ML-Prod [19], AdaNormalHedge [27], and iProd/Squint [25] are all parameter-free algorithms that satisfy for all ,
[TABLE]
By AM-GM inequality the square root term can be upper bounded by for any . Also the constraint in Condition 1 allows one to bound the extra term by . This leads to Eq. (4).
Finally, for completeness we present a variant of Hedge (Algorithm 4) that can be extracted from [20, 34] and that satisfies Eq. (3).
Proposition 9.
Algorithm 4 satisfies Eq. (3).
Proof.
Define where with and . The goal is to show , which implies for any , and thus Eq. (3) after rearranging. Indeed, for any we have
[TABLE]
where the first inequality uses the fact for any , the second inequality uses the fact for any , and the last equality holds since and . ∎
A.2 Condition 2
We first note that the three algorithms we include in Proposition 2 all work for an arbitrary number of actions (instead of just two actions) and the general guarantee will be in the same form of Eq. (5), (6), and (7) except that is replaced by .
Fixed-share [23] with learning rate satisfies Eq. (7) and the proof can be extracted from the proof of [6, Theorem 8.1] or [28, Theorem 2]. AdaNormalHedge.TV [27] is again a parameter-free algorithm and achieves the bound of (6) using similar tricks mentioned earlier for Condition 1.
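As a point of reference, the following is a minimal sketch of the textbook Fixed-share update (an exponential-weights step followed by mixing with the uniform distribution). It is not Algorithm 5; the function name `fixed_share`, the parameters `eta` and `alpha`, and the toy loss sequence are assumptions made for illustration.

```python
import math

def fixed_share(losses, eta, alpha):
    """Textbook Fixed-share over K actions; returns the distribution played each round."""
    K = len(losses[0])
    w = [1.0 / K] * K
    played = []
    for loss in losses:
        played.append(list(w))
        # Exponential-weights (Hedge) step on the observed loss vector.
        v = [w[i] * math.exp(-eta * loss[i]) for i in range(K)]
        z = sum(v)
        v = [x / z for x in v]
        # Share step: mix a small amount of the uniform distribution back in,
        # so that no action's weight ever falls below alpha / K.
        w = [(1.0 - alpha) * v[i] + alpha / K for i in range(K)]
    return played

# Toy sequence (illustrative): two actions whose roles flip halfway through.
T = 200
losses = [[0.0, 1.0] if t < T // 2 else [1.0, 0.0] for t in range(T)]
played = fixed_share(losses, eta=0.5, alpha=1.0 / T)
```

Because the share step keeps every weight bounded away from zero, the distribution can move to the new best action within a few rounds of the switch, which is the mechanism behind switching regret bounds.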
Finally we provide a variant of Fixed-share that satisfies Eq. (5). The pseudocode is in Algorithm 5, where we adopt the notation from Condition 2 ( for distribution, for loss, for action index) but present the general case with actions.
Proposition 10.
Algorithm 5 satisfies Eq. (5).
Proof.
We first write the algorithm as an instance of Online Mirror Descent. Let be the entropy regularizer, and be such that where represents the element-wise square. Then one can verify and , where is the Bregman divergence associated with . Now we have for any ,
[TABLE]
where the first inequality is by the generalized Pythagorean theorem, the second inequality is by the fact for all , and the last one is by the definition of and the fact for any . Rearranging then gives
[TABLE]
A benchmark sequence with switches naturally divides the sequence into intervals, and for each interval , by summing up the inequality above from to and telescoping we have
[TABLE]
Finally summing over all intervals, setting to put all weight on the corresponding competitor, and realizing finish the proof. ∎
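The proof above starts from the fact that, with the entropy regularizer, a mirror descent step over the simplex has a multiplicative closed form. The sketch below checks this numerically for a generic loss vector: it compares the closed-form update against random points of the simplex on the objective "scaled linear loss plus KL divergence to the previous iterate". The helper names and the choice of `K` and `eta` are assumptions for illustration, and the loss used here is a generic vector rather than the specific modified loss of Algorithm 5.

```python
import math
import random

def kl(p, q):
    """KL divergence, i.e. the Bregman divergence of negative entropy on the simplex."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def omd_objective(w, w_prev, loss, eta):
    """Objective minimized by one entropic mirror descent step."""
    return eta * sum(wi * li for wi, li in zip(w, loss)) + kl(w, w_prev)

def multiplicative_update(w_prev, loss, eta):
    """Closed-form minimizer: reweight by exp(-eta * loss) and normalize."""
    v = [wi * math.exp(-eta * li) for wi, li in zip(w_prev, loss)]
    z = sum(v)
    return [x / z for x in v]

def random_simplex_point(rng, k):
    x = [rng.expovariate(1.0) for _ in range(k)]
    s = sum(x)
    return [xi / s for xi in x]

rng = random.Random(0)
K, eta = 5, 0.3
w_prev = random_simplex_point(rng, K)
loss = [rng.uniform(-1.0, 1.0) for _ in range(K)]
w_star = multiplicative_update(w_prev, loss, eta)
best = omd_objective(w_star, w_prev, loss, eta)
# Gap between the objective at random simplex points and at the closed form.
gaps = [omd_objective(random_simplex_point(rng, K), w_prev, loss, eta) - best
        for _ in range(1000)]
```

Every gap equals the KL divergence from the sampled point to `w_star`, so all gaps are nonnegative up to floating-point error.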
A.3 Condition 3
To simplify notation, we use to denote the number of actions (instead of ) and prove the following
[TABLE]
which clearly implies Eq. (8). Once again since Adapt-ML-Prod [19], AdaNormalHedge [27], and iProd/Squint [25] are all parameter-free algorithms satisfying Eq. (10), they also ensure Eq. (11) for any by the same reasoning mentioned for Condition 1. Next we present a variant of Hedge (Algorithm 6) with individual learning rate for each action and prove the following.
Proposition 11.
Algorithm 6 satisfies Eq. (11).
Proof.
Define . We have
[TABLE]
where the inequality holds by the fact for any and the equality holds because and . Therefore,
[TABLE]
Solving for then proves Eq. (11). ∎
Appendix B Proofs for Section 3
In this section we provide proofs and related discussions for our algorithms under full-information feedback (i.e. the expert problem).
B.1 Proof of Theorem 3
Proof.
For each distinct action in , we first apply the static regret bound of stated in Condition 1 (either Eq. (3) or Eq. (4)). With the fact and this gives
[TABLE]
Next we apply the switching regret bound of stated in Condition 2 with if and otherwise (note that and ). This gives with and
[TABLE]
where is
[TABLE]
In either case, using the fact we have
[TABLE]
Combining this inequality with Eq. (13) and rearranging give
[TABLE]
Further combining inequalities (12) and (14) and canceling terms give
[TABLE]
Finally summing over , using the fact , , , and the choice of finish the proof. ∎
B.2 A weaker bound via weaker conditions
Condition 1 and Condition 2 require some data-dependent regret bounds. In fact, one can even relax these conditions and replace the data-dependent regret bounds with worst-case -dependent bounds, leading to a slightly weaker long-term memory guarantee. Specifically, if we replace the bounds in Condition 1 and Condition 2 by standard worst-case static and switching regret bounds
[TABLE]
respectively, then by setting in Algorithm 1 (that is, removing the bias term in the loss for ) and redoing the proof of Theorem 3 in a similar way one can verify that Eq. (15) now becomes
[TABLE]
which finally leads to
[TABLE]
via Cauchy-Schwarz inequality. Compared to our bound in Theorem 3, this leads to an extra factor.
B.3 Proof of Theorem 5
Proof.
The first step is to prove that for each distinct action , Algorithm 2 ensures
[TABLE]
The proof is similar to that of Theorem 3. We first apply the static regret bound of stated in Condition 3, which gives for any and ,
[TABLE]
Here we use the fact
[TABLE]
where is the normalization factor. Next we apply the switching regret bound of stated in Condition 2 with , if and otherwise (note that and ). This gives with ,
[TABLE]
where is
[TABLE]
In either case, using the fact and thus , we have
[TABLE]
Combining this inequality with Eq. (18) and rearranging give
[TABLE]
Further combining inequalities (17) and (19) and canceling terms give
[TABLE]
Now we pick such that
[TABLE]
which is always possible by the construction of . This proves Eq. (16).
Adversarial setting.
We simply bound by 2 in Eq. (16). The rest is the same as the proof of Theorem 3: summing over , applying Cauchy-Schwarz inequality, and using the fact , , , , prove .
Stochastic setting.
The proof is similar to that of [27] and relies solely on the adaptive bound (16). Recall that in the stochastic setting, without loss of generality we assume . For each there exists a constant gap such that for all and all such that . This implies
[TABLE]
On the other hand, we have
[TABLE]
Combining the two inequalities above with Eq. (16) and by AM-GM inequality, we know that there exists a constant such that
[TABLE]
Rearranging proves
[TABLE]
and thus
[TABLE]
Summing over finishes the proof. ∎
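The "AM-GM, then rearrange" step in the stochastic case follows a standard self-bounding pattern. A generic version is sketched below; $R$, $X$, $C$, $B$, and $\Delta$ are placeholder symbols (a regret term, a cumulative probability of suboptimal play, two nonnegative constants, and a gap), not the paper's exact quantities.

```latex
\text{Suppose } X \ge 0,\quad \Delta X \le R, \quad\text{and}\quad R \le \sqrt{C X} + B
\quad\text{with } \Delta > 0 .
\\[4pt]
\text{By AM-GM, }\;
\sqrt{C X} \;\le\; \frac{C}{2\Delta} + \frac{\Delta X}{2} \;\le\; \frac{C}{2\Delta} + \frac{R}{2},
\\[4pt]
\text{so }\; R \;\le\; \frac{C}{2\Delta} + \frac{R}{2} + B
\quad\Longrightarrow\quad
R \;\le\; \frac{C}{\Delta} + 2B .
```

The lower bound $\Delta X \le R$ comes from the gap assumption and the upper bound from the adaptive regret guarantee; combining them removes the dependence on $X$ altogether.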
B.4 A weaker best-of-both-worlds result
In this section we present a version of the “Mixing Past Posteriors” algorithm of [8, 2, 13] with a particular doubling trick and show that it also provides similar but weaker best-of-both-worlds results. As far as we know, this was previously unknown.
The pseudocode is in Algorithm 7. It is a variant of Hedge where each time the sampling distribution mixes all the past distributions. We apply a standard doubling trick to the quantity , an important data-dependent quantity that turns out to be useful for adapting to the stochastic setting (similar to the role of in Eq. (16)). Specifically the algorithm satisfies the following adaptive switching regret bound.
Theorem 12.
Algorithm 7 ensures
[TABLE]
for any loss sequence and benchmark sequence such that and . This implies that
- in the adversarial setting, we have
- in the stochastic setting (defined in Section 2), we have
Compared to our bounds in Theorem 5, one can see that the stochastic bound here is weaker in the sense that all ’s are replaced by . At a technical level, this is because this algorithm only admits an adaptive regret bound (20) over the entire horizon, instead of a bound like Eq. (16) that holds over segments with the same competitor.
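To make the idea of mixing past distributions concrete, here is a minimal sketch of the classical "uniform past" mixing scheme from the Mixing Past Posteriors line of work: after each exponential-weights step, the next distribution mixes the newest posterior with the average of all earlier ones. This is not Algorithm 7 (it has no doubling trick or restarts, and the mixing weights are an assumption); the names `mixing_past_posteriors`, `eta`, `alpha`, and the toy losses are illustrative.

```python
import math

def mixing_past_posteriors(losses, eta, alpha):
    """Hedge step followed by mixing with the uniform average of earlier posteriors."""
    K = len(losses[0])
    w = [1.0 / K] * K
    v_prev = list(w)          # v_0: the initial (uniform) posterior
    past_sum = [0.0] * K      # running sum of the archived posteriors
    past_count = 0
    played = []
    for loss in losses:
        played.append(list(w))
        # Exponential-weights step producing the newest posterior.
        v = [w[i] * math.exp(-eta * loss[i]) for i in range(K)]
        z = sum(v)
        v = [x / z for x in v]
        # Archive the previous posterior, then mix: weight (1 - alpha) on the
        # newest posterior and alpha spread uniformly over all archived ones.
        past_sum = [past_sum[i] + v_prev[i] for i in range(K)]
        past_count += 1
        w = [(1.0 - alpha) * v[i] + alpha * past_sum[i] / past_count
             for i in range(K)]
        v_prev = v
    return played

# Toy sequence (illustrative): action 0 is best, then action 1, then action 0 again.
L = 100
phase_a, phase_b = [0.0, 1.0, 1.0], [1.0, 0.0, 1.0]
losses = [phase_a] * L + [phase_b] * L + [phase_a] * L
played = mixing_past_posteriors(losses, eta=0.5, alpha=0.01)
```

At the start of the third phase, action 0 (which was good in the past) retains noticeably more weight than action 2 (which never was), so the algorithm returns to it faster than it would to a fresh action. This is the long-term memory effect in its simplest form.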
Proof.
Similar to the proof of Proposition 10, we start by writing the algorithm as an instance of Online Mirror Descent. Let be the entropy regularizer, and be such that . Then one can verify and , where is the Bregman divergence associated with . Now for any , we have
[TABLE]
Now consider a period between two resets of the algorithm that starts at time and ends at time . Let be one plus the most recent time when is the competitor (if the set is empty, is defined as ). Note that by the definition of we have
[TABLE]
Therefore, combining previous bounds we have for any ,
[TABLE]
Summing over in this period and telescoping lead to
[TABLE]
Finally suppose there are periods in total, then
[TABLE]
Note that in this case by the restart condition one must also have , which implies Eq. (20) (by dropping the lower order term for simplicity).
Adversarial setting.
Simply upper bound by .
Stochastic setting.
This is similar to the proof of Theorem 5. We make the following two observations. First, by the definition of the stochastic setting we have
[TABLE]
On the other hand, we have and thus
[TABLE]
Combining the two inequalities above with Eq. (20) and by AM-GM inequality, we know that there exists a constant such that
[TABLE]
Rearranging proves
[TABLE]
and thus the claimed regret bound. ∎
Appendix C Proofs for Section 4
In this section we provide the omitted proofs for Section 4.
C.1 Negative results
Proof of Theorem 6.
Divide the whole horizon evenly into intervals. Our goal is to show that for any algorithm , there exists a sequence of -sparse loss vectors such that the switching regret of against a benchmark with at most 2 switches on each of these intervals is at least ; this clearly implies that the overall switching regret against a benchmark with at most switches is at least .
To show this, consider a fixed interval and consider the behavior of against a fixed loss vector for the entire interval ( represents a basis vector). Let be the expected number of times that action 1 is not selected by on this interval (a fixed number conditioned on everything prior to this interval). If , then the (static) regret of against action 1 on this interval is already . Otherwise, there must exist an action such that in expectation it is selected for less than times. In this case, there must also exist a subinterval of length where in expectation action is selected for less than times. This means that with probability at least , action is not selected at all on this subinterval. If we switch the loss vector from to starting from the beginning of this subinterval, suffers expected regret against action after the switch point. In other words, in this case the switching regret of (first against and then against ) is , finishing the proof. ∎
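The two counting steps in this argument (finding a subinterval with few expected selections, and turning a small expectation into a constant probability of never selecting the action) are instances of the following generic facts. The symbols $N$, $N_j$, $m$, $M$, and $c$ are placeholders, not the specific quantities of the proof.

```latex
\text{(Averaging)}\quad
\text{If } N = \sum_{j=1}^{m} N_j \text{ with } N_j \ge 0 \text{ and } \mathbb{E}[N] < M,
\text{ then } \mathbb{E}[N_j] < \frac{M}{m} \text{ for some } j .
\\[6pt]
\text{(Markov)}\quad
\text{If } N_j \in \{0, 1, 2, \dots\} \text{ and } \mathbb{E}[N_j] \le c < 1,
\text{ then }
\Pr[N_j = 0] \;=\; 1 - \Pr[N_j \ge 1] \;\ge\; 1 - \mathbb{E}[N_j] \;\ge\; 1 - c .
```

Here $N_j$ plays the role of the number of times the chosen action is selected on the $j$-th subinterval.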
To prove Corollary 7, we first remind the reader of the contextual bandit setting [6, 26]. It is a generalization of the MAB problem where at the beginning of each round , the learner first observes a context from some arbitrary context space , and then selects an action and observes its loss . The learner is given a fixed set of policies beforehand where each policy is a mapping from to . The (static) regret of the learner against a fixed policy is now defined as
[TABLE]
The optimal regret for a finite policy class is known to be .
It is well-known that one can reduce the problem of achieving switching regret (with switches) for MAB to the problem of achieving static regret for contextual bandit. To do this, simply let and be the set of action sequences with length and switches. For a policy that corresponds to the action sequence , its output at time is simply . Comparing the regret definitions it is clear that the static regret for this contextual bandit problem exactly corresponds to the switching regret for MAB. Moreover, since the size of in this case is , a static regret of form exactly recovers the typical switching regret bound of form (2). Now it is clear that Corollary 7 is directly implied by Theorem 6.
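The size of the policy class in this reduction can be checked directly. The sketch below counts action sequences of length `T` over `K` actions with at most `S` switches (reading "switches" as "at most `S` switches" is an assumption made here), using a closed form and a brute-force enumeration on a small instance. Function names and the example values are illustrative.

```python
import math
from itertools import product

def count_sequences(T, K, S):
    """Closed form: choose s switch positions among T - 1 gaps, K choices for the
    first action, and K - 1 choices for the new action at each switch."""
    return sum(math.comb(T - 1, s) * K * (K - 1) ** s
               for s in range(min(S, T - 1) + 1))

def count_by_enumeration(T, K, S):
    """Brute-force count, feasible only for tiny T and K."""
    total = 0
    for seq in product(range(K), repeat=T):
        switches = sum(1 for a, b in zip(seq, seq[1:]) if a != b)
        if switches <= S:
            total += 1
    return total

T, K, S = 6, 3, 2
closed_form = count_sequences(T, K, S)
brute_force = count_by_enumeration(T, K, S)

# Log-size of the class for a larger instance; it grows roughly like S * ln(T * K).
log_size = math.log(count_sequences(1000, 10, 5))
```

Since a static contextual bandit bound depends on the policy class only through the logarithm of its size, this logarithmic growth in `T` and `K` per switch is what makes the reduction yield a switching regret bound.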
C.2 Proof of Theorem 8
The proof relies on the following two lemmas, which respectively state the static and switching regret guarantees for algorithm (that learns ) and algorithm (that learns ).
Lemma 13.
With , Algorithm 3 ensures for any ,
[TABLE]
Lemma 14.
For any , Line 3 of Algorithm 3 ensures
[TABLE]
for any sequence of and any competitor sequence with .
The bound in Lemma 13 resembles the one of [10] for sparse MAB, but as mentioned since is not sparse (nor can it be made sparse after shifting), it requires a different analysis. The bound in Lemma 14 contains a “local-norm” term that resembles the one achieved by Hedge in the full information setting. However, importantly this holds for any real-valued sequence of , while Hedge requires the losses to be bounded from one side. We are not able to prove the same bound with the usual log barrier regularizer (see Footnote 7) either. As far as we know this lemma is new and might be of independent interest.
Combining these two lemmas we now provide the proof for Theorem 8, followed by the proofs of these lemmas.
Proof of Theorem 8.
First note that by the definition of and one has
[TABLE]
For each distinct action , applying Lemma 13 and rearranging then lead to
[TABLE]
Next we apply Lemma 14 by setting if and otherwise, which gives
[TABLE]
Let denote the expectation conditioned on the history up to the beginning of round . It is clear that is unbiased: , and thus Also we have
[TABLE]
where the first inequality uses the fact (since it is either [math] or ), the second inequality uses the definition of , and the last one uses . Combining these with Eq. (22) and Eq. (23) gives
[TABLE]
It remains to bound
[TABLE]
which implies
[TABLE]
Summing over and using the fact and give
[TABLE]
Plugging in the parameters and proves the theorem. ∎
Proof of Lemma 13.
The proof is similar in spirit to those of [10, 11]. Define for a semi-definite matrix the associated norm for a vector as . By standard analysis of Follow-the-Regularized-Leader, we have for any ,
[TABLE]
where is some point on the segment connecting and , and is the Bregman divergence associated with . Set . One can verify and , and thus
[TABLE]
The rest of the proof consists of two steps. First, we prove that the algorithm is stable in the sense that for all and , which implies . The second step is to show . Combining these two steps finishes the proof.
First step.
To prove the stability, it suffices to show . Indeed, this is because , where represents the dimensional diagonal matrix whose -th diagonal element is , and thus implies , which further implies and thus .
To prove , define so that . We will prove for any such that , which then implies by the convexity of .
Indeed, by Taylor’s expansion, there exists some on the line segment joining and , such that
[TABLE]
where the first inequality is by the first order optimality of and the second is by Hölder’s inequality. Note that is between and , which implies and similar to previous discussions. Therefore, we have and thus
[TABLE]
Next we show , which will finish the proof for the stability.
[TABLE]
Second step.
With the stability, it is clear that . Now we show . Note that this is similar to previous calculations, but the expectation allows us to bound the term by something even smaller. Specifically, we continue from the intermediate step of the previous calculation
[TABLE]
Now we use the fact and to continue with
[TABLE]
This finishes the proof. ∎
Proof of Lemma 14.
By the definition of and first order optimality, one has
[TABLE]
which after rearranging gives
[TABLE]
Summing over , telescoping, and realizing since and are in , we arrive at
[TABLE]
It remains to prove . For notational convenience, given any , let and . If we can prove , then we finish the proof by setting and (which gives and ).
To show , note that the optimizations are one-dimensional and admit the following explicit solutions:
[TABLE]
The rest of the proof is simply to show holds in all of the nine possible cases.
- A. If , then holds trivially.
- B. If and , then and and thus
[TABLE]
- C. If and , then , , and , and thus
[TABLE]
- D. If and , then and , and thus
[TABLE]
- E. If and , then , and thus
[TABLE]
- F. If and , then and , and thus
[TABLE]
- G. If and , then , , and , and thus
[TABLE]
- H. If and , then , , , and thus
[TABLE]
- I. If , then holds trivially.
This finishes the proof. ∎
- [1] D. Adamskiy, W. M. Koolen, A. Chernov, and V. Vovk. A closer look at adaptive regret. In International Conference on Algorithmic Learning Theory, pages 290–304. Springer, 2012.
- [2] D. Adamskiy, M. K. Warmuth, and W. M. Koolen. Putting Bayes to sleep. In Advances in Neural Information Processing Systems, pages 135–143, 2012.
- [3] A. Agarwal, H. Luo, B. Neyshabur, and R. E. Schapire. Corralling a band of bandit algorithms. Conference on Learning Theory, 2017.
- [4] J.-Y. Audibert and S. Bubeck. Regret bounds and minimax policies under partial monitoring. Journal of Machine Learning Research, 11(Oct):2785–2836, 2010.
- [5] P. Auer and C.-K. Chiang. An algorithm with nearly optimal pseudo-regret for both stochastic and adversarial bandits. In Conference on Learning Theory, pages 116–120, 2016.
- [6] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.
- [7] A. Blum and Y. Mansour. From external to internal regret. Journal of Machine Learning Research, 8(Jun):1307–1324, 2007.
- [8] O. Bousquet and M. K. Warmuth. Tracking a small set of experts by mixing past posteriors. Journal of Machine Learning Research, 3(Nov):363–396, 2002.
