Thompson Sampling for Adversarial Bit Prediction
Yuval Lewi, Haim Kaplan, Yishay Mansour

TL;DR
This paper analyzes the performance of Thompson sampling in adversarial bit prediction, identifying sequences with minimal and maximal regret, and extending results to weighted error models.
Contribution
It characterizes adversarial sequences with extreme regret bounds and extends analysis to weighted false positive and false negative errors.
Findings
Sequences with alternating bits have maximal regret of O(√T).
Sequences of all ones or zeros have minimal regret of O(1).
Results extend to models with weighted false positives and negatives.
Abstract
We study the Thompson sampling algorithm in an adversarial setting, specifically, for adversarial bit prediction. We characterize the bit sequences with the smallest and largest expected regret. Among sequences of length $T$ with $k < T/2$ zeros, the sequences of largest regret consist of alternating zeros and ones followed by the remaining ones, and the sequence of smallest regret consists of ones followed by zeros. We also bound the regret of those sequences: the worst-case sequences have regret $O(\sqrt{T})$ and the best-case sequence has regret $O(1)$. We extend our results to a model where false positive and false negative errors have different weights. We characterize the sequences with largest expected regret in this generalized setting, and derive their regret bounds. We also show that there are sequences with $O(1)$ regret.
Yuval Lewi (Tel Aviv University), Haim Kaplan (Tel Aviv University and Google Research), Yishay Mansour (Tel Aviv University and Google Research)
1 Introduction
Online learning and multi-arm bandits (MAB) are among the most basic models for uncertainty, and are widely studied in machine learning. The main performance criterion used in this model is regret, which is the difference between the expected loss of the online algorithm and the loss of the best algorithm from a benchmark class (see [1, 2, 3, 4]). Bit prediction is one of the first problems for which online learning regret was analyzed [5], and it has been extensively studied throughout the years (see [6]).
Thompson sampling ([7]) is one of the earliest algorithms for MAB. It was originally motivated by a Bayesian setting, where the rewards are stochastic, and the reward of each action has a prior distribution. The algorithm maintains a posterior distribution for the reward of each action, and in each step, samples the posterior distribution of the mean reward of each action, and uses the action with the highest sampled value. In recent years, there has been a renewed interest in the Thompson sampling algorithm and its applications (see, [8]), mainly due to its simplicity and good performance in practice.
Since Thompson sampling was designed for a Bayesian setting, it is natural to analyze its Bayesian regret (i.e., average the regret with respect to the prior). In many settings, we get an elegant analysis and asymptotically optimal regret bounds. (See, [3, 4, 9]).
While Thompson sampling was designed for a Bayesian setting, it was also recently analyzed in a worst-case stochastic setting. More specifically, assume that the reward of each action is a Bernoulli random variable with unknown success probability. Unlike the Bayesian setting, there is no true prior over these parameters (success probabilities), and we want to bound the regret for the worst choice of the parameters. In this setting we start the Thompson sampling algorithm with a fictitious prior, say, a uniform distribution (of the success probability) for each action, and we update the posterior as though we were in the Bayesian setting. The works of [10, 11] show that Thompson sampling guarantees almost optimal regret bounds in the adversarial stochastic setting. Improved regret bounds which are parameter dependent are given in [12].
The papers mentioned above show the great success of Thompson sampling in stochastic settings, thus it is natural to investigate its performance in the adversarial online model. In this model TS starts with a fictitious prior and an adversary selects an arbitrary input sequence. The completely adversarial model can be viewed as bounding the regret of the worst-case sequence possible, rather than the expected regret over some distribution as in the stochastic settings. Specifically in this paper, our goal is to show that Thompson sampling is successful in the adversarial bit-sequence setting.
Our work considers the performance of Thompson sampling in an adversarial setting. Specifically, we consider the case of adversarial bit prediction, where the learner observes an arbitrary binary sequence, and at each time step predicts the next bit. The loss of the learner is the number of errors it makes, and the regret is the difference between the number of errors the online learning algorithm makes and that of the best static bit prediction, i.e., the minimum between the number of ones and zeros in the sequence. We characterize the bit sequences on which the Thompson sampling algorithm has the largest and smallest regret. We bound the regret of these sequences, and show that the worst case regret is $O(\sqrt{T})$, for a sequence of length $T$, and the best case regret is $O(1)$.
More specifically, we initialize our Thompson sampling algorithm with a uniform (i.e., $Beta(1,1)$) prior distribution, and maintain a posterior Beta distribution (whose parameters correspond to the number of ones and zeros seen so far). To predict the next bit, we draw a value from the Beta posterior and predict one if the value is larger than $1/2$. Once we observe the bit we update our posterior.
For sequences of length $T$ with $k < T/2$ zeros, we show that the sequences with the largest regret are of the form $\{01, 10\}^k 1^{T-2k}$, and the sequence with the smallest regret is $1^{T-k}0^k$ (for $k = T/2$ both sequences $1^{T/2}0^{T/2}$ and $0^{T/2}1^{T/2}$ have the same smallest regret). For example, if and , the sequences with the largest regret are and , and the sequence with the smallest regret is . For $k > T/2$, we have the same characterization with 1 and 0 interchanged. We also bound the regret of these sequences and show that the expected regret on the worst case sequences is $O(\sqrt{T})$ and that the expected regret on the best case sequences is $O(1)$.
We extend the model to have different losses for false positive and false negative errors. Specifically, we have a trade-off parameter $q \in [0,1]$ and we define the cost of a false positive to be $q$ and the cost of a false negative to be $1-q$. We call this extended model the generalized bit-prediction model. Note that for $q = 1/2$ this loss is simply the number of errors multiplied by $1/2$, so this is a strict generalization of our previous loss. Thompson sampling adapts naturally to the parameter $q$, by simply predicting one when the sampled value is larger than $q$ (rather than larger than $1/2$). We characterize for each $q$ the bit sequences with the largest regret for this model and bound their regret. For example, for sequences of length with zeros and , the worst case sequences are of the form . In general, we show a family of bit sequences with the highest regret for every trade-off parameter $q$, number of zeros and number of ones. From that we conclude that the regret of Thompson sampling in the adversarial bit-prediction model is bounded by . We also show that there are sequences with regret at most , without characterizing the best sequences.
Our work shows the great versatility of Thompson sampling. Namely, the same algorithm, with a prior of , can be analysed in Bayesian setting, when it is given the true prior, in an adversarial stochastic setting, when it is given a fictitious prior, and in the adversarial bit prediction problem, which we analyse in this work. Thompson sampling is not the only algorithm that achieves good performance both for adversarial and stochastic rewards (See, [13, 14, 15]), but it achieves this in a simple natural way, and as a side-product of a general Bayesian methodology, without trying to identify the nature of the environment.
1.1 Other related work
Adversarial bit prediction has a long history, starting with [5], and followed up by many additional works (see, [1]). The exact min-max optimal strategy can be derived, when we view the problem as a zero-sum game (see, [6]). The min-max optimal regret bound for the case of two actions was derived by [5] and for three actions by [16]. Prediction of the next character in non-binary sequences has also received considerable attention, with respect to various benchmarks [17, 18]. For the stochastic case, prediction of the next character in non-binary sequences was studied using Bayesian methods by [19]. Prediction of binary sequences with the log-loss in online adversarial environment has been studied by many due to its relation to data compression and information-theory (see for example, [20], [21] and [22]).
Adversarial online learning and multi-arm bandits have received significant attention in machine learning in the last two decades. (See the following books and surveys, [1, 2, 3, 4]). A lower bound for the adversarial MAB problem was presented by [23]. Notable results in adversarial online learning are the algorithm EXP3 (see, [24]) for adversarial bandits, the algorithm UCB1 (see, [25]) for stochastic bandits, and the regret analysis of the min-max algorithm (see, [26]).
Thompson sampling has been studied in different environments over the years. In [27] it was observed that Thompson sampling with a Gaussian prior is equivalent to "Follow the Perturbed Leader" (FPL) of [28], and that fact was used to deduce the worst case regret of Thompson sampling with Gaussian distributions. A prior-dependent analysis was introduced by [9] using information-theoretic tools, and the idea was expanded to first and second-order regret bounds by [29].
Thompson sampling also showed good experimental results (see, [30, 31]). Because of that, the algorithm is used in practice, with recommendation systems as an example (see, [32]). In Reinforcement Learning, a version of Thompson sampling called "Posterior Sampling for Reinforcement Learning" (PSRL) is used (see, [33, 34]). Bounds for the algorithm were proved in [35].
2 Model
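A bit prediction game proceeds as follows. At time $t$ the learner outputs a bit $\hat{x}_t \in \{0,1\}$. Then, the learner observes a bit $x_t \in \{0,1\}$ and suffers a loss of $|\hat{x}_t - x_t|$.
We compare the loss of the online algorithm to a benchmark, which is the loss of the best static bit prediction. Given a bit sequence $x = x_1 \ldots x_T$, let the number of ones up to $t$ be $O_t(x) = \sum_{i=1}^{t} x_i$ and the number of zeros be $Z_t(x) = t - O_t(x)$. The loss of the best static bit prediction is
$$\mathrm{static}(x) = \min\{O_T(x),\, Z_T(x)\}.$$
The goal of the learner is to minimize the regret, which is the difference between the online cumulative loss and the loss of the best static bit prediction. Specifically, for an algorithm $A$,
$$\mathrm{Regret}(A, x) = \mathbb{E}\Big[\sum_{t=1}^{T} |\hat{x}_t - x_t|\Big] - \mathrm{static}(x),$$
where $x$ is a fixed bit sequence, and the expectation is taken over the predictions of algorithm $A$. We extend the standard bit prediction game and define a generalized bit prediction game, where the false positive (FP) and false negative (FN) errors have different weights. (A false positive error is when the learner predicts $\hat{x}_t = 1$ and $x_t = 0$, and a false negative error is when $\hat{x}_t = 0$ and $x_t = 1$.) Given a trade-off parameter $q \in [0,1]$, we define a loss $\ell_q$ as follows,
$$\ell_q(\hat{x}_t, x_t) = \begin{cases} q & \text{if } \hat{x}_t = 1 \text{ and } x_t = 0, \\ 1-q & \text{if } \hat{x}_t = 0 \text{ and } x_t = 1, \\ 0 & \text{otherwise.} \end{cases}$$
Namely, the false positive errors are weighted by $q$ while the false negative errors are weighted by $1-q$. Note that for $q = 1/2$, for any $\hat{x}_t, x_t$ we have that $\ell_{1/2}(\hat{x}_t, x_t) = \frac{1}{2}|\hat{x}_t - x_t|$, so for $q = 1/2$ the extended loss is essentially the 0-1 loss.
Similarly, the benchmark for the generalized bit prediction is the best static bit prediction, namely,
$$\mathrm{static}_q(x) = \min\{q \cdot Z_T(x),\, (1-q) \cdot O_T(x)\},$$
and the regret of algorithm $A$ on a given bit sequence $x$ is
$$\mathrm{Regret}_q(A, x) = \mathbb{E}\Big[\sum_{t=1}^{T} \ell_q(\hat{x}_t, x_t)\Big] - \mathrm{static}_q(x).$$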
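As a minimal sketch of the definitions in this section, the following Python snippet evaluates the generalized loss and the static benchmark. It assumes the weighting described in the text (a false positive costs `q`, a false negative costs `1 - q`); the function names are illustrative and do not come from the paper. Note that it computes the realized regret of one fixed prediction sequence, whereas the regret analyzed in the paper is an expectation over the learner's randomness.

```python
def generalized_loss(prediction, bit, q):
    """Loss of one prediction: q for a false positive, 1 - q for a false negative."""
    if prediction == 1 and bit == 0:   # false positive
        return q
    if prediction == 0 and bit == 1:   # false negative
        return 1 - q
    return 0.0


def static_benchmark(bits, q):
    """Loss of the better of the two constant predictors (always 0 or always 1)."""
    ones = sum(bits)
    zeros = len(bits) - ones
    always_one = q * zeros          # every zero becomes a false positive
    always_zero = (1 - q) * ones    # every one becomes a false negative
    return min(always_one, always_zero)


def regret_of_predictions(predictions, bits, q):
    """Realized regret of a fixed prediction sequence against the static benchmark."""
    online = sum(generalized_loss(p, b, q) for p, b in zip(predictions, bits))
    return online - static_benchmark(bits, q)


if __name__ == "__main__":
    bits = [1, 1, 0]
    # Always predicting 0 makes two false negatives on this sequence.
    print(regret_of_predictions([0, 0, 0], bits, q=0.5))
```

With `q = 0.5` every error costs one half, which matches the remark that the generalized loss reduces to a scaled 0-1 loss in that case.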
2.1 Distributions
We use extensively the Beta distribution, denoted by $Beta(\alpha, \beta)$, where $\alpha, \beta > 0$, and the Binomial distribution, denoted by $Bin(n, p)$, where $n$ is the number of trials and $p$ is the success probability. We denote by $Br(p)$ a Bernoulli random variable with success probability $p$. For a distribution $D$, the Cumulative Distribution Function (CDF) is denoted by $F_D$.
The following identity is a well known fact related to the Beta distribution (see [36], Eq. 8.17.4).
Fact 1.
For $\alpha, \beta > 0$ and $x \in [0,1]$ we have $F_{Beta(\alpha,\beta)}(x) = 1 - F_{Beta(\beta,\alpha)}(1-x)$.
The Beta distribution is widely used in Bayesian settings to define the uncertainty over the parameter $p$ of a Bernoulli random variable $Br(p)$. The distribution $Beta(1,1)$, which is the uniform distribution over $[0,1]$, is used as the prior distribution of $p$. Given $a+b$ observations of the random variable $Br(p)$, where $a$ is the number of realizations which are 1 and $b$ is the number of realizations which are 0, the posterior distribution of $p$ is $Beta(a+1, b+1)$ (assuming the prior distribution is $Beta(1,1)$).
The following is a well known property of the CDF of the Beta distribution.
Fact 2.
[36, Eq. 8.17.20-21] For every $\alpha, \beta > 0$ and $x$ s.t. $0 \le x \le 1$, the following holds
$$F_{Beta(\alpha,\beta)}(x) = F_{Beta(\alpha+1,\beta)}(x) + \frac{x^{\alpha}(1-x)^{\beta}}{\alpha B(\alpha,\beta)}, \qquad F_{Beta(\alpha,\beta)}(x) = F_{Beta(\alpha,\beta+1)}(x) - \frac{x^{\alpha}(1-x)^{\beta}}{\beta B(\alpha,\beta)},$$
where $B(\alpha,\beta)$ is the Beta function.
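The identities cited from [36] (the symmetry of Eq. 8.17.4 and the recurrences of Eq. 8.17.20-21) can be sanity-checked with exact rational arithmetic. The sketch below is restricted to integer parameters, where the Beta CDF equals a Binomial tail; this restriction and all function names are assumptions made for illustration, not part of the paper.

```python
from fractions import Fraction
from math import comb, factorial


def beta_cdf(a, b, x):
    """CDF of Beta(a, b) at x for positive integers a, b.

    Uses the Beta-Binomial identity: P(Beta(a, b) <= x) = P(Bin(a + b - 1, x) >= a).
    """
    n = a + b - 1
    return sum(comb(n, j) * x**j * (1 - x) ** (n - j) for j in range(a, n + 1))


def beta_function(a, b):
    """B(a, b) = (a-1)! (b-1)! / (a+b-1)! for positive integers a, b."""
    return Fraction(factorial(a - 1) * factorial(b - 1), factorial(a + b - 1))


def check_identities(a, b, x):
    """Return True if the symmetry and both recurrences hold exactly at (a, b, x)."""
    base = beta_cdf(a, b, x)
    term = x**a * (1 - x) ** b / beta_function(a, b)
    symmetry = base == 1 - beta_cdf(b, a, 1 - x)
    raise_a = base == beta_cdf(a + 1, b, x) + term / a
    raise_b = base == beta_cdf(a, b + 1, x) - term / b
    return symmetry and raise_a and raise_b


if __name__ == "__main__":
    x = Fraction(1, 3)
    print(all(check_identities(a, b, x) for a in range(1, 6) for b in range(1, 6)))
```

Passing `x` as a `Fraction` keeps every comparison exact, so no floating-point tolerance is needed.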
For the analysis we use the following theorems regarding the tail of the distribution, when we fix the parameter and sum over parameters .
Theorem 3.
For every we have .
2.2 Notations
When the bit sequence can be inferred from the context, we use and rather than and .
We also define the function as .
For functions we denote iff there exist such that for every .
3 Thompson sampling for bit prediction
The Thompson sampling algorithm requires a prior distribution for its initialization. Given the observations, it updates the prior distribution to a posterior distribution. The learner samples the posterior distribution, and thresholds the sampled value at $1/2$ (for bit prediction) or at $q$ (for generalized bit prediction).
More specifically, we consider the prior distribution $Beta(1,1)$, which is the uniform distribution over $[0,1]$. Note that this prior is fictitious, and used only to initialize the Thompson sampling algorithm. At time $t$ the learner samples a value $\theta_t$ from the distribution $Beta(O_{t-1}+1, Z_{t-1}+1)$, where $O_{t-1}$ and $Z_{t-1}$ are the number of observed 1's and 0's up to time $t-1$, respectively. At time $t$ the learner predicts $\hat{x}_t = 1$ if $\theta_t > q$ and $\hat{x}_t = 0$ otherwise, where $q$ is the trade-off parameter of the loss. Then the learner observes the feedback bit $x_t$ and suffers the corresponding loss. The resulting Thompson sampling algorithm is described in Algorithm 1, and in the analysis we refer to this algorithm as $TS(q)$.
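The procedure described above can be sketched in a few lines of Python. This is an illustrative simulation, not a reproduction of Algorithm 1: it assumes a $Beta(1,1)$ prior, a posterior with parameters (ones seen + 1, zeros seen + 1), a prediction of one exactly when the sample exceeds `q`, and a loss of `q` per false positive and `1 - q` per false negative. The function name and the `seed` parameter are hypothetical.

```python
import random


def thompson_bit_prediction(bits, q=0.5, seed=0):
    """Run one randomized pass of Thompson sampling over bits; return the total loss."""
    rng = random.Random(seed)
    ones = zeros = 0
    loss = 0.0
    for x in bits:
        # Sample from the Beta posterior induced by the uniform prior.
        theta = rng.betavariate(ones + 1, zeros + 1)
        prediction = 1 if theta > q else 0
        if prediction == 1 and x == 0:      # false positive
            loss += q
        elif prediction == 0 and x == 1:    # false negative
            loss += 1 - q
        # Update the posterior counts with the revealed bit.
        ones += x
        zeros += 1 - x
    return loss


if __name__ == "__main__":
    alternating = [0, 1] * 50
    runs = [thompson_bit_prediction(alternating, q=0.5, seed=s) for s in range(200)]
    print(sum(runs) / len(runs))  # Monte Carlo estimate of the expected loss
```

Averaging over many seeds approximates the expected loss that appears in the regret definition; a single run is only one sample of it.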
In Section 4 we prove the “Swapping Lemma”, which analyses the effect of a single swap on the regret, which allows us to identify the sequences with the largest and smallest regret. In Section 5 we bound the regret of these sequences, thereby obtaining tight upper and lower bounds on the regret. Section 6 addresses the generalized bit prediction case.
4 Swapping Lemma
In this section we compare the regret of two bit sequences which differ by a single swap. This is an essential building block in our analysis of the worst case and the best case regret of the Thompson sampling algorithm.
Swap operation: Given a bit sequence , performing the swap operation at position results in a sequence that swaps and in and keeps all other bits unchanged. Formally, .
The swapping lemma compares the regret of Thompson sampling on a bit sequence with its regret on the sequence obtained from it by a single swap operation.
To illustrate the swapping lemma consider the case , so . If we had more zeros up to position t-1 then having the one earlier increases the regret. If we had more ones up to position t-1 then having zero earlier increases the regret. More precisely, for each such that , and , swapping and increases the regret. Similarly, if , and then swapping and increases the regret. In other words,
Lemma 4 (Swapping Lemma).
Fix a bit sequence . For every , such that and , we have
[TABLE]
For every , such that and , we have
[TABLE]
In addition,
[TABLE]
Proof Sketch. We consider the difference between the regret of on the bit sequence and on the bit sequence . The two bit sequences differ only at locations and . Since the benchmark of a sequence depends only on the total number of zeros and ones in the sequence, the benchmarks on and are identical, i.e., . Therefore, the difference between the regrets equals the difference between the losses at time and .
Consider time such that and . Using the insights above it is easy to show that,
[TABLE]
Using the recurrence relations in Fact 2 we show that,
[TABLE]
Since , we have
[TABLE]
and equality holds iff . The second case, where and , is similar.
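The effect of a single swap can be checked numerically for the standard case (threshold $1/2$). The sketch below computes the exact expected number of errors of Thompson sampling with a uniform prior, using the Beta-Binomial identity to evaluate the posterior tail; the helper names and the 0-indexed `swap` convention are assumptions made for this illustration. Because both sequences share the same static benchmark, a difference in expected loss is also the difference in regret.

```python
from fractions import Fraction
from math import comb


def prob_predict_one(ones, zeros):
    """P(theta > 1/2) for theta ~ Beta(ones + 1, zeros + 1), computed exactly."""
    n = ones + zeros + 1
    # Beta-Binomial identity: P(theta > 1/2) = P(Bin(n, 1/2) <= ones).
    return Fraction(sum(comb(n, j) for j in range(ones + 1)), 2**n)


def expected_loss(bits):
    """Exact expected number of errors of Thompson sampling (threshold 1/2)."""
    ones = zeros = 0
    total = Fraction(0)
    for b in bits:
        p_one = prob_predict_one(ones, zeros)
        total += p_one if b == 0 else 1 - p_one
        ones += b
        zeros += 1 - b
    return total


def swap(bits, i):
    """Return a copy of bits with positions i and i + 1 (0-indexed) exchanged."""
    out = list(bits)
    out[i], out[i + 1] = out[i + 1], out[i]
    return out


if __name__ == "__main__":
    x = [0, 0, 1, 1]
    # The history before the swapped pair is "0" (more zeros than ones),
    # so moving the one earlier should increase the expected loss.
    print(expected_loss(swap(x, 1)) - expected_loss(x))
```

When the history before the swapped pair is balanced, for example `[0, 1, 1, 0]` versus `[0, 1, 0, 1]`, the two expected losses coincide, matching the equality case of the lemma.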
5 Regret characterization for
In this section we use the swapping lemma to characterize the sequences on which Thompson sampling has the largest and smallest regret. We denote by $k$ the number of zeros in the sequence and characterize the sequences of worst and best regret for each $k$. Notice that we may assume that $k \le T/2$, since any sequence has the same regret as the sequence obtained from it by flipping each bit. Indeed, flipping every bit leaves the loss of the best static prediction unchanged, and the expected loss of Thompson sampling on the two sequences is the same (by Fact 1).
5.1 Worst-case regret
Consider bit sequences with $k$ zeros, where $k \le T/2$. We first show that among these bit sequences the ones of largest regret are of the form $\{01, 10\}^k 1^{T-2k}$. Then, we prove that the regret of each of these sequences is .
Theorem 5.
For any we have . In addition, for any we have .
Proof.
Note that for any we have . By Lemma 4 this implies that . Since we can transform to by a sequence of swap operations at certain locations , it follows that . This implies that all the sequences of the form have the same regret.
Let be a bit sequence of length with zeros such that . We show that for some , the sequence has a regret larger than .
Since , there is an index such that either or . Let be the smallest such index. Assume that . (The case of is similar.) It follows that and . Let be the minimal index such that . Such an index must exist, since there are zeros in and until index there were only zeros. Since we have . By Lemma 4, the sequence has regret higher than , i.e., .
Since there is a finite number of bit sequences of length with zeros, we get that sequences with the largest regret must be of the form . ∎
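For small lengths the characterization can be confirmed by brute force. The sketch below enumerates every sequence of length `T` with `k` zeros, computes its exact expected regret under Thompson sampling with a uniform prior and threshold $1/2$, and compares the set of maximizers with the family of $k$ blocks drawn from {01, 10} followed by ones. All function names are illustrative assumptions, and the check covers only the small cases it is run on.

```python
from fractions import Fraction
from itertools import combinations, product
from math import comb


def prob_predict_one(ones, zeros):
    """P(theta > 1/2) for theta ~ Beta(ones + 1, zeros + 1), computed exactly."""
    n = ones + zeros + 1
    return Fraction(sum(comb(n, j) for j in range(ones + 1)), 2**n)


def regret(bits):
    """Exact expected regret of Thompson sampling (threshold 1/2) on bits."""
    ones = zeros = 0
    loss = Fraction(0)
    for b in bits:
        p_one = prob_predict_one(ones, zeros)
        loss += p_one if b == 0 else 1 - p_one
        ones += b
        zeros += 1 - b
    return loss - min(ones, zeros)


def sequences_with_k_zeros(T, k):
    """Yield every length-T bit tuple containing exactly k zeros."""
    for zero_positions in combinations(range(T), k):
        yield tuple(0 if i in zero_positions else 1 for i in range(T))


def alternating_family(T, k):
    """All sequences made of k blocks from {01, 10} followed by T - 2k ones."""
    family = set()
    for blocks in product([(0, 1), (1, 0)], repeat=k):
        head = tuple(bit for block in blocks for bit in block)
        family.add(head + (1,) * (T - 2 * k))
    return family


def maximizers(T, k):
    """Set of sequences attaining the largest regret among those with k zeros."""
    regrets = {s: regret(s) for s in sequences_with_k_zeros(T, k)}
    largest = max(regrets.values())
    return {s for s, r in regrets.items() if r == largest}


if __name__ == "__main__":
    print(maximizers(6, 2) == alternating_family(6, 2))
```

Exact fractions are used so that ties between sequences in the family are detected as true equalities rather than being blurred by rounding.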
Given the above theorem, to bound the worst case regret of , we can focus on the sequence and bound .
Theorem 6.
For every and we have, .
Proof Sketch. Let , where we have: (1) for , (2) for , and (3) for . We bound the expected number of errors made by on each of these three subsets. Then, from these bounds we derive a bound on the loss and the regret. Specifically we prove the following:
1. For , and thus the probability to predict the next bit is . Therefore, the expected number of false positive errors in is
[TABLE]
2. For , and the difference between the probability to predict 0 and the probability to predict 1 is small and can be bounded. Therefore, the expected number of false negative errors in is
[TABLE]
3. The expected number of false negative errors in is shown to be
[TABLE]
where the last equality follows from Theorem 3.
Summing up the errors over , , and , and recalling that the static prediction makes errors, we bound the regret as follows
[TABLE]
Since , we have the following corollary.
Corollary 7.
For any sequence of length , the regret of is at most .
Remark 8.
Note that in fact we proved that .
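The square-root growth of the worst-case regret can be observed numerically. The following sketch computes the exact expected regret of Thompson sampling (uniform prior, threshold $1/2$) on the sequence made of `k` copies of 01 followed by ones, and prints its ratio to the square root of `k`. The helper names are assumptions made for this illustration, and the printed ratios are an empirical observation rather than a statement of the constants in the paper's bound.

```python
from fractions import Fraction
from math import comb, sqrt


def prob_predict_one(ones, zeros):
    """P(theta > 1/2) for theta ~ Beta(ones + 1, zeros + 1), computed exactly."""
    n = ones + zeros + 1
    return Fraction(sum(comb(n, j) for j in range(ones + 1)), 2**n)


def regret(bits):
    """Exact expected regret of Thompson sampling (threshold 1/2) on bits."""
    ones = zeros = 0
    loss = Fraction(0)
    for b in bits:
        p_one = prob_predict_one(ones, zeros)
        loss += p_one if b == 0 else 1 - p_one
        ones += b
        zeros += 1 - b
    return loss - min(ones, zeros)


def worst_case_regret(k, T):
    """Regret on k alternating blocks '01' followed by T - 2k ones."""
    bits = [0, 1] * k + [1] * (T - 2 * k)
    return regret(bits)


if __name__ == "__main__":
    for k in (1, 4, 16, 64):
        r = float(worst_case_regret(k, 2 * k))
        print(k, r, r / sqrt(k))
```

For the purely alternating sequence (`T = 2k`) the ratio stays bounded as `k` grows, which is consistent with the square-root regret bound discussed above.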
5.2 Best-case regret
In this subsection, we characterize the sequences with the lowest regret and bound their regret.
Theorem 9.
The bit sequence with the lowest regret of length $T$ with $k < T/2$ zeros is $1^{T-k}0^k$. For $k = T/2$, both $1^{T/2}0^{T/2}$ and $0^{T/2}1^{T/2}$ have the lowest regret.
We now bound the regret of .
Theorem 10.
For every and we have, , where .
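The bounded regret of the best-case sequence can also be checked numerically. The sketch below evaluates the exact expected regret of Thompson sampling (uniform prior, threshold $1/2$) on the sequence of `T - k` ones followed by `k` zeros. The helper names are assumptions for illustration. The observation that the value stays below 1 follows from a short argument: the expected number of errors on the leading ones is a geometric sum smaller than 1, and the errors on the trailing zeros cannot exceed `k`, which is the static benchmark when `k` is at most `T/2`.

```python
from fractions import Fraction
from math import comb


def prob_predict_one(ones, zeros):
    """P(theta > 1/2) for theta ~ Beta(ones + 1, zeros + 1), computed exactly."""
    n = ones + zeros + 1
    return Fraction(sum(comb(n, j) for j in range(ones + 1)), 2**n)


def regret(bits):
    """Exact expected regret of Thompson sampling (threshold 1/2) on bits."""
    ones = zeros = 0
    loss = Fraction(0)
    for b in bits:
        p_one = prob_predict_one(ones, zeros)
        loss += p_one if b == 0 else 1 - p_one
        ones += b
        zeros += 1 - b
    return loss - min(ones, zeros)


def best_case_regret(T, k):
    """Regret on the sequence of T - k ones followed by k zeros (k <= T / 2)."""
    bits = [1] * (T - k) + [0] * k
    return regret(bits)


if __name__ == "__main__":
    for T in (10, 40, 80):
        for k in (1, T // 4, T // 2):
            print(T, k, float(best_case_regret(T, k)))
```

Unlike the alternating sequences, the values printed here do not grow with `T`, in line with the constant regret bound for the best case.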
6 Regret characterization for a general
To get some intuition regarding this generalization to an arbitrary trade-off parameter $q$, consider the following simple example. Assume that , and thereby , and let us construct a sequence such that we cannot increase the regret by swapping any pair of consecutive bits. This sequence cannot start with a 1, since if it does then by the swapping lemma (Lemma 4) we will be able to increase the regret by swapping the first 0 with the 1 preceding it. So we must start with a 0. In general we determine bit by comparing to (i.e., ). If they are equal then the bit in position is either 0 or 1. If the bit in position is 0 since otherwise we will be able to increase the regret by swapping the first 0 following position with its preceding 1. Similarly, if the bit in position is 1 since otherwise we will be able to increase the regret by swapping the first 1 following position with its preceding 0.
It follows that the second bit could be either 0 or 1 since . If we have a 0 at position then and therefore we must continue with a 1 at position . Then we have that so we put 0 at position , and we are back in the situation where so we can choose either 0 or 1 at position . Similarly, if we place a 1 at position then we will have to continue with two 0's and then we will be free to choose at position either 0 or 1. It follows that the family of sequences of the form (where could be any prefix of or ) contains all sequences of largest regret. (We will in fact show that they all have the same regret.)
To gain some deeper intuition assume now that is a rational number and (where and do not have common divisors) and let us try to construct a sequence whose regret we cannot increase by applying the swapping lemma. Whenever we can choose any bit to position . At this point we have that and therefore is a multiple of and is a multiple of . Once we choose, say 0, then we are forced to choose a particular sequence in the following steps, until we will again have that for among these bits would be zeros and would be ones so .
The structure of this section is similar to the structure of Section 5. First, we characterize the bit sequences of largest regret. Then, we bound the regret of these sequences.
6.1 Worst-case sequences
Consider the following function that maps a bit-sequence to a set of bits
[TABLE]
where is the total number of 1s in and is the total number of 0s in .
For every sequence we define to be the largest index s.t. , where . We call a bit sequence a worst-case sequence if . We define the subsequence as the head of and denote it and the subsequence as the tail of and denote it .
To start, we characterize the tail of a worst-case sequence.
Theorem 11.
Let be a worst-case sequence. If then the tail is filled with ones. Otherwise, the tail is filled with zeros.
6.2 Worst-case regret
In this subsection we prove that all the worst-case sequences have the largest regret and prove an upper bound on this regret.
Theorem 12.
Let , s.t. is not a worst-case sequence. Then, there exists such that .
Proof.
Let . Since is not a worst-case sequence, there is an index such that (since, from Theorem 11, contains both 0's and 1's). Assume is the smallest index with this property.
Case 1 Assume and . Since we have . From the definition of it follows that and thus . By Lemma 4, the sequence has a regret larger than .
Case 2 Assume and . Since we have . From the definition of it follows that and thus . By Lemma 4, the sequence has a regret larger than . ∎
Theorem 12 implies that any sequence of largest regret is a worst-case sequence. Next we prove that all worst-case sequences of length with zeros have the same regret.
Lemma 13.
All the worst-case sequences of length with zeros have the same regret.
Let be a worst-case sequence with zeros such that for all with we have . Since by Lemma 13 all the worst-case sequences with the same number of zeros have the same regret, we can focus on bounding the regret of .
Theorem 14.
For every , and zeros we have
[TABLE]
The regret bounds for are derived from Theorem 14 using the following lemma.
Lemma 15.
For every bit sequence define . Then,
The following theorem derives the worst-case sequences regret bound for general .
Theorem 16.
For any observation sequence of length , the regret of is .
6.3 Best-case regret bound
We do not characterize the exact best-case regret sequences, but only show that there are sequences with regret at most . (Finding the best-case sequence characterization for a general trade-off parameter is harder than the previous cases. With the tools we presented, it is difficult even to compare the regrets of the bit sequences and for .)
Theorem 17.
For every and , if , then and otherwise .
Acknowledgments
This work was supported in part by the Yandex Initiative in Machine Learning and by a grant from the Israel Science Foundation (ISF).
Appendix A Beta and Binomial concentration bounds
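The following identities are well known (see, for example, [10], Fact 3 and [36], Eq. 8.17.4).
The first relates the CDFs of the Beta and the Binomial distributions. The second is a property of the Beta distribution.
Fact 18.
For positive integers $\alpha, \beta$ and $p \in [0,1]$ we have $F_{Beta(\alpha,\beta)}(p) = 1 - F_{Bin(\alpha+\beta-1,\,p)}(\alpha-1)$.
Fact 19.
For $\alpha, \beta > 0$ and $x \in [0,1]$ we have $F_{Beta(\alpha,\beta)}(x) = 1 - F_{Beta(\beta,\alpha)}(1-x)$.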
Next, we present concentration bounds and inequalities that we need for our proofs.
Fact 20.
(Gaussian Half CDF)
Let . Then .
Fact 21.
(Multiplicative Chernoff bound) [37]
Let be random variables with values of such that . Let .
1. For , .
2. For , .
Fact 22.
(Chernoff-Hoeffding) [38]
Let be random variables with common range such that . Let .
1. For all , .
2. For and , .
Appendix B Proof of bounds on sums of Beta CDFs (Theorems 3 and 25)
We present two bounds for sums of Beta CDFs. In the first subsection we prove a simple version of our bound, which appears in Theorem 3. In the second subsection we extend the result to a general .
B.1 Proof of Theorem 3
The proof is divided into two parts. First we prove a bound on a series of exponents, and then we use the Hoeffding bound to show that the exponent series is an upper bound for the sum of Beta-distribution CDFs that appears in Theorem 3.
Lemma 23.
For every , .
Proof.
Let , then
[TABLE]
We bound from below and above the exponents. For the upper bound we use the fact that and for lower bounding the exponent we consider two cases: (a) and (b) . We have,
[TABLE]
We bound the sum (2) from below using Fact 20, where , as follows
[TABLE]
For upper bounding Eq. (2) we have,
[TABLE]
The first sum of the right side of Eq. (3) is bounded, by using Fact 20 with , as follows
[TABLE]
The second sum of the right hand side of Eq. (3) is an exponential sum and bounded as follows,
[TABLE]
By combining the previous inequalities and Eq. (3) we get . ∎
See Theorem 3.
Proof.
Using Fact 18
[TABLE]
Note that when , therefore we can use the Chernoff-Hoeffding bound (Fact 22.1) to obtain
[TABLE]
where the last equality follows from Lemma 23. ∎
B.2 Proof of Theorem 25
The following subsection generalizes the proof of Theorem 3, as presented in Appendix B.1. We divide the generalized theorem version proof into two parts similarly to Appendix B.1.
Lemma 24.
For every , and we have
, 2. 2.
* .*
Proof.
1. We bound the sum as follows
[TABLE]
Using a substitution of ,
[TABLE]
We bound the exponent from below by considering two cases and . We have,
[TABLE]
Hence, we have
[TABLE]
We bound the first integral of Eq. (5) using Fact 20, where , as follows
[TABLE]
The second integral in Eq. (5) equals
[TABLE]
[TABLE]
2. We bound the sum as follows
[TABLE]
Using a substitution of ,
[TABLE]
∎
Theorem 25.
For every and we have
[TABLE]
Proof.
Using Fact 18
[TABLE]
Let and . We have and therefore, we rewrite Eq. (8) as
[TABLE]
1. First, we focus on the case of .
Consider and notice that when , which is equivalent to . Also, we note that . Using Chernoff bound (Fact 21.1) and Lemma 24.1, with , we have
[TABLE]
When we use the second form of Chernoff bound (Fact 21.2), followed by Lemma 24.2, with , to have
[TABLE]
When we can assume worst-case to get
[TABLE]
By substituting Eq. (10-12) in Eq. (9) we have
[TABLE]
Since , we have , thus
[TABLE]
2. Now, consider . Assume and therefore . Using Hoeffding bound (Fact 22.2) we get that
[TABLE]
Thus, by using Lemma 24.1, with , we have
[TABLE]
For we assume the worst-case bound to get
[TABLE]
By substituting Eq. (B.2, 14) in Eq. (9) and using Lemma 24.1, with , we have
[TABLE]
∎
Appendix C Proof of the Swapping Lemma (Lemma 4)
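We start with the following preliminary lemma that states the probability of an error of $TS(q)$ given a history.
Lemma 26.
Fix a bit sequence $x$, and let $O_{t-1}$ and $Z_{t-1}$ denote the number of ones and zeros among its first $t-1$ bits. For any $t$ we have,
$$\Pr[\hat{x}_t \neq x_t] = \begin{cases} F_{Beta(O_{t-1}+1,\,Z_{t-1}+1)}(q) & \text{if } x_t = 1, \\ 1 - F_{Beta(O_{t-1}+1,\,Z_{t-1}+1)}(q) & \text{if } x_t = 0. \end{cases}$$
Proof.
At time $t$, algorithm $TS(q)$ samples $\theta_t \sim Beta(O_{t-1}+1, Z_{t-1}+1)$, and predicts $\hat{x}_t = 1$ if $\theta_t > q$ and $\hat{x}_t = 0$ if $\theta_t \le q$. Thus, for the case of $x_t = 1$,
$$\Pr[\hat{x}_t \neq x_t] = \Pr[\theta_t \le q] = F_{Beta(O_{t-1}+1,\,Z_{t-1}+1)}(q),$$
and for the case of $x_t = 0$,
$$\Pr[\hat{x}_t \neq x_t] = \Pr[\theta_t > q] = 1 - F_{Beta(O_{t-1}+1,\,Z_{t-1}+1)}(q).$$
∎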
Now we can prove the Swapping Lemma, which compares the regret of two sequences that differ by a single swap operation.
See Lemma 4.
Proof.
We consider the difference between the regret of on the bit sequence and the bit sequence . The two bit sequences differ only at locations and . Since the benchmark of a sequence depends only on the total number of zeros and ones in the sequence, the benchmarks on and are identical, i.e., . Therefore, the difference between the regrets equals the loss difference at time and .
Consider time such that and . We have,
[TABLE]
where we used Lemma 26 for the equality before last.
By Fact 2, we have the following recurrence relations:
[TABLE]
where is the Beta function. Therefore,
[TABLE]
We now analyse the terms in Eq. (15). Since ,
[TABLE]
Thus,
[TABLE]
and equality holds iff .
The second case, where and , is similar. ∎
Appendix D Worst-case regret proofs for (Section 5.1)
Consider bit sequences with $k$ zeros, where $k \le T/2$. We first show that among these bit sequences the ones of largest regret are of the form $\{01, 10\}^k 1^{T-2k}$. Then, we prove that the regret of each of these sequences is .
See Theorem 5.
Given the above theorem, to bound the worst case regret of , we can focus on the sequence and bound .
See Theorem 6.
Proof.
Let , where we have: (1) for , (2) for , and (3) for . We bound the expected number of errors made by on each of these three subsets. Then, from these bounds we derive a bound on the loss and the regret.
The expected number of false positive errors in : Note that the only errors at times are false positive since for these ’s. For we have that , and . Hence the algorithm predicts and each with probability of and
[TABLE]
When we sum over , we have
[TABLE]
The expected number of false negative errors in : Note that the only errors at times are false negatives since . For we have , and and . By Lemma 26 and Fact 18 we have
[TABLE]
We can bound using Fact 29, in the following way
[TABLE]
Summing over we have,
[TABLE]
The expected number of false negative in : Note that the only errors at times are false negative since for these ’s. For any we have . Therefore,
[TABLE]
From Theorem 3 we have
[TABLE]
Summing up the errors over , , and we get that the total number of errors is
[TABLE]
Recall that the regret is the total loss minus the best static bit prediction. Since we assume that it is equal to
[TABLE]
∎
Appendix E Best-case regret proofs for (Section 5.2)
We show that for , the lowest regret is for the bit sequence . Then, we prove that its regret is for any .
Lemma 27.
For any , .
Proof.
Let and . We show, using Lemma 26, that for each , we have , which implies that and have the same expected loss. Since static bit prediction also has the same loss on and then they have the same regret.
For , by Fact 1, we have
[TABLE]
For we have,
[TABLE]
For we have and and thus . ∎
From this we can deduce that has the lowest regret on .
See 9
Proof.
Let be a bit sequence of length with zeros such that . We show that there is a bit sequence , that has the same regret as , and for some the sequence has regret smaller than .
Since , then either or it has a prefix of the form or , where .
First, we look at the case where . By Lemma 27, the sequence has the same regret as and by Lemma 4, the sequence has regret smaller than the regret of .
Second, assume has a prefix of (the case of is similar). We have two sub-cases: (a) If then and , . By Lemma 4, the sequence has regret lower than . (b) If , by Lemma 27, the bit sequences and have the same regret. By Lemma 4, the sequence has regret smaller than the regret of .
For , by Lemma 27, both and have the same regret. ∎
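The exchange argument can be checked exhaustively for short sequences. The sketch below is only an illustration: it assumes a uniform Beta(1, 1) prior, a prediction of 1 exactly when the posterior sample exceeds 1/2, and unit loss per error, and every function name in it is hypothetical. It enumerates all sequences of length T with k zeros (with k at most T/2) and compares the smallest regret found with the regret of the sequence of ones followed by zeros.

```python
from itertools import combinations
from math import comb


def prob_predict_one(ones, zeros):
    # Assumed rule: p ~ Beta(ones + 1, zeros + 1), predict 1 iff p > 1/2.
    # Equivalent Binomial form: P(Bin(ones + zeros + 1, 1/2) <= ones).
    n = ones + zeros + 1
    return sum(comb(n, j) for j in range(ones + 1)) / 2 ** n


def expected_regret(bits):
    # Exact expected number of errors minus the errors of the best constant bit.
    ones = zeros = 0
    loss = 0.0
    for b in bits:
        p1 = prob_predict_one(ones, zeros)
        loss += (1.0 - p1) if b == 1 else p1
        ones += b
        zeros += 1 - b
    return loss - min(ones, zeros)


def ones_then_zeros(T, k):
    # T - k ones followed by k zeros.
    return [1] * (T - k) + [0] * k


def min_regret_by_search(T, k):
    # Smallest expected regret over all sequences of length T with k zeros.
    best = None
    for zero_positions in combinations(range(T), k):
        zero_set = set(zero_positions)
        bits = [0 if t in zero_set else 1 for t in range(T)]
        r = expected_regret(bits)
        if best is None or r < best:
            best = r
    return best


T = 8
for k in range(1, T // 2 + 1):
    print(k, min_regret_by_search(T, k), expected_regret(ones_then_zeros(T, k)))
```

Ties are possible (for example, a sequence and its variant with the first two bits swapped), so the search compares regret values rather than the sequences themselves.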
We now bound the regret of .
See 10
Proof.
For we have . Thus
[TABLE]
Using Fact 19, we have
[TABLE]
This implies that the expected number of false negative errors, in steps , is
[TABLE]
For we can have at most errors so
[TABLE]
Therefore, the regret of on is bounded by
[TABLE]
∎
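The bound can also be observed numerically. The following is a minimal sketch, assuming a uniform Beta(1, 1) prior, a prediction of 1 exactly when the posterior sample exceeds 1/2, and unit loss per error; the helper names are made up for this illustration. Under these assumptions the error probability at step t of the initial run of ones is 2^(-t), so those errors sum to less than 1, while each later step on a zero costs at most 1, which the benchmark already accounts for.

```python
from math import comb


def prob_predict_one(ones, zeros):
    # Assumed rule: p ~ Beta(ones + 1, zeros + 1), predict 1 iff p > 1/2.
    # Equivalent Binomial form: P(Bin(ones + zeros + 1, 1/2) <= ones).
    n = ones + zeros + 1
    return sum(comb(n, j) for j in range(ones + 1)) / 2 ** n


def expected_regret(bits):
    # Exact expected number of errors minus the errors of the best constant bit.
    ones = zeros = 0
    loss = 0.0
    for b in bits:
        p1 = prob_predict_one(ones, zeros)
        loss += (1.0 - p1) if b == 1 else p1
        ones += b
        zeros += 1 - b
    return loss - min(ones, zeros)


def ones_then_zeros(T, k):
    # T - k ones followed by k zeros.
    return [1] * (T - k) + [0] * k


# The regret does not grow with the horizon for this family of sequences.
for T, k in [(4, 2), (50, 10), (200, 40), (200, 100)]:
    print(T, k, expected_regret(ones_then_zeros(T, k)))
```

In this sketch the printed values never exceed 1, and for balanced sequences they can even be negative, since the algorithm is not restricted to a constant prediction.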
Appendix F Worst-case regret proofs for a general (Sections 6.1 and 6.2)
Recall ,
[TABLE]
where is the total number of s in and is the total number of [math]s in . For every sequence we define to be the largest index such that , where . We call a bit sequence a worst-case sequence if . We define the subsequence as the *head* of and denote it by , and the subsequence as the *tail* of and denote it by .
To start, we bound the number of [math]s and s in the head of a worst-case sequence.
Lemma 28.
Fix a worst-case sequence and let . Then, if then and , if then and .
Proof.
The proof is by induction on . For and we have that and therefore of an empty sequence equals . Thus, as , we must place . In case of such sequence and .
By the induction hypothesis for both and we have, and .
Case 1 . Since , we have that and therefore . Since and we get that
[TABLE]
Since we can substitute in Eq. (17) and get that . Similarly by substituting in Eq. (17) we get that . The upper bound on and the lower bound on follow directly from our assumption: and .
Case 2 . Since , we have that and therefore . Since and we get that
[TABLE]
Since we can substitute in Eq. (18) and get that . Similarly by substituting in Eq. (18) we get that . The lower bound on and the upper bound on follow directly from our assumption: and . ∎
From Lemma 28 we characterize the tail of a worst-case sequence.
See Corollary 11
Proof.
Let .
Consider first the case where and assume by contradiction that is not empty and it is filled with zeros. It follows from this assumption that . By Lemma 28 we have that , and by combining this inequality with the equality we get that . On the other hand we assumed that . So by combining these upper and lower bounds on we get that and thus . This is a contradiction to the assumption that is not empty.
Consider now the case where and assume by contradiction that is not empty and it is filled with ones. It follows from this assumption that . By Lemma 28 we have that , and by combining this inequality with the equality we get that . On the other hand we assumed that . So by combining these upper and lower bounds on we get that and thus . This is a contradiction to the assumption that is not empty. ∎
Now we prove that all the worst-case sequences have the largest regret and bound it.
See Theorem 12
Proof.
Let . Since is not a worst-case sequence, there is an index such that (since contains both [math]’s and ’s). Assume is the smallest index with this property.
Case 1 Assume and . Since we have . From the definition of it follows that and thus . By Lemma 4, the sequence has a regret larger than .
Case 2 Assume and . Since we have . From the definition of it follows that and thus . By Lemma 4, the sequence has a regret larger than . ∎
Theorem 12 implies that any sequence of largest regret is a worst-case sequence. Next we prove that all worst-case sequences of length with zeros have the same regret.
See Lemma 13
Proof.
Assume by contradiction that there are two worst-case sequences such that , and . We assume further that and have the longest common prefix among all worst-case sequences of length with zeros and regret and , respectively.
Since and both have zeros then by Corollary 11 their tails are filled with the same bit. It follows that . Assume without loss of generality that is not shorter than . We claim that is not a prefix of . This follows since otherwise and cannot both have zeros.
It follows that there exists an index such that . Let be the smallest such index. Since we have that . Assume that and . Therefore, there is an index such that and . Since the tails of both sequences are filled with the same bit then this implies that and therefore since we have that .
Since we have that , and since we must have that . By Lemma 4, . It is easy to check that is still a worst-case sequence and since it has a longer common prefix with we get a contradiction to the choice of and .
The case where and is analogous. ∎
Let be a worst-case sequence with zeros such that for all with we have . Since by Lemma 13 all the worst-case sequences with the same number of zeros have the same regret, we can focus on bounding the regret of .
See Theorem 14
Proof.
We first consider the case that . We partition into the following sets (1) , (2) , and (3) . We bound the expected number of errors made by on each of these three subsets. Then, from these bounds we derive a bound on the loss and the regret.
The expected number of false positive errors in : Note that the only errors at times are false positive since for these ’s. Therefore, by Lemma 26 and Fact 18 we have
[TABLE]
By the definition of , , and therefore by Lemma 28, . Thus, . Let and . We can bound the right side of Eq. (F) as follows.
[TABLE]
We now bound the different probabilities in Eq. (F). Since is a Binomial random variable, its median is or and thereby
[TABLE]
For any , we bound by Lemma 30 as follows
[TABLE]
The probability is bounded using the previous equality,
[TABLE]
Therefore by using Eq. (22) and (F) we have
[TABLE]
By substituting Eq. (F-22,24) into (F) we get that for
[TABLE]
For we assume the worst-case to have
[TABLE]
Notice that since , by Corollary 11 there are no zeros in the tail. Thus, all the zeros of are in . We therefore use Eq. (25-26) to sum over all .
[TABLE]
The expected number of false negative errors in : Note that the only errors at times are false negative since . Therefore, by Lemma 26 and Fact 18 we have
[TABLE]
By the definition of , , and therefore by Lemma 28, . Thus, . Let and . We can continue and bound the right side of Equation (28) as follows.
[TABLE]
Note that we have analogous bounds to the previous case of , since by substituting and by and respectively in Eq. (F,F) we get Eq. (28,F). Thereby,
[TABLE]
Since contains all the zeros in we have . By using Lemma 28 we get that and thus . Therefore, .
Let . Using Eq. (30), we sum over all to obtain
[TABLE]
where the second-to-last inequality follows from substitution of .
The expected number of false negative errors in : By Corollary 11 the only errors at times are false negatives since . For any we have . Therefore,
[TABLE]
From Lemma 28, and thus . From Theorem 25 we have
[TABLE]
where the inequality follows from .
Since , the best static bit predictor is
[TABLE]
By using Eq. (F), (31) and (F), the regret is the total loss minus the best static bit prediction
[TABLE]
We now look at the regret for . In this proof, we split the calculations into and as in the prior part.
The expected number of false positive errors in : At each the expected errors are bounded in the same way as in the previous case. The only change is the size of . Notice that since , by Corollary 11 all ones of are in . By the definition of , , and therefore by Lemma 28, and thus . From Lemma 28 we also conclude that. In total, . Thereby, the expected number of errors in is bounded by .
The expected number of false negative errors in : At each the expected errors are bounded in the same way as in the previous case. The only change is the size of , which equals since, by Corollary 11, all the ones of are in . Thus the expected number of errors is bounded by .
The expected number of false negative errors in : By Corollary 11 the only errors at times are false negatives since . For any we have . Therefore,
[TABLE]
From Lemma 28, and thus . From Theorem 25, since , we have
[TABLE]
Since , the best static bit predictor is
[TABLE]
Hence, the regret in the case is
[TABLE]
∎
See Lemma 15
Proof.
Fix and a bit sequence . We show that . At each step , . Therefore by Fact 1 we have
[TABLE]
The benchmarks are also the same, as
[TABLE]
We conclude that . ∎
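The proof above equates the per-step losses and the benchmarks of two runs. One symmetry of this kind can be checked numerically, with the caveat that the sketch below fixes its own weighted model, which may differ in notation from the paper's. It assumes that the algorithm samples p from Beta(ones + 1, zeros + 1) and predicts 1 when p exceeds a threshold `theta`, that a false positive costs `theta` and a false negative costs `1 - theta`, and that the benchmark is the cheaper of the two constant predictions. Under these assumptions, running with threshold `theta` on a sequence has the same expected regret as running with threshold `1 - theta` on the complemented sequence. All names are hypothetical.

```python
from math import comb


def prob_predict_one(ones, zeros, theta):
    # Assumed rule: p ~ Beta(ones + 1, zeros + 1), predict 1 iff p > theta.
    # Equivalent Binomial form: P(Bin(ones + zeros + 1, theta) <= ones).
    n = ones + zeros + 1
    return sum(comb(n, j) * theta ** j * (1.0 - theta) ** (n - j)
               for j in range(ones + 1))


def weighted_regret(bits, theta):
    # Assumed costs: false positive = theta, false negative = 1 - theta.
    ones = zeros = 0
    loss = 0.0
    for b in bits:
        p1 = prob_predict_one(ones, zeros, theta)
        if b == 1:
            loss += (1.0 - theta) * (1.0 - p1)  # predicted 0 on a one
            ones += 1
        else:
            loss += theta * p1                  # predicted 1 on a zero
            zeros += 1
    # Best constant prediction: always 1 pays theta per zero,
    # always 0 pays (1 - theta) per one.
    benchmark = min(theta * zeros, (1.0 - theta) * ones)
    return loss - benchmark


def complement(bits):
    # Flip every bit of the sequence.
    return [1 - b for b in bits]


bits = [0, 1, 1, 0, 1, 1, 1, 0, 1, 1]
print(weighted_regret(bits, 0.3), weighted_regret(complement(bits), 0.7))
```

With `theta = 0.5` both costs equal 1/2, so this weighted regret is half of the unit-loss regret of the symmetric setting.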
See 16
Proof.
Assume . From Theorem 12 the bit sequences that generate the largest regret, with zeros, are worst-case sequences. Theorem 14 shows that the regret of these bit sequences is
[TABLE]
Thus, the worst-case regret over all ’s is
[TABLE]
For , Lemma 15 with Theorem 14 gives us the same regret of . ∎
Appendix G Best-case regret proofs for a general (Section 6.3)
See 17
Proof.
First we calculate the loss of . For we have . Thus, by using Lemma 26,
[TABLE]
Using Fact 19, we have
[TABLE]
This implies that the expected number of false negative errors, in steps , is
[TABLE]
For we can have at most errors so
[TABLE]
Therefore, the expected loss of on is bounded by
[TABLE]
Analogously, we bound the expected loss of on by
[TABLE]
The benchmark of the two sequences is the same and equals
[TABLE]
Therefore, if then by Eq. (G)
[TABLE]
Otherwise and by Eq. (34)
[TABLE]
∎
Appendix H Binomial coefficient approximations
We use the following well-known approximation of the binomial coefficient, which follows from Stirling’s approximation (see, for example, [39]).
Fact 29.
For every and we have
[TABLE]
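For readers who want to experiment with bounds of this type, the sketch below checks one standard Stirling-type pair of bounds numerically. This particular form is an assumption of the illustration and may differ from the statement of Fact 29: for 0 < k < n and a = k/n, the coefficient C(n, k) lies between 2^(n H(a)) sqrt(n / (8 k (n - k))) and 2^(n H(a)) sqrt(n / (2 pi k (n - k))), where H is the binary entropy function. The function names are hypothetical.

```python
from math import comb, log2, pi, sqrt


def binary_entropy(a):
    # Binary entropy in bits, for 0 < a < 1.
    return -a * log2(a) - (1.0 - a) * log2(1.0 - a)


def stirling_bounds(n, k):
    # Lower and upper Stirling-type bounds on C(n, k), valid for 0 < k < n.
    core = 2.0 ** (n * binary_entropy(k / n))
    lower = core * sqrt(n / (8.0 * k * (n - k)))
    upper = core * sqrt(n / (2.0 * pi * k * (n - k)))
    return lower, upper


def bounds_hold(n, k, rel_tol=1e-9):
    # The tolerance absorbs floating-point rounding (n = 2, k = 1 is tight).
    lower, upper = stirling_bounds(n, k)
    exact = comb(n, k)
    return lower * (1.0 - rel_tol) <= exact <= upper * (1.0 + rel_tol)


print(stirling_bounds(20, 10), comb(20, 10))
print(all(bounds_hold(n, k) for n in range(2, 81) for k in range(1, n)))
```

The relative gap between the two bounds is a constant factor, sqrt(8 / (2 pi)), which is why such bounds are convenient for arguments that only need the order of magnitude of a central binomial probability.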
From the fact above we conclude the following lemma.
Lemma 30.
For every constant and , we have
[TABLE]
Proof.
Let . We bound using Fact 29 as follows
[TABLE]
From the definition of floor and therefore
[TABLE]
Since we have
[TABLE]
Since we get that and therefore . Thus, by using Eq. (H),
[TABLE]
We bound as follows
[TABLE]
where the last inequality holds as is a monotonically increasing function and, since , the function has a minimum at .
[TABLE]
∎
References
- [1] Nicolò Cesa-Bianchi and Gabor Lugosi. Prediction, learning, and games. Cambridge University Press, 2006.
- [2] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
- [3] Tor Lattimore and Csaba Szepesvári. Bandit algorithms. http://downloads.tor-lattimore.com/banditbook/book.pdf, 2019.
- [4] Aleksandrs Slivkins. Introduction to multi-armed bandits. arXiv preprint arXiv:1904.07272, 2019.
- [5] T.M. Cover. Behavior of sequential predictors of binary sequences. In Transactions of the Fourth Prague Conference on Information Theory, 1966.
- [6] Alexander Rakhlin and Karthik Sridharan. Statistical learning and sequential prediction. http://www.mit.edu/~rakhlin/courses/stat928/stat928_notes.pdf, 2014.
- [7] William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3–4):285–294, 1933.
- [8] Daniel J Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, Zheng Wen, et al. A tutorial on Thompson sampling. Foundations and Trends® in Machine Learning, 11(1):1–96, 2018.
