Thompson Sampling for Adversarial Bit Prediction

Yuval Lewi; Haim Kaplan; Yishay Mansour

arXiv:1906.09059·cs.LG·January 1, 2020


TL;DR

This paper analyzes the performance of Thompson sampling in adversarial bit prediction, identifying sequences with minimal and maximal regret, and extending results to weighted error models.

Contribution

It characterizes adversarial sequences with extreme regret bounds and extends analysis to weighted false positive and false negative errors.

Findings

1. Sequences of alternating zeros and ones followed by the remaining ones have maximal regret of O(√T).
2. Sequences of ones followed by zeros have minimal regret of O(1).
3. Results extend to models with weighted false positives and negatives.

Abstract

We study the Thompson sampling algorithm in an adversarial setting, specifically, for adversarial bit prediction. We characterize the bit sequences with the smallest and largest expected regret. Among sequences of length $T$ with $k < \frac{T}{2}$ zeros, the sequences of largest regret consist of alternating zeros and ones followed by the remaining ones, and the sequence of smallest regret consists of ones followed by zeros. We also bound the regret of those sequences: the worst case sequences have regret $O(\sqrt{T})$ and the best case sequence has regret $O(1)$. We extend our results to a model where false positive and false negative errors have different weights. We characterize the sequences with largest expected regret in this generalized setting, and derive their regret bounds. We also show that there are sequences with $O(1)$ regret.

Equations

$$static(\Gamma)=\min\left\{\sum_{t=1}^{T}\ell(1,\gamma_{t}),\ \sum_{t=1}^{T}\ell(0,\gamma_{t})\right\}=\min\{Z_{T}(\Gamma),O_{T}(\Gamma)\}.$$

$$Regret_{A}(\Gamma)=\sum_{t=1}^{T}\mathbb{E}_{\hat{\gamma}_{t}\sim A}\left[\ell(\hat{\gamma}_{t},\gamma_{t})\mid\Gamma\right]-static(\Gamma),$$

$$\ell^{q}(\hat{\gamma}_{t},\gamma_{t})$$

$$static^{q}(\Gamma)=\min\left\{\sum_{t=1}^{T}\ell^{q}(1,\gamma_{t}),\ \sum_{t=1}^{T}\ell^{q}(0,\gamma_{t})\right\}=\min\{qZ_{T}(\Gamma),(1-q)O_{T}(\Gamma)\},$$

$$Regret^{q}_{A}(\Gamma)=\sum_{t=1}^{T}\mathbb{E}_{\hat{\gamma}_{t}\sim A}\left[\ell^{q}(\hat{\gamma}_{t},\gamma_{t})\mid\Gamma\right]-static^{q}(\Gamma).$$

$$F_{\beta(a+1,b)}(x)=F_{\beta(a,b)}(x)-\frac{x^{a}(1-x)^{b}}{aB(a,b)}\quad\text{and}\quad F_{\beta(a,b+1)}(x)=F_{\beta(a,b)}(x)+\frac{x^{a}(1-x)^{b}}{bB(a,b)}$$

$$Regret^{q}_{TS(q)}(\Gamma)<Regret^{q}_{TS(q)}(Swap(\Gamma,t))\iff\frac{q}{1-q}>\frac{O_{t-1}+1}{Z_{t-1}+1}.$$

$$Regret^{q}_{TS(q)}(\Gamma)<Regret^{q}_{TS(q)}(Swap(\Gamma,t))\iff\frac{q}{1-q}<\frac{O_{t-1}+1}{Z_{t-1}+1}.$$

$$Regret^{q}_{TS(q)}(\Gamma)=Regret^{q}_{TS(q)}(Swap(\Gamma,t))\iff\frac{q}{1-q}=\frac{O_{t-1}+1}{Z_{t-1}+1}.$$

$$Regret^{q}_{TS(q)}(\Gamma)-Regret^{q}_{TS(q)}(Swap(\Gamma,t))=(1-q)F_{\beta(O_{t-1}+1,Z_{t-1}+2)}(q)+qF_{\beta(O_{t-1}+2,Z_{t-1}+1)}(q)-F_{\beta(O_{t-1}+1,Z_{t-1}+1)}(q),$$

$$Regret^{q}_{TS(q)}(\Gamma)-Regret^{q}_{TS(q)}(Swap(\Gamma,t))=\frac{q^{O_{t-1}+1}(1-q)^{Z_{t-1}+1}}{B(O_{t-1}+1,Z_{t-1}+1)}\left(\frac{1-q}{Z_{t-1}+1}-\frac{q}{O_{t-1}+1}\right),$$

$$Regret^{q}_{TS(q)}(\Gamma)<Regret^{q}_{TS(q)}(Swap(\Gamma,t))\iff\frac{q}{1-q}>\frac{O_{t-1}+1}{Z_{t-1}+1},$$

$$\sum_{i=1}^{k}\mathbb{E}\left[\mathbb{I}\{\hat{\gamma}_{2i-1}\neq w_{2i-1}\}\mid W_{T}^{k}\right]=\frac{k}{2}.$$

$$\sum_{i=1}^{k}\mathbb{E}\left[\mathbb{I}\{\hat{\gamma}_{2i}\neq w_{2i}\}\mid W_{T}^{k}\right]=\frac{k}{2}+\Theta(\sqrt{k}).$$

$$\sum_{t=2k+1}^{T}\mathbb{E}\left[\mathbb{I}\{\hat{\gamma}_{t}\neq w_{t}\}\mid W_{T}^{k}\right]=\sum_{t=2k+1}^{T}F_{\beta(t-k,k+1)}\left(\tfrac{1}{2}\right)=O(\sqrt{k}),$$

$$\sum_{t=1}^{T}\mathbb{E}\left[\mathbb{I}\{\hat{\gamma}_{t}\neq w_{t}\}\mid W_{T}^{k}\right]-\min\{T-k,k\}=\frac{k}{2}+\left(\frac{k}{2}+\Theta(\sqrt{k})\right)+O(\sqrt{k})-k=\Theta(\sqrt{k}).$$

$$\forall\Phi\in\{0,1\}^{*}:\ H^{q}(\Phi)=\begin{cases}\{0\}&\frac{O(\Phi)+1}{Z(\Phi)+1}>\frac{q}{1-q}\\ \{1\}&\frac{O(\Phi)+1}{Z(\Phi)+1}<\frac{q}{1-q}\\ \{0,1\}&\frac{O(\Phi)+1}{Z(\Phi)+1}=\frac{q}{1-q}\end{cases}$$

$$Regret^{q}_{TS(q)}(W_{T}^{k})=\begin{cases}O(\sqrt{qk})&k\leq(1-q)T-q\\ O(\sqrt{(1-q)(T-k)})&k>(1-q)T-q\end{cases}.$$

$$\sum_{i=n+1}^{\infty}e^{-\frac{(i-(n+1))^{2}}{2(i+n+1)}}=\sum_{j=0}^{\infty}e^{-\frac{j^{2}}{2(j+2(n+1))}}.$$

$$\frac{j^{2}}{4(n+1)}\geq\frac{j^{2}}{2(j+2(n+1))}\geq\begin{cases}\frac{j^{2}}{8(n+1)}&2(n+1)\geq j\geq 0\\ \frac{j}{4}&j>2(n+1)\end{cases}.$$

$$\sum_{j=0}^{\infty}e^{-\frac{j^{2}}{2(j+2(n+1))}}\geq\sum_{j=0}^{\infty}e^{-\frac{j^{2}}{4(n+1)}}\geq\sqrt{4\pi(n+1)}\cdot\frac{1}{\sqrt{4\pi(n+1)}}\int_{0}^{\infty}e^{-\frac{x^{2}}{4(n+1)}}dx=\sqrt{\pi(n+1)}.$$

$$\sum_{j=0}^{\infty}e^{-\frac{j^{2}}{2(j+2(n+1))}}\leq\sum_{j=0}^{2(n+1)}e^{-\frac{j^{2}}{8(n+1)}}+\sum_{j=2(n+1)+1}^{\infty}e^{-\frac{j}{4}}.$$

$$\sum_{j=0}^{2(n+1)}e^{-\frac{j^{2}}{8(n+1)}}\leq 1+\int_{0}^{2(n+1)}e^{-\frac{x^{2}}{8(n+1)}}dx\leq 1+\sqrt{2\pi(n+1)}.$$

$$\sum_{j=2(n+1)+1}^{\infty}e^{-\frac{j}{4}}=\frac{1}{1-e^{-\frac{1}{4}}}-\frac{1-\left(e^{-\frac{1}{4}}\right)^{2n+3}}{1-e^{-\frac{1}{4}}}\leq\frac{1}{1-e^{-\frac{1}{4}}}.$$

$$\sum_{i=n+1}^{\infty}F_{\beta(i+1,n+1)}\left(\tfrac{1}{2}\right)=\sum_{i=n+1}^{\infty}\left(1-\Pr_{x_{j}\sim Ber(\frac{1}{2})}\left(\sum_{j=1}^{i+n+1}x_{j}\leq i\right)\right)\leq\sum_{i=n+1}^{\infty}\Pr_{x_{j}\sim Ber(\frac{1}{2})}\left(\sum_{j=1}^{i+n+1}x_{j}-\frac{i+n+1}{2}\geq\frac{i-(n+1)}{2}\right).$$

$$\sum_{i=n+1}^{\infty}F_{\beta(i+1,n+1)}\left(\tfrac{1}{2}\right)\leq 2\sum_{i=n+1}^{\infty}e^{-\frac{(i-(n+1))^{2}}{2(i+n+1)}}=\Theta(\sqrt{n}).$$

$$\sum_{i=\lceil\frac{p}{1-p}(n+1)\rceil+1}^{\infty}e^{-\frac{((1-p)i-p(n+1))^{2}}{a(i+n+1)}}\leq\int_{\frac{p}{1-p}(n+1)}^{\infty}e^{-\frac{((1-p)x-p(n+1))^{2}}{a(x+n+1)}}dx.$$

$$\int_{\frac{p}{1-p}(n+1)}^{\infty}e^{-\frac{((1-p)x-p(n+1))^{2}}{a(x+n+1)}}dx\leq\frac{1}{1-p}\int_{0}^{\infty}e^{-\frac{y^{2}}{a\left(\frac{y+p(n+1)}{1-p}+n+1\right)}}dy=\frac{1}{1-p}\int_{0}^{\infty}e^{-\frac{1-p}{a(y+n+1)}y^{2}}dy.$$

$$\frac{1-p}{a(y+n+1)}y^{2}\geq\begin{cases}\frac{1-p}{2a(n+1)}y^{2}&n+1\geq y\geq 0\\ \frac{1-p}{2a}y&y>n+1\end{cases}.$$


Full text

Thompson Sampling for Adversarial Bit Prediction

Yuval Lewi (Tel Aviv University), Haim Kaplan (Tel Aviv University and Google Research), Yishay Mansour (Tel Aviv University and Google Research)

Abstract

We study the Thompson sampling algorithm in an adversarial setting, specifically, for adversarial bit prediction. We characterize the bit sequences with the smallest and largest expected regret. Among sequences of length $T$ with $k<\frac{T}{2}$ zeros, the sequences of largest regret consist of alternating zeros and ones followed by the remaining ones, and the sequence of smallest regret consists of ones followed by zeros. We also bound the regret of those sequences: the worst case sequences have regret $O(\sqrt{T})$ and the best case sequence has regret $O(1)$.

We extend our results to a model where false positive and false negative errors have different weights. We characterize the sequences with largest expected regret in this generalized setting, and derive their regret bounds. We also show that there are sequences with $O(1)$ regret.

1 Introduction

Online learning and multi-arm bandits (MAB) are among the most basic models for uncertainty and are widely studied in machine learning. The main performance criterion in these models is regret, which is the difference between the expected loss of the online algorithm and the loss of the best algorithm from a benchmark class (see, [1, 2, 3, 4]). Bit prediction is one of the first problems for which online learning regret was analyzed [5], and it has been extensively studied over the years (see, [6]).

Thompson sampling ([7]) is one of the earliest algorithms for MAB. It was originally motivated by a Bayesian setting, where the rewards are stochastic, and the reward of each action has a prior distribution. The algorithm maintains a posterior distribution for the reward of each action, and in each step, samples the posterior distribution of the mean reward of each action, and uses the action with the highest sampled value. In recent years, there has been a renewed interest in the Thompson sampling algorithm and its applications (see, [8]), mainly due to its simplicity and good performance in practice.

Since Thompson sampling was designed for a Bayesian setting, it is natural to analyze its Bayesian regret (i.e., average the regret with respect to the prior). In many settings, we get an elegant analysis and asymptotically optimal regret bounds. (See, [3, 4, 9]).

While Thompson sampling was designed for a Bayesian setting, it was also recently analyzed in the worst-case stochastic setting. More specifically, assume that the reward of each action is a Bernoulli random variable with unknown success probability. Unlike the Bayesian setting, there is no true prior over these parameters (success probabilities), and we want to bound the regret for the worst choice of the parameters. In this setting we start the Thompson sampling algorithm with a fictitious prior, say, a uniform distribution (of the success probability) for each action, and we update the posterior as though we were in the Bayesian setting. The works of [10, 11] show that Thompson sampling guarantees almost optimal regret bounds in the adversarial stochastic setting. Improved regret bounds which are parameter dependent are given in [12].

The papers mentioned above show the great success of Thompson sampling in stochastic settings, so it is natural to investigate its performance in the adversarial online model. In this model TS starts with a fictitious prior and an adversary selects an arbitrary input sequence. The completely adversarial model can be viewed as bounding the regret of the worst-case sequence, rather than the expected regret over some distribution as in the stochastic settings. Specifically, in this paper our goal is to show that Thompson sampling is successful in the adversarial bit prediction setting.

Our work considers the performance of Thompson sampling in an adversarial setting. Specifically, we consider the case of adversarial bit prediction, where the learner observes an arbitrary binary sequence, and at each time step predicts the next bit. The loss of the learner is the number of errors it makes, and the regret is the difference between the number of errors the online learner makes and the number of errors of the best static bit prediction, i.e., the minimum between the number of ones and the number of zeros in the sequence. We characterize the bit sequences on which the Thompson sampling algorithm has the largest and smallest regret. We bound the regret of these sequences, and show that the worst case regret is $\Theta(\sqrt{T})$ for a sequence of length $T$, and that the best case regret is $\Theta(1)$.

More specifically, we initialize our Thompson sampling algorithm with a uniform (i.e., $\beta(1,1)$ ) prior distribution, and maintain a posterior beta distribution (whose parameters correspond to the number of ones and zeros seen so far). To predict the next bit, we draw a value from the beta posterior and predict one if the value is larger than $\frac{1}{2}$ . Once we observe the bit we update our posterior.

For sequences of length $T$ with $k\leq\frac{T}{2}$ zeros, we show that the sequences with the largest regret are of the form $\{01,10\}^{k}1^{T-2k}$, and the sequence with the smallest regret is $1^{T-k}0^{k}$ (for $k=\frac{T}{2}$ both sequences $1^{T/2}0^{T/2}$ and $0^{T/2}1^{T/2}$ have the same smallest regret). For example, if $k=2$ and $T=7$, the sequences with the largest regret are $0101111,0110111,1001111$ and $1010111$, and the sequence with the smallest regret is $1111100$. For $k>\frac{T}{2}$, we have the same characterization with $1$ and $0$ interchanged. We also bound the regret of these sequences and show that the expected regret on the worst case sequences is $\Theta(\sqrt{T})$ and that the expected regret on the best case sequences is $\Theta(1)$.
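The $k=2$, $T=7$ example can be checked by brute force. The sketch below is our own illustration and is not code from the paper. It computes the exact expected regret of $TS(\frac{1}{2})$, counting each error as 1, with rational arithmetic. It relies on the standard identity $F_{\beta(a,b)}(x)=\Pr[Bin(a+b-1,x)\geq a]$ for positive integers $a,b$. The helper names `beta_cdf` and `ts_expected_regret` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def beta_cdf(a: int, b: int, x: Fraction) -> Fraction:
    """Exact F_{beta(a,b)}(x) for positive integers a, b via Pr[Bin(a+b-1, x) >= a]."""
    n = a + b - 1
    p, d = x.numerator, x.denominator
    total = sum(comb(n, j) * p**j * (d - p) ** (n - j) for j in range(a, n + 1))
    return Fraction(total, d**n)


def ts_expected_regret(bits: str) -> Fraction:
    """Exact expected number of errors of TS(1/2) minus the errors of the best static bit."""
    half = Fraction(1, 2)
    ones = zeros = 0
    errors = Fraction(0)
    for ch in bits:
        # Posterior is beta(O_{t-1}+1, Z_{t-1}+1); TS predicts 0 iff its sample is <= 1/2.
        p_zero = beta_cdf(ones + 1, zeros + 1, half)
        if ch == "1":
            errors += p_zero          # false negative
            ones += 1
        else:
            errors += 1 - p_zero      # false positive
            zeros += 1
    return errors - min(ones, zeros)


T, k = 7, 2
sequences = ["".join("0" if i in pos else "1" for i in range(T)) for pos in combinations(range(T), k)]
regrets = {s: ts_expected_regret(s) for s in sequences}
largest = max(regrets.values())
worst = sorted(s for s, r in regrets.items() if r == largest)
print("largest regret:", worst)                            # the text predicts the four {01,10}^2 1^3 sequences
print("smallest regret:", min(regrets, key=regrets.get))   # the text predicts 1111100
```

Exact fractions are used here so that the ties between the four worst sequences show up as exact equalities rather than floating-point near-misses.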

We extend the model to have different losses for false positive and false negative errors. Specifically, we have a trade-off parameter $q\in[0,1]$ and we define the cost of a false positive to be $q$ and the cost of a false negative to be $1-q$. We call this extended model the generalized bit-prediction model. Note that for $q=\frac{1}{2}$ this loss is simply the number of errors multiplied by $\frac{1}{2}$, so this is a strict generalization of our previous loss. Thompson sampling adapts naturally to the parameter $q$, by simply predicting one when the sampled value is larger than $q$ (rather than larger than $\frac{1}{2}$). We characterize for each $q\in[0,1]$ the bit sequences with the largest regret for this model and bound their regret. For example, for sequences of length $T=100$ with $20$ zeros and $q=\frac{1}{3}$, the worst case sequences are of the form $\{010,001\}^{10}1^{70}$. In general, we show a family of bit sequences with the highest regret for every trade-off parameter $q\in[0,1]$, number of zeros and number of ones. From that we conclude that the regret of Thompson sampling in the adversarial bit-prediction model is bounded by $O(\sqrt{q(1-q)T})$. We also show that there are sequences with regret at most $1$, without characterizing the best sequences.
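A similar exact computation illustrates the $T=100$, $q=\frac{1}{3}$ example. This is a hedged sketch with hypothetical helper names and exact rational arithmetic, and it does not search all sequences. It checks two things only. First, several members of $\{010,001\}^{10}1^{70}$ have identical $Regret^{q}_{TS(q)}$. Second, moving the first one to the front, which leaves the family, strictly lowers the regret.

```python
from fractions import Fraction
from math import comb


def beta_cdf(a: int, b: int, x: Fraction) -> Fraction:
    """Exact F_{beta(a,b)}(x) for positive integers a, b via Pr[Bin(a+b-1, x) >= a]."""
    n = a + b - 1
    p, d = x.numerator, x.denominator
    return Fraction(sum(comb(n, j) * p**j * (d - p) ** (n - j) for j in range(a, n + 1)), d**n)


def ts_expected_regret_q(bits: str, q: Fraction) -> Fraction:
    """Exact Regret^q of TS(q): false positives cost q, false negatives cost 1 - q."""
    ones = zeros = 0
    loss = Fraction(0)
    for ch in bits:
        p_zero = beta_cdf(ones + 1, zeros + 1, q)  # Pr[x_t <= q] = Pr[predict 0]
        if ch == "1":
            loss += (1 - q) * p_zero
            ones += 1
        else:
            loss += q * (1 - p_zero)
            zeros += 1
    return loss - min(q * zeros, (1 - q) * ones)   # subtract static^q


q = Fraction(1, 3)
family = [prefix + "1" * 70 for prefix in ("010" * 10, "001" * 10, "010001" * 5, "001010" * 5)]
values = [ts_expected_regret_q(s, q) for s in family]
outside = "100" + "010" * 9 + "1" * 70   # swaps the first two bits of "010"*10 + "1"*70
print([float(v) for v in values], float(ts_expected_regret_q(outside, q)))
```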

Our work shows the great versatility of Thompson sampling. Namely, the same algorithm, with a prior of $\beta(1,1)$, can be analysed in the Bayesian setting, when it is given the true prior, in the adversarial stochastic setting, when it is given a fictitious prior, and in the adversarial bit prediction problem, which we analyse in this work. Thompson sampling is not the only algorithm that achieves good performance both for adversarial and stochastic rewards (see, [13, 14, 15]), but it achieves this in a simple, natural way, and as a by-product of a general Bayesian methodology, without trying to identify the nature of the environment.

1.1 Other related work

Adversarial bit prediction has a long history, starting with [5], and followed up by many additional works (see, [1]). The exact min-max optimal strategy can be derived when we view the problem as a zero-sum game (see, [6]). The min-max optimal regret bound for the case of two actions was derived by [5] and for three actions by [16]. Prediction of the next character in non-binary sequences has also received considerable attention, with respect to various benchmarks [17, 18]. For the stochastic case, prediction of the next character in non-binary sequences was studied using Bayesian methods by [19]. Prediction of binary sequences with the log-loss in the online adversarial environment has been studied extensively due to its relation to data compression and information theory (see for example, [20], [21] and [22]).

Adversarial online learning and multi-arm bandits have received significant attention in machine learning in the last two decades (see the following books and surveys, [1, 2, 3, 4]). A lower bound for the adversarial MAB problem was presented by [23]. Notable results include the EXP3 algorithm (see, [24]) for adversarial bandits, the UCB1 algorithm (see, [25]) for stochastic bandits, and the regret analysis of the min-max algorithm (see, [26]).

Thompson sampling has been studied in different environments over the years. In [27] it was observed that Thompson sampling with a Gaussian prior is equivalent to "Follow the Perturbed Leader" (FPL) of [28], and this fact was used to deduce the worst case regret of Thompson sampling with Gaussian distributions. A prior-dependent analysis was introduced by [9] using information-theoretic tools, and the idea was extended to first- and second-order regret bounds by [29].

Thompson sampling has also shown good experimental results (see, [30, 31]). As a result, the algorithm is used in practice, for example in recommendation systems (see, [32]). In reinforcement learning, a version of Thompson sampling called "Posterior Sampling for Reinforcement Learning" (PSRL) is used (see, [33, 34]), and bounds for this algorithm were proved in [35].

2 Model

A bit prediction game proceeds as follows. At time $t\in\left[T\right]=\{1,...,T\}$ the learner outputs a bit $\hat{\gamma}_{t}\in\{0,1\}$ . Then, the learner observes a bit $\gamma_{t}\in\{0,1\}$ and suffers a loss of $\ell\left(\hat{\gamma}_{t},\gamma_{t}\right)=\mathbb{I}\{\hat{\gamma}_{t}\neq\gamma_{t}\}$ .

We compare the loss of the online algorithm to a benchmark, which is the loss of the best static bit prediction. Given a bit sequence $\Gamma=\left(\gamma_{1},...,\gamma_{T}\right)$ , let the number of ones up to $t$ be $O_{t}\left(\Gamma\right)=\lvert\left\{i\in\left[t\right]:\gamma_{i}=1\right\}\rvert=\sum_{i=1}^{t}\gamma_{i}$ and the number of zeros be $Z_{t}\left(\Gamma\right)=\lvert\left\{i\in\left[t\right]:\gamma_{i}=0\right\}\rvert$ $=\sum_{i=1}^{t}\left(1-\gamma_{i}\right)$ . The loss of the best static bit prediction is

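$$static(\Gamma)=\min\left\{\sum_{t=1}^{T}\ell(1,\gamma_{t}),\ \sum_{t=1}^{T}\ell(0,\gamma_{t})\right\}=\min\{Z_{T}(\Gamma),O_{T}(\Gamma)\}.$$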

The goal of the learner is to minimize the regret, which is the difference between the online cumulative loss and the loss of the best static bit prediction. Specifically, for an algorithm $A$ ,

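$$Regret_{A}(\Gamma)=\sum_{t=1}^{T}\mathbb{E}_{\hat{\gamma}_{t}\sim A}\left[\ell(\hat{\gamma}_{t},\gamma_{t})\mid\Gamma\right]-static(\Gamma),$$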

where $\Gamma\in\{0,1\}^{T}$ is a fixed bit sequence, and the expectation is taken over the predictions of algorithm $A$. We extend the standard bit prediction game and define a generalized bit prediction game, where the false positive (FP) and false negative (FN) errors have different weights. (A false positive error is when the learner predicts $\hat{\gamma}_{t}=1$ and $\gamma_{t}=0$, and a false negative error is when $\hat{\gamma}_{t}=0$ and $\gamma_{t}=1$.) Given a trade-off parameter $q\in[0,1]$, we define a loss $\ell^{q}$ as follows,

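$$\ell^{q}(\hat{\gamma}_{t},\gamma_{t})=\begin{cases}q&\hat{\gamma}_{t}=1,\ \gamma_{t}=0\\ 1-q&\hat{\gamma}_{t}=0,\ \gamma_{t}=1\\ 0&\hat{\gamma}_{t}=\gamma_{t}\end{cases}$$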

Namely, the false positive errors are weighted by $q$ while the false negative errors are weighted by $1-q$ . Note that for $q=\frac{1}{2}$ , for any $(\hat{\gamma}_{t},\gamma_{t})$ we have that $\ell^{1/2}(\hat{\gamma}_{t},\gamma_{t})=\frac{1}{2}\ell(\hat{\gamma}_{t},\gamma_{t})$ , so for $q=\frac{1}{2}$ the extended loss is essentially the 0-1 loss.

Similarly, the benchmark for the generalized bit prediction is the best static bit prediction, namely,

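$$static^{q}(\Gamma)=\min\left\{\sum_{t=1}^{T}\ell^{q}(1,\gamma_{t}),\ \sum_{t=1}^{T}\ell^{q}(0,\gamma_{t})\right\}=\min\{qZ_{T}(\Gamma),(1-q)O_{T}(\Gamma)\},$$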

and the regret of algorithm $A$ on a given bit sequence $\Gamma\in\{0,1\}^{T}$ is

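$$Regret^{q}_{A}(\Gamma)=\sum_{t=1}^{T}\mathbb{E}_{\hat{\gamma}_{t}\sim A}\left[\ell^{q}(\hat{\gamma}_{t},\gamma_{t})\mid\Gamma\right]-static^{q}(\Gamma).$$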
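The definitions above can be made concrete with a short sketch. It evaluates $\ell^{q}$, $static^{q}$, and the regret of one fixed, deterministic prediction sequence. For a randomized algorithm such as Thompson sampling, the regret above averages this quantity over the algorithm's predictions. The function names are hypothetical.

```python
from fractions import Fraction
from typing import Sequence


def loss_q(pred: int, bit: int, q: Fraction) -> Fraction:
    """ell^q: a false positive costs q, a false negative costs 1 - q, a correct guess costs 0."""
    if pred == 1 and bit == 0:
        return q
    if pred == 0 and bit == 1:
        return 1 - q
    return Fraction(0)


def static_q(bits: Sequence[int], q: Fraction) -> Fraction:
    """Loss of the best constant prediction: min{q * Z_T, (1 - q) * O_T}."""
    ones = sum(bits)
    zeros = len(bits) - ones
    return min(q * zeros, (1 - q) * ones)


def regret_of_predictions(preds: Sequence[int], bits: Sequence[int], q: Fraction) -> Fraction:
    """Cumulative ell^q loss of the given predictions minus static^q."""
    total = sum((loss_q(p, b, q) for p, b in zip(preds, bits)), Fraction(0))
    return total - static_q(bits, q)


bits = [0, 1, 1, 0]
q = Fraction(1, 3)
print(static_q(bits, q))                              # min{2/3, 4/3} = 2/3
print(regret_of_predictions([0, 0, 0, 0], bits, q))   # two false negatives: 4/3 - 2/3 = 2/3
```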

2.1 Distributions

We make extensive use of the Beta distribution, denoted by $\beta(a,b)$, where $a,b>0$, and the Binomial distribution, denoted by $Bin(n,p)$, where $n$ is the number of trials and $p\in[0,1]$ is the success probability. We denote by $Ber(p)$ a Bernoulli random variable with success probability $p\in[0,1]$. For a distribution $D$, the Cumulative Distribution Function (CDF) is denoted by $F_{D}$.

The following identity is a well known fact related to the Beta distribution (see, [36], Eq. 8.17.4).

Fact 1.

For $a,b\in\mathbb{N^{+}}$ and $p\in[0,1]$ we have $F_{\beta(a,b)}(p)=1-F_{\beta(b,a)}(1-p)$ .

The $\beta(a,b)$ distribution is widely used in the Bayesian setting to define the uncertainty over the parameter $p$ of a Bernoulli random variable $Ber(p)$. The distribution $\beta(1,1)$, which is the uniform distribution over $[0,1]$, is used as the prior distribution of $p$. Given $a+b$ observations of the random variable $Ber(p)$, where $a$ is the number of realizations which are $1$ and $b$ is the number of realizations which are $0$, the posterior distribution of $p$ is $\beta(a+1,b+1)$ (assuming the prior distribution is $\beta(1,1)$).

The following is a well known property of the CDF of the Beta distribution.

Fact 2.

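([36], Eq. 8.17.20-21) For every $x\in\left[0,1\right]$ and $a,b\in\mathbb{R}$ s.t. $a,b>0$, the following holds

$$F_{\beta(a+1,b)}(x)=F_{\beta(a,b)}(x)-\frac{x^{a}(1-x)^{b}}{aB(a,b)}\quad\text{and}\quad F_{\beta(a,b+1)}(x)=F_{\beta(a,b)}(x)+\frac{x^{a}(1-x)^{b}}{bB(a,b)},$$

where $B(a,b)=\int_{0}^{1}s^{a-1}(1-s)^{b-1}ds$ is the Beta function; for positive integers $a,b$ it equals $B(a,b)=\frac{(a-1)!(b-1)!}{(a+b-1)!}$.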
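Facts 1 and 2 can be sanity-checked with exact arithmetic for integer parameters. This is an illustration only. It assumes integer $a,b$, so that $B(a,b)=\frac{(a-1)!(b-1)!}{(a+b-1)!}$ and the binomial form $F_{\beta(a,b)}(x)=\Pr[Bin(a+b-1,x)\geq a]$ apply. The grid of test points and the helper names are our own choices.

```python
from fractions import Fraction
from math import comb, factorial


def beta_cdf(a: int, b: int, x: Fraction) -> Fraction:
    """Exact F_{beta(a,b)}(x) for positive integers a, b via Pr[Bin(a+b-1, x) >= a]."""
    n = a + b - 1
    p, d = x.numerator, x.denominator
    return Fraction(sum(comb(n, j) * p**j * (d - p) ** (n - j) for j in range(a, n + 1)), d**n)


def beta_fn(a: int, b: int) -> Fraction:
    """B(a, b) = (a-1)!(b-1)!/(a+b-1)! for positive integers a, b."""
    return Fraction(factorial(a - 1) * factorial(b - 1), factorial(a + b - 1))


def check_facts(max_param: int = 8) -> int:
    """Assert Fact 1 and Fact 2 on a small grid and return the number of grid points checked."""
    points = (Fraction(1, 5), Fraction(1, 3), Fraction(1, 2), Fraction(3, 4))
    checked = 0
    for a in range(1, max_param + 1):
        for b in range(1, max_param + 1):
            for x in points:
                f = beta_cdf(a, b, x)
                assert f == 1 - beta_cdf(b, a, 1 - x)                           # Fact 1
                term = x**a * (1 - x) ** b
                assert beta_cdf(a + 1, b, x) == f - term / (a * beta_fn(a, b))  # Fact 2, first identity
                assert beta_cdf(a, b + 1, x) == f + term / (b * beta_fn(a, b))  # Fact 2, second identity
                checked += 1
    return checked


print(check_facts(), "grid points checked")
```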

For the analysis we use the following theorem regarding the tail of the $\beta(a,b)$ distribution, where we fix the parameter $b=n+1$ and sum over the parameters $a\geq n+2$.

Theorem 3.

For every $n\geq 1$ we have $\sum_{i=n+1}^{\infty}F_{\beta(i+1,n+1)}\left(\frac{1}{2}\right)=O(\sqrt{n})$ .
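Theorem 3 can be illustrated numerically. The sketch below is our own, and its function name is hypothetical. It rewrites each term as a binomial tail, $F_{\beta(i+1,n+1)}(\frac{1}{2})=\Pr[Bin(i+n+1,\frac{1}{2})\geq i+1]=\Pr[Bin(i+n+1,\frac{1}{2})\leq n]$. It then truncates the series once terms become negligible. The printed ratio to $\sqrt{n}$ should stay bounded if the $O(\sqrt{n})$ bound holds.

```python
import math


def beta_tail_sum(n: int, tol: float = 1e-15) -> float:
    """Approximate sum_{i >= n+1} F_{beta(i+1, n+1)}(1/2), stopping once a term drops below tol.

    Each term equals Pr[Bin(i + n + 1, 1/2) <= n], which decreases as i grows.
    """
    total = 0.0
    i = n + 1
    while True:
        m = i + n + 1
        term = sum(math.comb(m, j) for j in range(n + 1)) / 2**m
        total += term
        if term < tol:
            return total
        i += 1


for n in (1, 4, 16, 64, 128):
    s = beta_tail_sum(n)
    print(f"n={n:4d}  sum={s:9.4f}  sum/sqrt(n)={s / math.sqrt(n):.4f}")
```

For $n=1$ the series is $\sum_{m\geq 4}\frac{m+1}{2^{m}}=\frac{3}{4}$, which gives a simple check of the truncation.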

2.2 Notations

When the bit sequence $\Gamma=(\gamma_{1},\dots,\gamma_{T})$ can be inferred from the context, we use $O_{t}$ and $Z_{t}$ rather than $O_{t}(\Gamma)$ and $Z_{t}(\Gamma)$ .

We also define the $sign$ function as $sign(x)=\left\{\begin{smallmatrix}1&x>0\\ 0&x=0\\ -1&x<0\end{smallmatrix}\right.$ .

For functions $f,g:\mathbb{R}\rightarrow\mathbb{R}$ we denote $g=O(f)$ iff there exist $c_{1},c_{2}\in\mathbb{R}$ such that $g(x)\leq c_{1}f(x)+c_{2}$ for every $x\in\mathbb{R}$.

3 Thompson sampling for bit prediction

The Thompson sampling algorithm requires a prior distribution for its initialization. Given the observations, it updates the prior distribution to a posterior distribution. The learner samples the posterior distribution and thresholds the sampled value at $\frac{1}{2}$ (for bit prediction) or at $q$ (for generalized bit prediction).

More specifically, we consider the prior distribution $\beta(1,1)$, which is the uniform distribution over $[0,1]$. Note that this prior is fictitious and is used only to initialize the Thompson sampling algorithm. At time $t$ the learner samples a value $x_{t}$ from the distribution $\beta(O_{t-1}+1,Z_{t-1}+1)$, where $O_{t-1}$ and $Z_{t-1}$ are the numbers of observed $1$’s and $0$’s up to time $t-1$, respectively. At time $t$ the learner predicts $\hat{\gamma}_{t}=\mathbb{I}\{x_{t}>q\}$, where $q$ is the trade-off parameter of the loss. Then the learner observes the feedback bit $\gamma_{t}$ and suffers loss $\ell^{q}(\hat{\gamma}_{t},\gamma_{t})$. The resulting Thompson sampling algorithm is described in Algorithm 1, and in the analysis we refer to this algorithm as $TS(q)$.
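Algorithm 1 itself is not reproduced in this text. The following is a minimal Python sketch of $TS(q)$ as described in the paragraph above, using the standard library's `random.Random.betavariate` to sample the posterior. The function names, the seed, and the Monte Carlo averaging are our own additions and are not part of the paper.

```python
import random
from typing import List, Sequence


def ts_q_predictions(bits: Sequence[int], q: float, rng: random.Random) -> List[int]:
    """Run TS(q) with a beta(1,1) prior on a fixed bit sequence and return its predictions."""
    ones = zeros = 0
    preds = []
    for bit in bits:
        x = rng.betavariate(ones + 1, zeros + 1)  # x_t ~ beta(O_{t-1}+1, Z_{t-1}+1)
        preds.append(1 if x > q else 0)           # predict 1 iff x_t > q
        if bit == 1:                              # observe gamma_t and update the posterior counts
            ones += 1
        else:
            zeros += 1
    return preds


def weighted_loss(preds: Sequence[int], bits: Sequence[int], q: float) -> float:
    """Total ell^q loss: q per false positive, 1 - q per false negative."""
    fp = sum(1 for p, b in zip(preds, bits) if p == 1 and b == 0)
    fn = sum(1 for p, b in zip(preds, bits) if p == 0 and b == 1)
    return q * fp + (1 - q) * fn


def average_loss(bits: Sequence[int], q: float, runs: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of the expected ell^q loss of TS(q) on a fixed sequence."""
    rng = random.Random(seed)
    return sum(weighted_loss(ts_q_predictions(bits, q, rng), bits, q) for _ in range(runs)) / runs


print(average_loss([0, 1, 1], q=0.5))
```

For $q=\frac{1}{2}$ and $\Gamma=011$, the exact expected number of errors is $\frac{1}{2}+\frac{3}{4}+\frac{1}{2}=\frac{7}{4}$. The estimate should therefore be close to $\frac{7}{8}$ under the weighted loss $\ell^{1/2}$.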

In Section 4 we prove the “Swapping Lemma”, which analyses the effect of a single swap on the regret, which allows us to identify the sequences with the largest and smallest regret. In Section 5 we bound the regret of these sequences, thereby obtaining tight upper and lower bounds on the regret. Section 6 addresses the generalized bit prediction case.

4 Swapping Lemma

In this section we compare the regret of two bit sequences which differ by a single swap. This is an essential building block in our analysis of the worst case and the best case regret of the Thompson sampling algorithm.

Swap operation: Given a bit sequence $\Gamma=(\gamma_{1},\ldots,\gamma_{T})$, performing the swap operation at position $t\in\left[T-1\right]$ results in a sequence that exchanges $\gamma_{t}$ and $\gamma_{t+1}$ in $\Gamma$ and keeps all other bits unchanged. Formally, $Swap(\Gamma,t)=(\gamma_{1},\ldots,\gamma_{t-1},\gamma_{t+1},\gamma_{t},\gamma_{t+2},\ldots,\gamma_{T})$.

The swapping lemma compares the regret of Thompson sampling, $TS(q)$, on the bit sequences $\Gamma$ and $Swap(\Gamma,t)$.

To illustrate the swapping lemma, consider the case $q=\frac{1}{2}$, so $\frac{q}{1-q}=1$. If there are more zeros than ones up to position $t-1$, then moving the one earlier increases the regret. If there are more ones than zeros up to position $t-1$, then moving the zero earlier increases the regret. More precisely, for each $t$ such that $\gamma_{t}=0$, $\gamma_{t+1}=1$ and $O_{t-1}<Z_{t-1}$, swapping $\gamma_{t}$ and $\gamma_{t+1}$ increases the regret. Similarly, if $\gamma_{t}=1$, $\gamma_{t+1}=0$ and $O_{t-1}>Z_{t-1}$, then swapping $\gamma_{t}$ and $\gamma_{t+1}$ increases the regret. The general statement is as follows.

Lemma 4 (Swapping Lemma).

Fix a bit sequence $\Gamma=\left(\gamma_{1},\ldots,\gamma_{T}\right)\in\left\{0,1\right\}^{T}$ . For every $t$ , such that $\gamma_{t}=0$ and $\gamma_{t+1}=1$ , we have

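$$Regret^{q}_{TS(q)}(\Gamma)<Regret^{q}_{TS(q)}(Swap(\Gamma,t))\iff\frac{q}{1-q}>\frac{O_{t-1}+1}{Z_{t-1}+1}.$$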

For every $t$ , such that $\gamma_{t}=1$ and $\gamma_{t+1}=0$ , we have

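$$Regret^{q}_{TS(q)}(\Gamma)<Regret^{q}_{TS(q)}(Swap(\Gamma,t))\iff\frac{q}{1-q}<\frac{O_{t-1}+1}{Z_{t-1}+1}.$$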

In addition,

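$$Regret^{q}_{TS(q)}(\Gamma)=Regret^{q}_{TS(q)}(Swap(\Gamma,t))\iff\frac{q}{1-q}=\frac{O_{t-1}+1}{Z_{t-1}+1}.$$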

Proof Sketch. We consider the difference between the regret of $TS(q)$ on the bit sequence $\Gamma$ and on the bit sequence $Swap(\Gamma,t)$. The two bit sequences differ only at locations $t$ and $t+1$. Since the benchmark of a sequence depends only on the total number of zeros and ones in the sequence, the benchmarks on $\Gamma$ and $Swap(\Gamma,t)$ are identical, i.e., $static^{q}(\Gamma)=static^{q}(Swap(\Gamma,t))$. Moreover, the two sequences share the same prefix before time $t$, and after time $t+1$ they have the same bits and the same counts $O$ and $Z$, so $TS(q)$ has the same expected loss on both at every other time step. Therefore, the difference between the regrets is equal to the difference between the expected losses at times $t$ and $t+1$.

Consider time $t\in\left[T-1\right]$ such that $\gamma_{t}=0$ and $\gamma_{t+1}=1$. Using the insights above, it is easy to show that

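$$Regret^{q}_{TS(q)}(\Gamma)-Regret^{q}_{TS(q)}(Swap(\Gamma,t))=(1-q)F_{\beta(O_{t-1}+1,Z_{t-1}+2)}(q)+qF_{\beta(O_{t-1}+2,Z_{t-1}+1)}(q)-F_{\beta(O_{t-1}+1,Z_{t-1}+1)}(q).$$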

Using the recurrence relations in Fact 2 we show that,

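$$Regret^{q}_{TS(q)}(\Gamma)-Regret^{q}_{TS(q)}(Swap(\Gamma,t))=\frac{q^{O_{t-1}+1}(1-q)^{Z_{t-1}+1}}{B(O_{t-1}+1,Z_{t-1}+1)}\left(\frac{1-q}{Z_{t-1}+1}-\frac{q}{O_{t-1}+1}\right).$$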

Since $\frac{q^{O_{t-1}+1}(1-q)^{Z_{t-1}+1}}{B\left(O_{t-1}+1,Z_{t-1}+1\right)}>0$ , we have

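$$Regret^{q}_{TS(q)}(\Gamma)<Regret^{q}_{TS(q)}(Swap(\Gamma,t))\iff\frac{q}{1-q}>\frac{O_{t-1}+1}{Z_{t-1}+1},$$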

and equality holds iff $\frac{q}{1-q}=\frac{O_{t-1}+1}{Z_{t-1}+1}$ . The second case, where $\gamma_{t}=1$ and $\gamma_{t+1}=0$ , is similar.
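The closed form in the proof sketch can be verified exhaustively on short sequences. The sketch below uses hypothetical helper names and exact rational arithmetic. It computes $Regret^{q}_{TS(q)}$ directly and applies $Swap(\Gamma,t)$ wherever $\gamma_{t}=0$ and $\gamma_{t+1}=1$. It then compares the regret difference with $\frac{q^{O_{t-1}+1}(1-q)^{Z_{t-1}+1}}{B(O_{t-1}+1,Z_{t-1}+1)}\left(\frac{1-q}{Z_{t-1}+1}-\frac{q}{O_{t-1}+1}\right)$ and with the sign criterion of Lemma 4. The choice of lengths and values of $q$ is arbitrary.

```python
from fractions import Fraction
from itertools import product
from math import comb, factorial


def beta_cdf(a: int, b: int, x: Fraction) -> Fraction:
    """Exact F_{beta(a,b)}(x) for positive integers a, b via Pr[Bin(a+b-1, x) >= a]."""
    n = a + b - 1
    p, d = x.numerator, x.denominator
    return Fraction(sum(comb(n, j) * p**j * (d - p) ** (n - j) for j in range(a, n + 1)), d**n)


def beta_fn(a: int, b: int) -> Fraction:
    """B(a, b) = (a-1)!(b-1)!/(a+b-1)! for positive integers a, b."""
    return Fraction(factorial(a - 1) * factorial(b - 1), factorial(a + b - 1))


def ts_expected_regret_q(bits: str, q: Fraction) -> Fraction:
    """Exact Regret^q of TS(q) on a fixed bit string."""
    ones = zeros = 0
    loss = Fraction(0)
    for ch in bits:
        p_zero = beta_cdf(ones + 1, zeros + 1, q)  # Pr[predict 0]
        if ch == "1":
            loss, ones = loss + (1 - q) * p_zero, ones + 1
        else:
            loss, zeros = loss + q * (1 - p_zero), zeros + 1
    return loss - min(q * zeros, (1 - q) * ones)


def swap(bits: str, t: int) -> str:
    """Swap(Gamma, t): exchange the bits at (1-indexed) positions t and t + 1."""
    i = t - 1
    return bits[:i] + bits[i + 1] + bits[i] + bits[i + 2:]


def check_swapping_lemma(max_len: int = 6) -> int:
    """Check the closed-form difference and Lemma 4's sign criterion; return the number of cases."""
    cases = 0
    for length in range(2, max_len + 1):
        for tup in product("01", repeat=length):
            bits = "".join(tup)
            for t in range(1, length):
                if bits[t - 1] != "0" or bits[t] != "1":
                    continue  # the closed form is stated for gamma_t = 0, gamma_{t+1} = 1
                o, z = bits[: t - 1].count("1"), bits[: t - 1].count("0")  # O_{t-1}, Z_{t-1}
                for q in (Fraction(1, 3), Fraction(1, 2), Fraction(2, 3)):
                    diff = ts_expected_regret_q(bits, q) - ts_expected_regret_q(swap(bits, t), q)
                    coeff = q ** (o + 1) * (1 - q) ** (z + 1) / beta_fn(o + 1, z + 1)
                    assert diff == coeff * ((1 - q) / (z + 1) - q / (o + 1))
                    assert (diff < 0) == (q / (1 - q) > Fraction(o + 1, z + 1))
                    cases += 1
    return cases


print(check_swapping_lemma(), "cases checked")
```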

5 Regret characterization for $q=\frac{1}{2}$

In this section we use the swapping lemma to characterize the sequences on which $TS(\frac{1}{2})$ has the largest and smallest regret. We denote by $k$ the number of zeros in the sequence and characterize the sequences of worst and best regret for each $k$ . Notice that we may assume that $k\leq\frac{T}{2}$ since any sequence $\Gamma$ has the same regret as the sequence $\Gamma^{\prime}$ obtained from $\Gamma$ by flipping each bit. Indeed, $static(\Gamma)=static(\Gamma^{\prime})$ and the expected loss of $TS(\frac{1}{2})$ on $\Gamma$ and $\Gamma^{\prime}$ is the same (by Fact 1).
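The bit-flip symmetry used here can also be checked exhaustively on short sequences. The sketch below is illustrative only and uses hypothetical helper names. It uses the exact binomial form of $F_{\beta(a,b)}(\frac{1}{2})$ and counts errors with the 0-1 loss.

```python
from fractions import Fraction
from itertools import product
from math import comb


def beta_cdf_half(a: int, b: int) -> Fraction:
    """Exact F_{beta(a,b)}(1/2) = Pr[Bin(a+b-1, 1/2) >= a] for positive integers a, b."""
    n = a + b - 1
    return Fraction(sum(comb(n, j) for j in range(a, n + 1)), 2**n)


def ts_half_regret(bits: str) -> Fraction:
    """Exact expected errors of TS(1/2) minus min{O_T, Z_T}."""
    ones = zeros = 0
    errors = Fraction(0)
    for ch in bits:
        p_zero = beta_cdf_half(ones + 1, zeros + 1)  # Pr[predict 0]
        if ch == "1":
            errors, ones = errors + p_zero, ones + 1
        else:
            errors, zeros = errors + 1 - p_zero, zeros + 1
    return errors - min(ones, zeros)


def flip(bits: str) -> str:
    """Gamma' obtained from Gamma by flipping every bit."""
    return bits.translate(str.maketrans("01", "10"))


mismatches = [
    s
    for n in range(1, 11)
    for s in ("".join(t) for t in product("01", repeat=n))
    if ts_half_regret(s) != ts_half_regret(flip(s))
]
print("sequences whose regret changes under flipping:", mismatches)
```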

5.1 Worst-case regret

Consider bit sequences $\Gamma=(\gamma_{1},\ldots,\gamma_{T})$ with $k$ zeros, where $k\leq\frac{T}{2}$ . We first show that among these bit sequences the ones of largest regret are of the form $\{01,10\}^{k}1^{T-2k}$ . Then, we prove that the regret of each of these sequences is $\Theta(\sqrt{k})$ .

Theorem 5.

For any $\Gamma_{1},\Gamma_{2}\in\{01,10\}^{k}1^{T-2k}$ we have $Regret^{1/2}_{TS(\frac{1}{2})}(\Gamma_{1})=Regret^{1/2}_{TS(\frac{1}{2})}(\Gamma_{2})$. In addition, for any $\Gamma_{3}\in\{0,1\}^{T}$ with $k$ zeros such that $\Gamma_{3}\notin\{01,10\}^{k}1^{T-2k}$, we have $Regret^{1/2}_{TS(\frac{1}{2})}(\Gamma_{1})>Regret^{1/2}_{TS(\frac{1}{2})}(\Gamma_{3})$.

Proof.

Note that for any $i\in\{0,\ldots,k-1\}$ we have $O_{2i}(\Gamma_{1})=Z_{2i}(\Gamma_{1})=i$, so $\frac{O_{2i}+1}{Z_{2i}+1}=1=\frac{q}{1-q}$ for $q=\frac{1}{2}$. By Lemma 4 this implies that $Regret^{1/2}_{TS(\frac{1}{2})}(\Gamma_{1})=Regret^{1/2}_{TS(\frac{1}{2})}(Swap(\Gamma_{1},2i+1))$. Since we can transform $\Gamma_{1}$ into $\Gamma_{2}$ by a sequence of swap operations at locations of the form $2i+1$, each of which keeps the sequence in $\{01,10\}^{k}1^{T-2k}$, it follows that $Regret^{1/2}_{TS(\frac{1}{2})}(\Gamma_{1})=Regret^{1/2}_{TS(\frac{1}{2})}(\Gamma_{2})$. This implies that all the sequences of the form $\{01,10\}^{k}1^{T-2k}$ have the same regret.

Let $\Gamma_{3}=(\gamma_{1},\ldots,\gamma_{T})\in\{0,1\}^{T}$ be a bit sequence of length $T$ with $k$ zeros such that $\Gamma_{3}\notin\{01,10\}^{k}1^{T-2k}$ . We show that for some $t\in\left[T\right]$ , the sequence $Swap(\Gamma_{3},t)$ has a regret larger than $\Gamma_{3}$ .

Since $\Gamma_{3}\notin\{01,10\}^{k}1^{T-2k}$, there is an index $i\in\{0,\ldots,k-1\}$ such that either $\gamma_{2i+1}=\gamma_{2i+2}=1$ or $\gamma_{2i+1}=\gamma_{2i+2}=0$. Let $i$ be the smallest such index. Assume that $\gamma_{2i+1}=\gamma_{2i+2}=1$. (The case of $\gamma_{2i+1}=\gamma_{2i+2}=0$ is similar.) It follows that $O_{2i}=Z_{2i}$ and $O_{2i+1}=Z_{2i+1}+1$. Let $j>2i+2$ be the minimal index such that $\gamma_{j}=0$. Such an index must exist, since there are $k$ zeros in $\Gamma_{3}$ and up to index $2i+2$ there are only $i\leq k-1$ zeros. Since $\gamma_{2i+1}=\cdots=\gamma_{j-1}=1$, we have $Z_{j-2}=Z_{2i+1}$ and $O_{j-2}\geq O_{2i+1}=Z_{j-2}+1$, hence $\frac{O_{j-2}+1}{Z_{j-2}+1}>1=\frac{q}{1-q}$. By Lemma 4 (applied at position $j-1$, where $\gamma_{j-1}=1$ and $\gamma_{j}=0$), the sequence $Swap(\Gamma_{3},j-1)$ has regret higher than $\Gamma_{3}$, i.e., $Regret^{1/2}_{TS(\frac{1}{2})}(\Gamma_{3})<Regret^{1/2}_{TS(\frac{1}{2})}(Swap(\Gamma_{3},j-1))$.

Since there are finitely many bit sequences of length $T$ with $k$ zeros, a sequence of largest regret exists, and by the above it must be of the form $\{01,10\}^{k}1^{T-2k}$ . ∎
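Theorem 5 can be sanity-checked by brute force on a small instance. The sketch below is an illustration under stated assumptions, not the authors' code: it computes the exact expected regret of $TS(\frac{1}{2})$ with rational arithmetic, using the error probability given the history (Lemma 26) and the Beta/Binomial identity of Fact 18 for the Beta CDF, and it counts every error as $1$ , so the static benchmark makes $\min\{Z_{T},O_{T}\}$ errors. The helper names (`beta_cdf`, `regret_half`, `sequences_with_zeros`) are hypothetical.

```python
from fractions import Fraction
from itertools import combinations, product
from math import comb


def beta_cdf(a: int, b: int, p: Fraction) -> Fraction:
    """F_{Beta(a,b)}(p) for positive integers a, b, via Fact 18:
    F_{Beta(a,b)}(p) = 1 - F_{Bin(a+b-1,p)}(a-1)."""
    n = a + b - 1
    return 1 - sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(a))


def regret_half(seq) -> Fraction:
    """Exact expected regret of TS(1/2) on `seq`, counting each error as 1."""
    half = Fraction(1, 2)
    ones = zeros = 0
    errors = Fraction(0)
    for bit in seq:
        # Pr[x <= 1/2] for x ~ Beta(ones+1, zeros+1), i.e. TS(1/2) predicts 0.
        p_zero = beta_cdf(ones + 1, zeros + 1, half)
        if bit == 0:
            errors += 1 - p_zero  # false positive
            zeros += 1
        else:
            errors += p_zero      # false negative
            ones += 1
    return errors - min(zeros, ones)  # static prediction makes min{Z_T, O_T} errors


def sequences_with_zeros(T: int, k: int):
    """All bit sequences of length T with exactly k zeros."""
    for zero_positions in combinations(range(T), k):
        yield tuple(0 if i in zero_positions else 1 for i in range(T))


T, k = 8, 3
regrets = {s: regret_half(s) for s in sequences_with_zeros(T, k)}
largest = max(regrets.values())
argmax = {s for s, r in regrets.items() if r == largest}
family = {sum(pairs, ()) + (1,) * (T - 2 * k)
          for pairs in product([(0, 1), (1, 0)], repeat=k)}
print(argmax == family, float(largest))
```

Exact `Fraction` arithmetic is used so that ties between sequences of equal regret are detected exactly rather than up to floating-point noise.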

Given the above theorem, to bound the worst case regret of $TS(\frac{1}{2})$ , we can focus on the sequence $W_{T}^{k}=\{01\}^{k}1^{T-2k}$ and bound $Regret^{1/2}_{TS(\frac{1}{2})}(W_{T}^{k})$ .

Theorem 6.

For every $T\in\mathbb{N}^{+}$ and $k\leq\frac{T}{2}$ we have, $Regret^{1/2}_{TS(\frac{1}{2})}(W_{T}^{k})=\Theta(\sqrt{k})$ .

**Proof Sketch.** Let $W_{T}^{k}=(w_{1},\ldots,w_{T})$ , where we have: (1) $w_{t}=0$ for $t\in A_{1}=\{2i-1\mid i\in[k]\}$ , (2) $w_{t}=1$ for $t\in A_{2}=\{2i\mid i\in[k]\}$ , and (3) $w_{t}=1$ for $t\in A_{3}=\{i\mid i\geq 2k+1\}$ . We bound the expected number of errors made by $TS(\frac{1}{2})$ on each of these three subsets. Then, from these bounds we derive a bound on the loss and the regret. Specifically, we prove the following:

For $t\in A_{1}$ , $Z_{t-1}=O_{t-1}$ , so $TS(\frac{1}{2})$ predicts each bit with probability $\frac{1}{2}$ . Therefore, the expected number of false positive errors in $A_{1}$ is

$\sum_{t\in A_{1}}\Pr\left[\hat{\gamma}_{t}=1\right]=\frac{k}{2}$ .

For $t\in A_{2}$ , $Z_{t-1}=O_{t-1}+1$ , so the probability of predicting $0$ exceeds the probability of predicting $1$ , but only by a small amount that can be bounded. Therefore, the expected number of false negative errors in $A_{2}$ is

$\sum_{t\in A_{2}}\Pr\left[\hat{\gamma}_{t}=0\right]=\frac{k}{2}+\Theta(\sqrt{k})$ .

The expected number of false negative errors in $A_{3}$ is shown to be

[TABLE]

where the last equality follows from Theorem 3.

Summing up the errors over $A_{1}$ , $A_{2}$ , and $A_{3}$ , and recalling that the static prediction makes $\min\{T-k,k\}=k$ errors, we bound the regret as follows

[TABLE]
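The $A_{1}$ and $A_{2}$ terms of this proof sketch can be evaluated exactly. In the sketch below (hypothetical helper names; every error counts as $1$ ), the error on $A_{1}$ is exactly $\frac{k}{2}$ , while at $t=2i\in A_{2}$ the error probability is $\Pr[Bin(2i,\frac{1}{2})\geq i]=\frac{1}{2}+\binom{2i}{i}2^{-(2i+1)}$ , by Fact 18 and the symmetry of $Bin(2i,\frac{1}{2})$ . Summing the excess over $i\in[k]$ gives a term that grows like $\sqrt{k/\pi}$ , one source of the $\sqrt{k}$ in Theorem 6. The $A_{3}$ term (Theorem 3) is not computed here.

```python
import math
from fractions import Fraction
from math import comb


def beta_cdf(a: int, b: int, p: Fraction) -> Fraction:
    """F_{Beta(a,b)}(p) for positive integers a, b, via Fact 18."""
    n = a + b - 1
    return 1 - sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(a))


def block_errors(k: int):
    """Exact expected errors of TS(1/2) on A1 and A2 for W_T^k = (01)^k 1^(T-2k)."""
    half = Fraction(1, 2)
    a1 = a2 = Fraction(0)
    for i in range(1, k + 1):
        # t = 2i-1 in A1: O_{t-1} = Z_{t-1} = i-1, bit 0, error iff TS predicts 1.
        a1 += 1 - beta_cdf(i, i, half)
        # t = 2i in A2: O_{t-1} = i-1, Z_{t-1} = i, bit 1, error iff TS predicts 0.
        a2 += beta_cdf(i, i + 1, half)
    return a1, a2


def a2_excess(k: int) -> Fraction:
    """Closed form of (errors on A2) - k/2: sum_{i=1}^k C(2i, i) / 2^(2i+1)."""
    return sum(Fraction(comb(2 * i, i), 2 ** (2 * i + 1)) for i in range(1, k + 1))


def a2_excess_float(k: int) -> float:
    """Same sum in floating point, using C(2i,i)/4^i = C(2i-2,i-1)/4^(i-1) * (2i-1)/(2i)."""
    c, total = 1.0, 0.0
    for i in range(1, k + 1):
        c *= (2 * i - 1) / (2 * i)  # c = C(2i, i) / 4^i
        total += c / 2
    return total


for k in (10, 100, 1000, 10_000):
    print(k, a2_excess_float(k) / math.sqrt(k))
```

The printed ratio should settle near $1/\sqrt{\pi}\approx 0.564$ , consistent with a $\Theta(\sqrt{k})$ excess on $A_{2}$ .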

Since $k\leq\frac{T}{2}$ , we have the following corollary.

Corollary 7.

For any sequence of length $T$ , the regret of $TS(\frac{1}{2})$ is at most $O(\sqrt{T})$ .

Remark 8.

Note that we in fact proved that the largest regret of $TS(\frac{1}{2})$ over all sequences with the same numbers of zeros and ones as $\Gamma$ is $\Theta(\sqrt{\min\{O_{T}(\Gamma),Z_{T}(\Gamma)\}})$ .

5.2 Best-case regret

In this subsection, we characterize the sequences with the lowest regret and bound their regret.

Theorem 9.

Among the bit sequences of length $T$ with $k<\frac{T}{2}$ zeros, the sequence with the lowest regret is $B_{T}^{k}=1^{T-k}0^{k}$ . For $k=\frac{T}{2}$ , both $1^{T/2}0^{T/2}$ and $0^{T/2}1^{T/2}$ have the lowest regret.

We now bound the regret of $B_{T}^{k}$ .

Theorem 10.

For every $T\in\mathbb{N}^{+}$ and $k\leq\frac{T}{2}$ we have, $Regret^{1/2}_{TS(\frac{1}{2})}(B_{T}^{k})\leq 1$ , where $B_{T}^{k}=1^{T-k}0^{k}$ .
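Theorems 9 and 10 can be checked the same way on small instances. This is a minimal sketch, assuming every error counts as $1$ and reusing the exact-regret computation via Fact 18; the function names are hypothetical. For each $k<\frac{T}{2}$ it reports the set of regret minimizers, which should be exactly $\{B_{T}^{k}\}$ , and their regret, which should be at most $1$ .

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def beta_cdf(a: int, b: int, p: Fraction) -> Fraction:
    """F_{Beta(a,b)}(p) for positive integers a, b, via Fact 18."""
    n = a + b - 1
    return 1 - sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(a))


def regret_half(seq) -> Fraction:
    """Exact expected regret of TS(1/2), counting each error as 1."""
    half = Fraction(1, 2)
    ones = zeros = 0
    errors = Fraction(0)
    for bit in seq:
        p_zero = beta_cdf(ones + 1, zeros + 1, half)  # Pr[TS(1/2) predicts 0]
        errors += (1 - p_zero) if bit == 0 else p_zero
        zeros += 1 - bit
        ones += bit
    return errors - min(zeros, ones)


def sequences_with_zeros(T: int, k: int):
    """All bit sequences of length T with exactly k zeros."""
    for zero_positions in combinations(range(T), k):
        yield tuple(0 if i in zero_positions else 1 for i in range(T))


def lowest_regret_sequences(T: int, k: int):
    """All length-T sequences with k zeros that minimize the regret, and that regret."""
    regrets = {s: regret_half(s) for s in sequences_with_zeros(T, k)}
    lowest = min(regrets.values())
    return {s for s, r in regrets.items() if r == lowest}, lowest


T = 8
for k in range(T // 2 + 1):
    argmin, lowest = lowest_regret_sequences(T, k)
    print(k, sorted(argmin), float(lowest))
```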

6 Regret characterization for a general $q$

To get some intuition regarding this generalization to an arbitrary trade-off parameter $q$ , consider the following simple example. Assume that $q=\frac{1}{3}$ , so that $\frac{q}{1-q}=\frac{1}{2}$ , and let us construct a sequence whose regret cannot be increased by swapping any pair of consecutive bits. This sequence cannot start with a $1$ : if it did, then by the swapping lemma (Lemma 4) we would be able to increase the regret by swapping the first $0$ with the $1$ preceding it. So we must start with a $0$ . In general, we determine bit $t+1$ by comparing $\frac{O_{t}+1}{Z_{t}+1}$ to $\frac{1}{2}$ (i.e., $\frac{q}{1-q}$ ). If they are equal, then the bit in position $t+1$ can be either $0$ or $1$ . If $\frac{O_{t}+1}{Z_{t}+1}>\frac{1}{2}$ , the bit in position $t+1$ is $0$ , since otherwise we would be able to increase the regret by swapping the first $0$ following position $t+1$ with its preceding $1$ . Similarly, if $\frac{O_{t}+1}{Z_{t}+1}<\frac{1}{2}$ , the bit in position $t+1$ is $1$ , since otherwise we would be able to increase the regret by swapping the first $1$ following position $t+1$ with its preceding $0$ .

It follows that the second bit can be either $0$ or $1$ , since $\frac{O_{1}+1}{Z_{1}+1}=\frac{q}{1-q}=\frac{1}{2}$ . If we place a $0$ at position $2$ , then $\frac{O_{2}+1}{Z_{2}+1}=\frac{1}{3}<\frac{1}{2}$ , and therefore we must continue with a $1$ at position $3$ . Then $\frac{O_{3}+1}{Z_{3}+1}=\frac{2}{3}>\frac{1}{2}$ , so we put a $0$ at position $4$ , and we are back in the situation where $\frac{O_{4}+1}{Z_{4}+1}=\frac{1}{2}$ , so we can choose either $0$ or $1$ at position $5$ . Similarly, if we place a $1$ at position $2$ , then we have to continue with two $0$ 's, and then we are again free to choose either $0$ or $1$ at position $5$ . It follows that the family of sequences of the form $0\{100,010\}^{*}x\{1^{*},0^{*}\}$ (where $x$ is any prefix of $100$ or $010$ ) contains all sequences of largest regret. (We will in fact show that they all have the same regret.)
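The case analysis for $q=\frac{1}{3}$ can be confirmed mechanically. The sketch below (hypothetical names `allowed_bits`, `follows_rule`, `first_mismatch`) applies the comparison of $\frac{O_{t}+1}{Z_{t}+1}$ with $\frac{q}{1-q}$ at every position and checks, for all sequences of length up to $12$ , that the sequences obeying the rule are exactly those matching the regular expression `0(?:100|010)*(?:1|10|0|01)?`, i.e., the head part $0\{100,010\}^{*}x$ of the family above (the constant tail is left out).

```python
import re
from fractions import Fraction
from itertools import product

Q = Fraction(1, 3)
THRESHOLD = Q / (1 - Q)  # q/(1-q) = 1/2


def allowed_bits(ones: int, zeros: int) -> set:
    """Bits that may come next without enabling a regret-increasing swap."""
    ratio = Fraction(ones + 1, zeros + 1)
    if ratio > THRESHOLD:
        return {0}
    if ratio < THRESHOLD:
        return {1}
    return {0, 1}


def follows_rule(bits) -> bool:
    """True if every bit is allowed given the bits before it."""
    ones = zeros = 0
    for bit in bits:
        if bit not in allowed_bits(ones, zeros):
            return False
        ones += bit
        zeros += 1 - bit
    return True


# A leading 0, then blocks 100 or 010, then a proper prefix x of 100 or 010.
PATTERN = re.compile(r"0(?:100|010)*(?:1|10|0|01)?")


def first_mismatch(max_len: int = 12):
    """Return a string on which the rule and the pattern disagree, or None."""
    for n in range(1, max_len + 1):
        for bits in product((0, 1), repeat=n):
            text = "".join(map(str, bits))
            if follows_rule(bits) != (PATTERN.fullmatch(text) is not None):
                return text
    return None


print(first_mismatch())
```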

To gain some deeper intuition, assume now that $q$ is a rational number and $\frac{q}{1-q}=\frac{n_{1}}{n_{2}}$ (where $n_{1}$ and $n_{2}$ are coprime), and let us try to construct a sequence whose regret cannot be increased by applying the swapping lemma. Whenever $\frac{O_{t}+1}{Z_{t}+1}=\frac{n_{1}}{n_{2}}$ , we can choose either bit for position $t+1$ . At this point we have $n_{2}(O_{t}+1)=n_{1}(Z_{t}+1)$ , and since $n_{1}$ and $n_{2}$ are coprime, $O_{t}+1$ is a multiple of $n_{1}$ and $Z_{t}+1$ is a multiple of $n_{2}$ . Once we choose, say, $0$ , we are forced to choose a particular sequence in the following $n_{1}+n_{2}-1$ steps, until we again have $n_{2}(O_{t^{\prime}}+1)=n_{1}(Z_{t^{\prime}}+1)$ for $t^{\prime}=t+n_{1}+n_{2}$ . Among these $n_{1}+n_{2}$ bits, $n_{2}$ are zeros and $n_{1}$ are ones, so $Z_{t^{\prime}}=Z_{t}+n_{2}$ and $O_{t^{\prime}}=O_{t}+n_{1}$ .
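The periodic structure for rational $q$ can be illustrated with a short simulation. In this sketch (the function `tie_points` and its `choices` argument are hypothetical), $q=\frac{n_{1}}{n_{1}+n_{2}}$ , ties are resolved by an arbitrary external choice, and all other bits are forced. It records the prefix lengths $t$ at which $\frac{O_{t}+1}{Z_{t}+1}=\frac{n_{1}}{n_{2}}$ ; consecutive tie points should be exactly $n_{1}+n_{2}$ apart, with $n_{1}$ ones and $n_{2}$ zeros in between, and the first tie occurs after $n_{1}+n_{2}-2$ bits.

```python
from fractions import Fraction
from itertools import cycle


def tie_points(n1: int, n2: int, length: int, choices):
    """Build a prefix obeying the swapping rule for q/(1-q) = n1/n2.

    Returns (ties, bits): `ties` lists the prefix lengths t at which
    (O_t + 1)/(Z_t + 1) == n1/n2; at those points the next bit is drawn
    from the iterator `choices`, and every other bit is forced."""
    threshold = Fraction(n1, n2)
    ones = zeros = 0
    ties, bits = [], []
    for t in range(length):
        ratio = Fraction(ones + 1, zeros + 1)
        if ratio == threshold:
            ties.append(t)        # free choice point
            bit = next(choices)
        else:
            bit = 0 if ratio > threshold else 1
        bits.append(bit)
        ones += bit
        zeros += 1 - bit
    return ties, bits


ties, bits = tie_points(2, 3, 40, cycle([0, 1, 1]))  # q = 2/5
print(ties)
print("".join(map(str, bits)))
```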

The structure of this section is similar to the structure of Section 5. First, we characterize the bit sequences of largest regret. Then, we bound the regret of these sequences.

6.1 Worst-case sequences

Consider the following function that maps a bit-sequence to a set of bits

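$H^{q}(\Phi)=\begin{cases}\{0\}, & \text{if }\frac{O(\Phi)+1}{Z(\Phi)+1}>\frac{q}{1-q},\\ \{1\}, & \text{if }\frac{O(\Phi)+1}{Z(\Phi)+1}<\frac{q}{1-q},\\ \{0,1\}, & \text{if }\frac{O(\Phi)+1}{Z(\Phi)+1}=\frac{q}{1-q},\end{cases}$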

where $O(\Phi)$ is the total number of $1$ s in $\Phi$ and $Z(\Phi)$ is the total number of $0$ s in $\Phi$ .

For every sequence $\Gamma=(\gamma_{1},\ldots,\gamma_{T})\in\{0,1\}^{T}$ we define $p(\Gamma)$ to be the largest index $t$ s.t. $\forall i\in[t]:\gamma_{i}\in H^{q}(\Gamma_{1:i-1})$ , where $\Gamma_{1:n}=(\gamma_{1},\ldots,\gamma_{n})$ . We call a bit sequence $\Gamma=(\gamma_{1},\ldots,\gamma_{T})$ a worst-case sequence if $\gamma_{p(\Gamma)+1}=\ldots=\gamma_{T}$ . We define the subsequence $(\gamma_{1},\ldots,\gamma_{p(\Gamma)})$ as the *head* of $\Gamma$ and denote it $head(\Gamma)$ , and the subsequence $(\gamma_{p(\Gamma)+1},\ldots,\gamma_{T})$ as the *tail* of $\Gamma$ and denote it $tail(\Gamma)$ .

We start by characterizing the tail of a worst-case sequence.

Theorem 11.

Let $\Gamma$ be a worst-case sequence of length $T$ . If $Z_{T}\leq(1-q)T-q$ , then $tail(\Gamma)$ consists only of ones. Otherwise, $tail(\Gamma)$ consists only of zeros.
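The definitions of $H^{q}$ , $p(\Gamma)$ , head, tail and worst-case sequence translate directly into code, which also allows an exhaustive check of Theorem 11 on short sequences. In this sketch $0\in H^{q}(\Phi)$ iff $\frac{O(\Phi)+1}{Z(\Phi)+1}\geq\frac{q}{1-q}$ and $1\in H^{q}(\Phi)$ iff the ratio is $\leq\frac{q}{1-q}$ (as used in the proof of Lemma 28); the function names `H`, `p_index`, `head_tail`, `is_worst_case` and `theorem11_counterexample` are hypothetical.

```python
from fractions import Fraction
from itertools import product


def H(prefix, q: Fraction) -> set:
    """H^q(prefix): the set of bits permitted after `prefix`."""
    ones = sum(prefix)
    zeros = len(prefix) - ones
    ratio, threshold = Fraction(ones + 1, zeros + 1), q / (1 - q)
    if ratio > threshold:
        return {0}
    if ratio < threshold:
        return {1}
    return {0, 1}


def p_index(seq, q: Fraction) -> int:
    """p(Gamma): the largest t such that gamma_i is in H^q(Gamma_{1:i-1}) for all i <= t."""
    for i, bit in enumerate(seq):
        if bit not in H(seq[:i], q):
            return i
    return len(seq)


def head_tail(seq, q: Fraction):
    p = p_index(seq, q)
    return seq[:p], seq[p:]


def is_worst_case(seq, q: Fraction) -> bool:
    """A worst-case sequence has a constant (possibly empty) tail."""
    return len(set(head_tail(seq, q)[1])) <= 1


def theorem11_counterexample(T: int, q: Fraction):
    """Search for a worst-case sequence whose nonempty tail contradicts Theorem 11."""
    for seq in product((0, 1), repeat=T):
        _, tail = head_tail(seq, q)
        if tail and is_worst_case(seq, q):
            expected = 1 if seq.count(0) <= (1 - q) * T - q else 0
            if tail[0] != expected:
                return seq
    return None


q = Fraction(1, 3)
print(head_tail((0, 0, 1, 1), q), is_worst_case((0, 0, 0, 1), q))
```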

6.2 Worst-case regret

In this subsection we prove that all the worst-case sequences have the largest regret and prove an upper bound on this regret.

Theorem 12.

Let $\Gamma\in\{0,1\}^{T}$ , s.t. $\Gamma$ is not a worst-case sequence. Then, there exists $t\in[T]$ such that $Regret^{q}_{TS(q)}(\Gamma)<Regret^{q}_{TS(q)}(Swap(\Gamma,t))$ .

Proof.

Let $i=p(\Gamma)+1$ . Since $\Gamma$ is not a worst-case sequence, $tail(\Gamma)$ contains both $0$ 's and $1$ 's, so there is an index $j>i$ such that $\gamma_{j}\neq\gamma_{i}$ . Let $j$ be the smallest index with this property.

Case 1 Assume $\gamma_{i}=0$ and $\gamma_{j}=1$ . Since $\gamma_{i}\notin H^{q}(\Gamma_{1:i-1})$ , we have $\frac{O_{i-1}(\Gamma)+1}{Z_{i-1}(\Gamma)+1}<\frac{q}{1-q}$ . By the definition of $j$ , $\gamma_{i}=\gamma_{i+1}=\ldots=\gamma_{j-1}=0$ , and thus $\frac{O_{j-2}(\Gamma)+1}{Z_{j-2}(\Gamma)+1}\leq\frac{O_{i-1}(\Gamma)+1}{Z_{i-1}(\Gamma)+1}<\frac{q}{1-q}$ . By Lemma 4, the sequence $Swap(\Gamma,j-1)$ has larger regret than $\Gamma$ .

Case 2 Assume $\gamma_{i}=1$ and $\gamma_{j}=0$ . Since $\gamma_{i}\notin H^{q}(\Gamma_{1:i-1})$ , we have $\frac{O_{i-1}(\Gamma)+1}{Z_{i-1}(\Gamma)+1}>\frac{q}{1-q}$ . By the definition of $j$ , $\gamma_{i}=\gamma_{i+1}=\ldots=\gamma_{j-1}=1$ , and thus $\frac{O_{j-2}(\Gamma)+1}{Z_{j-2}(\Gamma)+1}\geq\frac{O_{i-1}(\Gamma)+1}{Z_{i-1}(\Gamma)+1}>\frac{q}{1-q}$ . By Lemma 4, the sequence $Swap(\Gamma,j-1)$ has larger regret than $\Gamma$ . ∎

Theorem 12 implies that any sequence of largest regret is a worst-case sequence. Next we prove that all worst-case sequences of length $T$ with $k$ zeros have the same regret.

Lemma 13.

All the worst-case sequences of length $T$ with $k$ zeros have the same regret.
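Theorem 12 and Lemma 13 together say that, for each number of zeros $k$ , the maximizers of the regret are exactly the worst-case sequences. The sketch below checks this exhaustively for small $T$ . The loss weights are an assumption of the sketch, since the loss is defined outside this excerpt: a false positive costs $q$ and a false negative costs $1-q$ , a weighting under which the regret difference in the swap lemma vanishes exactly when $\frac{O_{t-1}+1}{Z_{t-1}+1}=\frac{q}{1-q}$ , as stated in Appendix C. All function names are hypothetical.

```python
from fractions import Fraction
from itertools import product
from math import comb


def beta_cdf(a: int, b: int, p: Fraction) -> Fraction:
    """F_{Beta(a,b)}(p) for positive integers a, b, via Fact 18."""
    n = a + b - 1
    return 1 - sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(a))


def regret(seq, q: Fraction) -> Fraction:
    """Exact expected regret of TS(q).
    ASSUMED loss: false positive costs q, false negative costs 1 - q."""
    ones = zeros = 0
    loss = Fraction(0)
    for bit in seq:
        p_zero = beta_cdf(ones + 1, zeros + 1, q)  # Pr[x <= q], x ~ Beta(O+1, Z+1)
        if bit == 0:
            loss += q * (1 - p_zero)
            zeros += 1
        else:
            loss += (1 - q) * p_zero
            ones += 1
    return loss - min(q * zeros, (1 - q) * ones)


def is_worst_case(seq, q: Fraction) -> bool:
    """True if everything after the longest H^q-consistent prefix is constant."""
    threshold = q / (1 - q)
    ones = zeros = 0
    for i, bit in enumerate(seq):
        ratio = Fraction(ones + 1, zeros + 1)
        allowed = {0} if ratio > threshold else {1} if ratio < threshold else {0, 1}
        if bit not in allowed:
            return len(set(seq[i:])) == 1
        ones += bit
        zeros += 1 - bit
    return True


def first_bad_k(T: int, q: Fraction):
    """Return a zero count k whose max-regret set differs from its worst-case set."""
    for k in range(T + 1):
        seqs = [s for s in product((0, 1), repeat=T) if s.count(0) == k]
        regrets = {s: regret(s, q) for s in seqs}
        top = max(regrets.values())
        argmax = {s for s, r in regrets.items() if r == top}
        worst = {s for s in seqs if is_worst_case(s, q)}
        if argmax != worst:
            return k
    return None


print(first_bad_k(8, Fraction(1, 3)))
```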

Let $W_{T}^{k}=(w_{1},\ldots,w_{T})\in\{0,1\}^{T}$ be a worst-case sequence with $k$ zeros such that for all $t\leq p(W_{T}^{k})$ with $\frac{O_{t-1}+1}{Z_{t-1}+1}=\frac{q}{1-q}$ we have $w_{t}=0$ (that is, ties are always broken in favor of $0$ ). Since by Lemma 13 all the worst-case sequences with the same number of zeros have the same regret, we can focus on bounding the regret of $W_{T}^{k}$ .

Theorem 14.

For every $T\in\mathbb{N}^{+}$ , $q\in\left[0,\frac{1}{2}\right]$ and $k$ zeros we have

[TABLE]

The regret bounds for $q\in[\frac{1}{2},1]$ are derived from Theorem 14 using the following lemma.

Lemma 15.

For every bit sequence $\Gamma=(\gamma_{1},\ldots,\gamma_{T})$ define $\bar{\Gamma}=(1-\gamma_{1},\ldots,1-\gamma_{T})$ . Then, $Regret^{q}_{TS(q)}\left(\Gamma\right)=Regret^{1-q}_{TS(1-q)}\left(\bar{\Gamma}\right)$ .

The following theorem gives the worst-case regret bound for a general $q$ .
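Lemma 15 can be confirmed exactly on short sequences. This sketch keeps the same assumption as the other general- $q$ examples (a false positive costs $q$ , a false negative costs $1-q$ ; the loss is defined outside this excerpt) and uses hypothetical helper names. Under that weighting the identity holds term by term, because $F_{\beta(Z+1,O+1)}(1-q)=1-F_{\beta(O+1,Z+1)}(q)$ and the two weights trade places when $q$ is replaced by $1-q$ .

```python
from fractions import Fraction
from itertools import product
from math import comb


def beta_cdf(a: int, b: int, p: Fraction) -> Fraction:
    """F_{Beta(a,b)}(p) for positive integers a, b, via Fact 18."""
    n = a + b - 1
    return 1 - sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(a))


def regret(seq, q: Fraction) -> Fraction:
    """Exact expected regret of TS(q) (ASSUMED weights: FP q, FN 1 - q)."""
    ones = zeros = 0
    loss = Fraction(0)
    for bit in seq:
        p_zero = beta_cdf(ones + 1, zeros + 1, q)
        loss += q * (1 - p_zero) if bit == 0 else (1 - q) * p_zero
        zeros += 1 - bit
        ones += bit
    return loss - min(q * zeros, (1 - q) * ones)


def complement(seq):
    """Gamma-bar: flip every bit."""
    return tuple(1 - bit for bit in seq)


def symmetry_counterexample(T: int, q: Fraction):
    """Search for Gamma with Regret^q(Gamma) != Regret^{1-q}(Gamma-bar)."""
    for seq in product((0, 1), repeat=T):
        if regret(seq, q) != regret(complement(seq), 1 - q):
            return seq
    return None


print(symmetry_counterexample(7, Fraction(1, 3)))
```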

Theorem 16.

For any observation sequence of length $T$ , the regret of $TS(q)$ is $O\left(\sqrt{q(1-q)T}\right)$ .

6.3 Best-case regret bound

We do not characterize the exact best-case regret sequences, but only show that there are sequences with regret at most $1$ . (Finding the best-case sequence characterization for a general trade-off parameter $q$ is harder than in the previous cases. With the tools we presented, it is difficult even to compare the regrets of the bit sequences $10^{k}$ and $0^{k}1$ for $k\in\mathbb{N}$ .)

Theorem 17.

*For every $q\in(0,1)$ and $m,n\in\mathbb{N}$ , if $qm\leq(1-q)n$ , then $Regret^{q}_{TS(q)}(1^{n}0^{m})\leq 1$ ; otherwise, $Regret^{q}_{TS(q)}(0^{m}1^{n})\leq 1$ .*
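Theorem 17 can be checked numerically on a grid of $q$ , $m$ and $n$ . This is a minimal sketch under the same assumed weighting as before (a false positive costs $q$ , a false negative costs $1-q$ ), with hypothetical helper names. Intuitively, on the block of ones (or zeros) placed first, the probability of an error at step $t$ is $q^{t}$ (or $(1-q)^{t}$ ) by Fact 19, so these errors sum to a constant, while the second block costs at most what the static benchmark pays.

```python
from fractions import Fraction
from math import comb


def beta_cdf(a: int, b: int, p: Fraction) -> Fraction:
    """F_{Beta(a,b)}(p) for positive integers a, b, via Fact 18."""
    n = a + b - 1
    return 1 - sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(a))


def regret(seq, q: Fraction) -> Fraction:
    """Exact expected regret of TS(q) (ASSUMED weights: FP q, FN 1 - q)."""
    ones = zeros = 0
    loss = Fraction(0)
    for bit in seq:
        p_zero = beta_cdf(ones + 1, zeros + 1, q)
        loss += q * (1 - p_zero) if bit == 0 else (1 - q) * p_zero
        zeros += 1 - bit
        ones += bit
    return loss - min(q * zeros, (1 - q) * ones)


def theorem17_sequence(m: int, n: int, q: Fraction):
    """1^n 0^m if q*m <= (1-q)*n, otherwise 0^m 1^n."""
    if q * m <= (1 - q) * n:
        return (1,) * n + (0,) * m
    return (0,) * m + (1,) * n


qs = [Fraction(1, 5), Fraction(1, 3), Fraction(1, 2), Fraction(2, 3), Fraction(4, 5)]
largest = max(regret(theorem17_sequence(m, n, q), q)
              for q in qs for m in range(8) for n in range(8))
print(float(largest))
```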

Acknowledgments

This work was supported in part by the Yandex Initiative in Machine Learning and by a grant from the Israel Science Foundation (ISF).

Appendix A Beta and Binomial concentration bounds

The following identities are well known (see, for example, [10], Fact 3 and [36], Eq. 8.17.4).

The first relates the CDFs of the Beta and the Binomial distributions. The second is a property of the Beta distribution.

Fact 18.

For $a,b\in\mathbb{N^{+}}$ and $p\in[0,1]$ we have $F_{\beta(a,b)}(p)=1-F_{Bin(a+b-1,p)}(a-1)$ .

Fact 19.

For $a\in\mathbb{N^{+}}$ and $p\in[0,1]$ we have $F_{\beta(a,1)}(p)=p^{a}$ .
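Facts 18 and 19 are easy to confirm exactly for small integer parameters. In this sketch (hypothetical function names), the Beta CDF is computed independently by expanding $(1-x)^{b-1}$ binomially, integrating $x^{a-1}(1-x)^{b-1}$ term by term, and dividing by $B(a,b)=\frac{(a-1)!(b-1)!}{(a+b-1)!}$ ; the result is compared with the Binomial expression of Fact 18, and $F_{\beta(a,1)}(p)$ is compared with $p^{a}$ .

```python
from fractions import Fraction
from math import comb, factorial


def beta_cdf_integral(a: int, b: int, p: Fraction) -> Fraction:
    """Exact F_{Beta(a,b)}(p): integrate x^(a-1) (1-x)^(b-1) term by term
    after expanding (1-x)^(b-1), then divide by B(a, b)."""
    integral = sum(Fraction(comb(b - 1, j) * (-1) ** j, a + j) * p ** (a + j)
                   for j in range(b))
    beta_fn = Fraction(factorial(a - 1) * factorial(b - 1), factorial(a + b - 1))
    return integral / beta_fn


def beta_cdf_fact18(a: int, b: int, p: Fraction) -> Fraction:
    """Right-hand side of Fact 18: 1 - F_{Bin(a+b-1, p)}(a-1)."""
    n = a + b - 1
    return 1 - sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(a))


points = [Fraction(0), Fraction(1, 7), Fraction(1, 3), Fraction(1, 2), Fraction(5, 6), Fraction(1)]
print(all(beta_cdf_integral(a, b, p) == beta_cdf_fact18(a, b, p)
          for a in range(1, 8) for b in range(1, 8) for p in points))
```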

Next, we present concentration bounds and inequalities that we need for our proofs.

Fact 20.

(Gaussian Half CDF)

Let $\sigma\in\mathbb{R^{+}}$ . Then $\frac{1}{\sqrt{2\pi\sigma^{2}}}\int\limits_{0}^{\infty}e^{-\frac{x^{2}}{2\sigma^{2}}}dx=\frac{1}{2}$ .

Fact 21.

(Multiplicative Chernoff bound) [37]

Let $X_{1},...,X_{n}$ be random variables with values in $\{0,1\}$ such that $\operatorname{E}[X_{t}|X_{1},...,X_{t-1}]=\mu$ . Let $S_{n}=\sum\limits_{i=1}^{n}X_{i}$ .

1. For $0\leq a\leq 1$ , $\Pr\left(S_{n}\geq(1+a)n\mu\right)\leq e^{-\frac{a^{2}n\mu}{3}}$ .

2. For $a\geq 1$ , $\Pr\left(S_{n}\geq(1+a)n\mu\right)\leq e^{-\frac{an\mu}{3}}$ .

Fact 22.

(Chernoff-Hoeffding) [38]

Let $X_{1},...,X_{n}$ be random variables with common range $[0,1]$ such that $\operatorname{E}[X_{t}|X_{1},...,X_{t-1}]=\mu$ . Let $S_{n}=\sum\limits_{i=1}^{n}X_{i}$ .

1. For all $a\geq 0$ , $\Pr\left(\lvert S_{n}-n\mu\rvert\geq a\right)\leq 2e^{-\frac{2a^{2}}{n}}$ .

2. For $\mu\geq\frac{1}{2}$ and $a\geq 0$ , $\Pr\left(S_{n}>n\mu+a\right)\leq e^{-\frac{a^{2}}{2n\mu(1-\mu)}}$ .

Appendix B Proof of bounds on sums of Beta CDFs (Theorems 3 and 25)

We present two bounds for sums of Beta CDFs. In the first subsection we prove a simple version of our bound, which appears as Theorem 3. In the second subsection we extend the result to a general $q\in(0,1)$ .

B.1 Proof of Theorem 3

The proof is divided into two parts. First we prove a bound on a series of exponentials, and then we use the Chernoff-Hoeffding bound to show that this series is an upper bound on the sum of Beta CDFs that appears in Theorem 3.

Lemma 23.

For every $n\geq 1$ , $\sum\limits_{i=n+1}^{\infty}e^{-\frac{(i-(n+1))^{2}}{2(i+n+1)}}=\Theta(\sqrt{n})$ .

Proof.

Let $j=i-(n+1)$ , then

[TABLE]

We bound the magnitude of the exponents, $\frac{j^{2}}{2(j+2(n+1))}$ , from above and below. For the upper bound we use the fact that $j\geq 0$ , and for the lower bound we consider two cases: (a) $j>2(n+1)$ and (b) $2(n+1)\geq j\geq 0$ . We have,

[TABLE]

We bound the sum (2) from below using Fact 20, where $\sigma^{2}=2(n+1)$ , as follows

[TABLE]

For upper bounding Eq. (2) we have,

[TABLE]

The first sum of the right side of Eq. (3) is bounded, by using Fact 20 with $\sigma^{2}=4(n+1)$ , as follows

[TABLE]

The second sum of the right hand side of Eq. (3) is an exponential sum and bounded as follows,

[TABLE]

By combining the previous inequalities and Eq. (3) we get $\sum\limits_{i=n+1}^{\infty}e^{-\frac{(i-(n+1))^{2}}{2(i+n+1)}}=\Theta(\sqrt{n})$ . ∎
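The two sides of Lemma 23 can be made explicit and checked numerically. With $j=i-(n+1)$ , each summand is at least $e^{-\frac{j^{2}}{4(n+1)}}$ , which by Fact 20 with $\sigma^{2}=2(n+1)$ gives the lower bound $\sqrt{\pi(n+1)}$ ; for $j\leq 2(n+1)$ each summand is at most $e^{-\frac{j^{2}}{8(n+1)}}$ (Fact 20 with $\sigma^{2}=4(n+1)$ ), and for $j>2(n+1)$ it is below $e^{-j/4}$ . The sketch below (hypothetical names; the infinite sum is truncated where the dropped tail is below roughly $e^{-100}$ ) evaluates the sum against these bounds.

```python
import math


def lemma23_sum(n: int, extra: int = 400) -> float:
    """Truncated sum_{i >= n+1} exp(-(i-(n+1))^2 / (2(i+n+1))).

    With j = i - (n+1) the summand is exp(-j^2 / (2(j + 2(n+1)))); for
    j > 2(n+1) it is below exp(-j/4), so the dropped tail is negligible."""
    return sum(math.exp(-j * j / (2 * (j + 2 * (n + 1))))
               for j in range(2 * (n + 1) + extra))


def lemma23_bounds(n: int):
    """Explicit lower and upper bounds following the case analysis above."""
    lower = math.sqrt(math.pi * (n + 1))                    # Fact 20, sigma^2 = 2(n+1)
    geometric = math.exp(-0.25) / (1 - math.exp(-0.25))     # sum_{j >= 1} e^{-j/4}
    upper = 1 + math.sqrt(2 * math.pi * (n + 1)) + geometric  # sigma^2 = 4(n+1), plus tail
    return lower, upper


for n in (1, 10, 100, 1000, 10_000):
    lo, hi = lemma23_bounds(n)
    s = lemma23_sum(n)
    print(n, round(lo, 2), round(s, 2), round(hi, 2), round(s / math.sqrt(n), 3))
```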

Theorem 3 (restated).

Proof.

Using Fact 18

[TABLE]

Note that $\frac{i-(n+1)}{2}\geq 0$ when $i\geq n+1$ ; therefore we can use the Chernoff-Hoeffding bound (Fact 22.1) to obtain

[TABLE]

where the last equality follows from Lemma 23. ∎

B.2 Proof of Theorem 25

This subsection generalizes the proof of Theorem 3 presented in Appendix B.1. As in Appendix B.1, we divide the proof of the generalized theorem into two parts.

Lemma 24.

For every $n\in\mathbb{N}^{+}$ , $a>0$ and $p\in(0,1)$ we have

1. $\sum\limits_{i=\left\lceil\frac{p}{1-p}(n+1)\right\rceil+1}^{\infty}e^{-\frac{((1-p)i-p(n+1))^{2}}{a(i+n+1)}}\leq\frac{\sqrt{\pi a(n+1)}}{\sqrt{2}(1-p)^{3/2}}+\frac{2a}{(1-p)^{2}}e^{-\frac{1-p}{2a}(n+1)}$ , and

2. $\sum\limits_{i=\left\lfloor\frac{2p(n+1)}{1-2p}\right\rfloor+1}^{\infty}e^{-\frac{(1-p)i-p(n+1)}{a}}\leq 1+\frac{a}{1-p}e^{-\frac{p(n+1)}{a(1-2p)}}$ .

Proof.

1. We bound the sum as follows

[TABLE]

Using a substitution of $y=(1-p)x-p(n+1)$ ,

[TABLE]

We bound the exponent from below by considering two cases $y>n+1$ and $n+1\geq y\geq 0$ . We have,

[TABLE]

Hence, we have

[TABLE]

We bound the first integral of Eq. (5) using Fact 20, where $\sigma^{2}=\frac{a(n+1)}{1-p}$ , as follows

[TABLE]

The second integral in Eq. (5) equals

[TABLE]

Combining Eq. (4 - 7) we have

[TABLE]

2. We bound the sum as follows

[TABLE]

Using a substitution of $y=(1-p)x-p(n+1)$ ,

[TABLE]

∎

Theorem 25.

For every $n\geq 1$ and $p\in(0,1)$ we have

[TABLE]

Proof.

Using Fact 18

[TABLE]

Let $N_{i}=i+n+1$ and $r_{i}=(1-p)i-p(n+1)$ . We have $i=pN_{i}+r_{i}$ and therefore, we rewrite Eq. (8) as

[TABLE]

1. First, we focus on the case of $p\leq\frac{1}{2}$ .

Consider $\frac{r_{i}}{pN_{i}}$ and notice that $1>\frac{r_{i}}{pN_{i}}\geq 0$ when $1>\frac{(1-p)i-p(n+1)}{p(i+n+1)}\geq 0$ , which is equivalent to $\frac{2p}{1-2p}(n+1)>i\geq\frac{p}{1-p}(n+1)$ . Also, we note that $\operatorname{E}_{X_{j}\sim Ber(p)}[\sum\limits_{j=1}^{N_{i}}X_{j}]=pN_{i}$ . Using Chernoff bound (Fact 21.1) and Lemma 24.1, with $a=3p$ , we have

[TABLE]

When $i>\frac{2p}{1-2p}(n+1)$ we use the second form of Chernoff bound (Fact 21.2), followed by Lemma 24.2, with $a=3$ , to have

[TABLE]

When $\frac{p}{1-p}(n+1)>i$ , we use a worst-case bound to get

[TABLE]

By substituting Eq. (10-12) in Eq. (9) we have

[TABLE]

Since $p\leq\frac{1}{2}$ , we have $\frac{1}{2}\leq 1-p$ , thus

[TABLE]

2. Now, consider $p\geq\frac{1}{2}$ . Assume $i\geq\frac{p}{1-p}(n+1)$ and therefore $r_{i}=(1-p)i-p(n+1)\geq pn+p-pn-p=0$ . Using Hoeffding bound (Fact 22.2) we get that

[TABLE]

Thus, by using Lemma 24.1, with $a=2p(1-p)$ , we have

[TABLE]

For $i\leq\frac{p}{1-p}(n+1)$ , we use a worst-case bound to get

[TABLE]

By substituting Eq. (B.2, 14) into Eq. (9) and using Lemma 24.1 with $a=2p(1-p)$ , we get

[TABLE]

∎

Appendix C Proof of the Swapping Lemma (Lemma 4)

We start with the following preliminary lemma that states the probability of an error for $TS(q)$ given a history.

Lemma 26.

Fix a bit sequence $\Gamma=(\gamma_{1},\ldots,\gamma_{T})\in\{0,1\}^{T}$ . For any $t\in[T]$ we have,

[TABLE]

Proof.

At time $t$ , algorithm $TS(q)$ samples $x_{t}\sim\beta\left(O_{t-1}+1,Z_{t-1}+1\right)$ , and predicts $\hat{\gamma}_{t}=1$ if $x_{t}>q$ and $\hat{\gamma}_{t}=0$ if $x_{t}\leq q$ . Thus, for the case of $\gamma_{t}=0$ ,

[TABLE]

and for the case of $\gamma_{t}=1$ ,

[TABLE]

∎

Now we can prove the Swapping Lemma, which compares the regret of two sequences that differ by a single swap operation.

Lemma 4 (restated).

Proof.

We consider the difference between the regret of $TS(q)$ on the bit sequence $\Gamma$ and on the bit sequence $Swap(\Gamma,t)$ . The two bit sequences differ only at locations $t$ and $t+1$ . Since the benchmark of a sequence depends only on the total numbers of zeros and ones in the sequence, the benchmarks on $\Gamma$ and $Swap(\Gamma,t)$ are identical, i.e., $static^{q}(\Gamma)=static^{q}(Swap(\Gamma,t))$ . Therefore, the difference between the regrets equals the difference between the losses at times $t$ and $t+1$ .

Consider time $t\in\left[T\right]$ such that $\gamma_{t}=0$ and $\gamma_{t+1}=1$ . We have,

[TABLE]

where we used Lemma 26 for the equality before last.

By Fact 2, we have the following recurrence relations:

[TABLE]

where $B(a,b)$ is the Beta function. Therefore,

[TABLE]

We now analyse the $sign$ of the terms in Eq. (15). Since $\frac{q^{O_{t-1}+1}(1-q)^{Z_{t-1}+1}}{B\left(O_{t-1}+1,Z_{t-1}+1\right)}>0$ ,

[TABLE]

Thus,

[TABLE]

and equality holds iff $\frac{q}{1-q}=\frac{O_{t-1}+1}{Z_{t-1}+1}$ .

The second case, where $\gamma_{t}=1$ and $\gamma_{t+1}=0$ , is similar. ∎
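The computation in this proof can be replayed exactly. With $a=O_{t-1}+1$ and $b=Z_{t-1}+1$ , and assuming weights $q$ for a false positive and $1-q$ for a false negative (the loss is defined outside this excerpt, but this weighting is consistent with the equality condition above), the loss on the pair $(0,1)$ minus the loss on $(1,0)$ is $qF_{\beta(a+1,b)}(q)+(1-q)F_{\beta(a,b+1)}(q)-F_{\beta(a,b)}(q)$ , and the Beta recurrences reduce it to $\frac{q^{a}(1-q)^{b}}{B(a,b)}\left(\frac{1-q}{b}-\frac{q}{a}\right)$ . The sketch below (hypothetical names) checks this closed form and its sign against $\frac{a}{b}-\frac{q}{1-q}$ .

```python
from fractions import Fraction
from math import comb, factorial


def beta_cdf(a: int, b: int, p: Fraction) -> Fraction:
    """F_{Beta(a,b)}(p) for positive integers a, b, via Fact 18."""
    n = a + b - 1
    return 1 - sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(a))


def beta_fn(a: int, b: int) -> Fraction:
    """B(a, b) = (a-1)! (b-1)! / (a+b-1)! for positive integers."""
    return Fraction(factorial(a - 1) * factorial(b - 1), factorial(a + b - 1))


def swap_difference(a: int, b: int, q: Fraction) -> Fraction:
    """Loss of TS(q) on bits (0, 1) at steps t, t+1 minus its loss on (1, 0),
    where a = O_{t-1} + 1, b = Z_{t-1} + 1 (ASSUMED weights: FP q, FN 1 - q)."""
    loss_01 = q * (1 - beta_cdf(a, b, q)) + (1 - q) * beta_cdf(a, b + 1, q)
    loss_10 = (1 - q) * beta_cdf(a, b, q) + q * (1 - beta_cdf(a + 1, b, q))
    return loss_01 - loss_10


def closed_form(a: int, b: int, q: Fraction) -> Fraction:
    """q^a (1-q)^b / B(a, b) * ((1-q)/b - q/a), from the Beta recurrences."""
    return q**a * (1 - q) ** b / beta_fn(a, b) * ((1 - q) / b - q / a)


q = Fraction(1, 3)
for a, b in [(1, 2), (2, 2), (2, 4), (3, 2)]:
    print((a, b), swap_difference(a, b, q), closed_form(a, b, q))
```

A positive difference means the order $(0,1)$ is costlier, so swapping $(1,0)$ into $(0,1)$ increases the regret exactly when $\frac{a}{b}>\frac{q}{1-q}$ .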

Appendix D Worst-case regret proofs for $q=\frac{1}{2}$ (Section 5.1)

Consider bit sequences $\Gamma=(\gamma_{1},\ldots,\gamma_{T})$ with $k\leq\frac{T}{2}$ zeros. We first show that among these bit sequences the ones of largest regret are of the form $\{01,10\}^{k}1^{T-2k}$ . Then, we prove that the regret of each of these sequences is $\Theta(\sqrt{k})$ .

Theorem 5 (restated).

Given the above theorem, to bound the worst case regret of $TS(\frac{1}{2})$ , we can focus on the sequence $W_{T}^{k}=\{01\}^{k}1^{T-2k}$ and bound $Regret^{1/2}_{TS(\frac{1}{2})}(W_{T}^{k})$ .

Theorem 6 (restated).

Proof.

Let $W_{T}^{k}=(w_{1},\ldots,w_{T})$ , where we have: (1) $w_{t}=0$ for $t\in A_{1}=\{2i-1\mid i\in[k]\}$ , (2) $w_{t}=1$ for $t\in A_{2}=\{2i\mid i\in[k]\}$ , and (3) $w_{t}=1$ for $t\in A_{3}=\{i\mid i\geq 2k+1\}$ . We bound the expected number of errors made by $TS(\frac{1}{2})$ on each of these three subsets. Then, from these bounds we derive a bound on the loss and the regret.

The expected number of false positive errors in $A_{1}$ : Note that the only errors at times $t\in A_{1}$ are false positive since $w_{t}=0$ for these $t$ ’s. For $t\in A_{1}$ we have that $t=2i-1$ , and $O_{t-1}=Z_{t-1}=i-1$ . Hence the algorithm $TS(\frac{1}{2})$ predicts $\hat{\gamma}_{t}=0$ and $\hat{\gamma}_{t}=1$ each with probability of $\frac{1}{2}$ and

[TABLE]

When we sum over $t\in A_{1}$ , we have

[TABLE]

The expected number of false negative errors in $A_{2}$ : Note that the only errors at times $t\in A_{2}$ are false negatives since $w_{t}=1$ . For $t\in A_{2}$ we have $t=2i$ , and $O_{t-1}=i-1$ and $Z_{t-1}=i$ . By Lemma 26 and Fact 18 we have

[TABLE]

We can bound $F_{Bin(2i,\frac{1}{2})}(i-1)$ using Fact 29, in the following way

[TABLE]

Summing over $t\in A_{2}$ we have,

[TABLE]

The expected number of false negative errors in $A_{3}$ : Note that the only errors at times $t\in A_{3}$ are false negatives since $w_{t}=1$ for these $t$ 's. For any $t\in A_{3}$ we have $Z_{t}=k$ . Therefore,

[TABLE]

From Theorem 3 we have

[TABLE]

Summing up the errors over $A_{1}$ , $A_{2}$ , and $A_{3}$ we get that the total number of errors is

[TABLE]

Recall that the regret is the total loss minus the best static bit prediction. Since we assume that $k\leq\frac{T}{2}$ it is equal to

[TABLE]

∎

Appendix E Best-case regret proofs for $q=\frac{1}{2}$ (Section 5.2)

We show that for $k\leq\frac{T}{2}$ , the lowest regret is for the bit sequence $B_{T}^{k}=1^{T-k}0^{k}$ . Then, we prove that its regret is $O(1)$ for any $k\leq\frac{T}{2}$ .

Lemma 27.

For any $\Phi\in\{0,1\}^{T-2m}$ , $Regret^{1/2}_{TS\left(\frac{1}{2}\right)}(0^{m}1^{m}\Phi)=Regret^{1/2}_{TS\left(\frac{1}{2}\right)}(1^{m}0^{m}\Phi)$ .

Proof.

Let $\Gamma^{1}=(\gamma_{1}^{1},\ldots,\gamma_{T}^{1})=(0^{m}1^{m},\Phi)$ and $\Gamma^{2}=(\gamma_{1}^{2},\ldots,\gamma_{T}^{2})=(1^{m}0^{m},\Phi)$ . We show, using Lemma 26, that for each $t\in[T]$ , we have $\operatorname{E}[\mathbb{I}\{\hat{\gamma}_{t}=\gamma_{t}^{1}\}\mid\Gamma^{1}]=\operatorname{E}[\mathbb{I}\{\hat{\gamma}_{t}=\gamma_{t}^{2}\}\mid\Gamma^{2}]$ , which implies that $\Gamma^{1}$ and $\Gamma^{2}$ have the same expected loss. Since static bit prediction also has the same loss on $\Gamma^{1}$ and $\Gamma^{2}$ , the two sequences have the same regret.

For $t\leq m$ , by Fact 1, we have

[TABLE]

For $m<t\leq 2m$ we have,

[TABLE]

For $t>2m$ we have $O_{t}(\Gamma^{1})=O_{t}(\Gamma^{2})$ and $Z_{t}(\Gamma^{1})=Z_{t}(\Gamma^{2})$ and thus $\operatorname{E}[\mathbb{I}\{\hat{\gamma}_{t}=\gamma_{t}^{1}\}\mid\Gamma^{1}]=\operatorname{E}[\mathbb{I}\{\hat{\gamma}_{t}=\gamma_{t}^{2}\}\mid\Gamma^{2}]$ . ∎
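Lemma 27 can be checked exhaustively for short sequences. This minimal sketch (hypothetical names; every error counts as $1$ for $q=\frac{1}{2}$ ) compares the exact regret of $0^{m}1^{m}\Phi$ and $1^{m}0^{m}\Phi$ for every $m$ and every suffix $\Phi$ . The underlying reason is the symmetry $F_{\beta(a,b)}(\frac{1}{2})=1-F_{\beta(b,a)}(\frac{1}{2})$ , which makes the per-step error probabilities of the two prefixes coincide.

```python
from fractions import Fraction
from itertools import product
from math import comb


def beta_cdf(a: int, b: int, p: Fraction) -> Fraction:
    """F_{Beta(a,b)}(p) for positive integers a, b, via Fact 18."""
    n = a + b - 1
    return 1 - sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(a))


def regret_half(seq) -> Fraction:
    """Exact expected regret of TS(1/2), counting each error as 1."""
    half = Fraction(1, 2)
    ones = zeros = 0
    errors = Fraction(0)
    for bit in seq:
        p_zero = beta_cdf(ones + 1, zeros + 1, half)  # Pr[TS(1/2) predicts 0]
        errors += (1 - p_zero) if bit == 0 else p_zero
        zeros += 1 - bit
        ones += bit
    return errors - min(zeros, ones)


def lemma27_counterexample(T: int):
    """Search for (m, Phi) with Regret(0^m 1^m Phi) != Regret(1^m 0^m Phi)."""
    for m in range(1, T // 2 + 1):
        for phi in product((0, 1), repeat=T - 2 * m):
            left = regret_half((0,) * m + (1,) * m + phi)
            right = regret_half((1,) * m + (0,) * m + phi)
            if left != right:
                return m, phi
    return None


print(lemma27_counterexample(8))
```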

From this we can deduce that $B_{T}^{k}$ has the lowest regret for $TS(\frac{1}{2})$ .

Theorem 9 (restated).

Proof.

Let $\Gamma=(\gamma_{1},\ldots,\gamma_{T})\in\{0,1\}^{T}$ be a bit sequence of length $T$ with $k\leq\frac{T}{2}$ zeros such that $\Gamma\neq 1^{T-k}0^{k}$ . We show that there is a bit sequence $\tilde{\Gamma}$ , that has the same regret as $\Gamma$ , and for some $t\in[T]$ the sequence $Swap(\tilde{\Gamma},t)$ has regret smaller than $\tilde{\Gamma}$ .

Since $\Gamma\neq 1^{T-k}0^{k}$ , then either $\Gamma=0^{k}1^{T-k}$ or it has a prefix of the form $0^{m}1^{n}0$ or $1^{n}0^{m}1$ , where $n,m>0$ .

First, we look at the case where $\Gamma=0^{k}1^{T-k}$ . By Lemma 27, the sequence $\tilde{\Gamma}=1^{k}0^{k}1^{T-2k}$ has the same regret as $\Gamma$ and by Lemma 4, the sequence $Swap(\tilde{\Gamma},2k)$ has regret smaller than the regret of $\tilde{\Gamma}$ .

Second, assume $\Gamma$ has a prefix of $0^{m}1^{n}0$ (the case of $1^{n}0^{m}1$ is similar). We have two sub-cases: (a) If $m\geq n$ then $O_{n+m-1}<Z_{n+m-1}$ and $\gamma_{n+m}=1$ , $\gamma_{n+m+1}=0$ . By Lemma 4, the sequence $Swap(\Gamma,n+m)$ has regret lower than $\Gamma$ . (b) If $m<n$ , by Lemma 27, the bit sequences $\Gamma=(0^{m}1^{m}1^{n-m}0,\gamma_{m+n+2},\ldots,\gamma_{T})$ and $\tilde{\Gamma}=(1^{m}0^{m}1^{n-m}0,\gamma_{m+n+2},\ldots,\gamma_{T})$ have the same regret. By Lemma 4, the sequence $Swap(\tilde{\Gamma},2m)$ has regret smaller than the regret of $\tilde{\Gamma}$ .

For $k=\frac{T}{2}$ , by Lemma 27, both $0^{T/2}1^{T/2}$ and $1^{T/2}0^{T/2}$ have the same regret. ∎

We now bound the regret of $B_{T}^{k}=1^{T-k}0^{k}$ .

Theorem 10 (restated).

Proof.

For $t\leq T-k$ we have $b_{t}=1$ . Thus

[TABLE]

Using Fact 19, we have

[TABLE]

This implies that the expected number of false negative errors, in steps $t\leq T-k$ , is

[TABLE]

For $t\geq T-k+1$ we can have at most $k$ errors so

[TABLE]

Therefore, the regret of $TS(\frac{1}{2})$ on $B_{T}^{k}$ is bounded by

[TABLE]

∎

Appendix F Worst-case regret proofs for a general $q$ (Sections 6.1 and 6.2)

Recall $H^{q}$ ,

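$H^{q}(\Phi)=\begin{cases}\{0\}, & \text{if }\frac{O(\Phi)+1}{Z(\Phi)+1}>\frac{q}{1-q},\\ \{1\}, & \text{if }\frac{O(\Phi)+1}{Z(\Phi)+1}<\frac{q}{1-q},\\ \{0,1\}, & \text{if }\frac{O(\Phi)+1}{Z(\Phi)+1}=\frac{q}{1-q},\end{cases}$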

where $O(\Phi)$ is the total number of $1$ s in $\Phi$ and $Z(\Phi)$ is the total number of $0$ s in $\Phi$ . For every sequence $\Gamma=(\gamma_{1},\ldots,\gamma_{T})\in\{0,1\}^{T}$ we define $p(\Gamma)$ to be the largest index $t$ s.t. $\forall i\in[t]:\gamma_{i}\in H^{q}(\Gamma_{1:i-1})$ , where $\Gamma_{1:n}=(\gamma_{1},\ldots,\gamma_{n})$ . We call a bit sequence $\Gamma=(\gamma_{1},\ldots,\gamma_{T})$ a worst-case sequence if $\gamma_{p(\Gamma)+1}=\ldots=\gamma_{T}$ . We define the subsequence $(\gamma_{1},\ldots,\gamma_{p(\Gamma)})$ as the *head* of $\Gamma$ and denote it $head(\Gamma)$ , and the subsequence $(\gamma_{p(\Gamma)+1},\ldots,\gamma_{T})$ as the *tail* of $\Gamma$ and denote it $tail(\Gamma)$ .

We start by bounding the number of $0$ s and $1$ s in the head of a worst-case sequence.

Lemma 28.

Fix a worst-case sequence $\Gamma=(\gamma_{1},\ldots,\gamma_{T})$ and let $t\leq p(\Gamma)$ . Then, if $\gamma_{t}=0$ then $(1-q)t\leq Z_{t}\leq(1-q)t+(1-q)$ and $qt-(1-q)\leq O_{t}\leq qt$ , if $\gamma_{t}=1$ then $(1-q)t-q\leq Z_{t}\leq(1-q)t$ and $qt\leq O_{t}\leq qt+q$ .

Proof.

The proof is by induction on $t$ . For $t=1$ and $q<\frac{1}{2}$ we have that $\frac{q}{1-q}<1$ and therefore $H^{q}$ of an empty sequence equals $\{0\}$ . Thus, as $t\leq p(\Gamma)$ , we must have $\gamma_{1}=0$ . For such a sequence, $(1-q)\leq 1\leq 2(1-q)$ and $2q-1\leq 0\leq q$ .

By the induction hypothesis for both $\gamma_{t-1}=0$ and $\gamma_{t-1}=1$ we have, $(1-q)(t-1)-q\leq Z_{t-1}\leq(1-q)(t-1)+(1-q)$ and $q(t-1)-(1-q)\leq O_{t-1}\leq q(t-1)+q$ .

Case 1 $\gamma_{t}=0$ . Since $t\leq p(\Gamma)$ , we have that $0\in H^{q}(\Gamma_{1:t-1})$ and therefore $\frac{O_{t-1}+1}{Z_{t-1}+1}\geq\frac{q}{1-q}$ . Since $O_{t-1}=O_{t}$ and $Z_{t-1}+1=Z_{t}$ we get that

$\frac{O_{t}+1}{Z_{t}}\geq\frac{q}{1-q}\qquad(17)$

Since $Z_{t}+O_{t}=t$ we can substitute $Z_{t}=t-O_{t}$ in Eq. (17) and get that $O_{t}\geq qt-(1-q)$ . Similarly by substituting $O_{t}=t-Z_{t}$ in Eq. (17) we get that $Z_{t}\leq(1-q)t+(1-q)$ . The upper bound on $O_{t}$ and the lower bound on $Z_{t}$ follow directly from our assumption: $Z_{t}=Z_{t-1}+1\geq(1-q)(t-1)-q+1=(1-q)t$ and $O_{t}=O_{t-1}\leq q(t-1)+q=qt$ .

Case 2 $\gamma_{t}=1$ . Since $t\leq p(\Gamma)$ , we have that $1\in H^{q}(\Gamma_{1:t-1})$ and therefore $\frac{O_{t-1}+1}{Z_{t-1}+1}\leq\frac{q}{1-q}$ . Since $O_{t-1}+1=O_{t}$ and $Z_{t-1}=Z_{t}$ we get that

$\frac{O_{t}}{Z_{t}+1}\leq\frac{q}{1-q}\qquad(18)$

Since $Z_{t}+O_{t}=t$ we can substitute $Z_{t}=t-O_{t}$ in Eq. (18) and get that $O_{t}\leq qt+q$ . Similarly by substituting $O_{t}=t-Z_{t}$ in Eq. (18) we get that $Z_{t}\geq(1-q)t-q$ . The lower bound on $O_{t}$ and the upper bound on $Z_{t}$ follow directly from our assumption: $Z_{t}=Z_{t-1}\leq(1-q)(t-1)+(1-q)=(1-q)t$ and $O_{t}=O_{t-1}+1\geq q(t-1)-(1-q)+1=qt$ . ∎
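The bounds of Lemma 28 can be checked on every head of bounded length. The sketch below (hypothetical names) enumerates all sequences in which each bit lies in $H^{q}$ of its prefix, extending a prefix by $0$ when $\frac{O+1}{Z+1}\geq\frac{q}{1-q}$ and by $1$ when $\frac{O+1}{Z+1}\leq\frac{q}{1-q}$ , and then verifies the inequalities of the lemma at every prefix length $t$ with exact rational arithmetic.

```python
from fractions import Fraction


def valid_heads(q: Fraction, length: int):
    """All sequences of the given length in which every bit lies in H^q
    of the preceding prefix (heads of worst-case sequences)."""
    threshold = q / (1 - q)
    frontier = [((), 0, 0)]  # (sequence, ones, zeros)
    for _ in range(length):
        extended = []
        for seq, ones, zeros in frontier:
            ratio = Fraction(ones + 1, zeros + 1)
            if ratio >= threshold:  # 0 is in H^q
                extended.append((seq + (0,), ones, zeros + 1))
            if ratio <= threshold:  # 1 is in H^q
                extended.append((seq + (1,), ones + 1, zeros))
        frontier = extended
    return [seq for seq, _, _ in frontier]


def lemma28_holds(seq, q: Fraction) -> bool:
    """Check the bounds of Lemma 28 at every prefix length t of `seq`."""
    ones = zeros = 0
    for t, bit in enumerate(seq, start=1):
        ones += bit
        zeros += 1 - bit
        if bit == 0:
            ok = ((1 - q) * t <= zeros <= (1 - q) * t + (1 - q)
                  and q * t - (1 - q) <= ones <= q * t)
        else:
            ok = ((1 - q) * t - q <= zeros <= (1 - q) * t
                  and q * t <= ones <= q * t + q)
        if not ok:
            return False
    return True


q = Fraction(1, 3)
heads = valid_heads(q, 12)
print(len(heads), all(lemma28_holds(h, q) for h in heads))
```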

From Lemma 28 we characterize the tail of a worst-case sequence.

Theorem 11 (restated).

Proof.

Let $j=p(\Gamma)$ .

Consider first the case where $Z_{T}\leq(1-q)T-q$ and assume by contradiction that $tail(\Gamma)$ is not empty and it is filled with zeros. It follows from this assumption that $Z_{T}=Z_{j}+(T-j)$ . By Lemma 28 we have that $Z_{j}\geq(1-q)j-q$ , and by combining this inequality with the equality $Z_{T}=Z_{j}+(T-j)$ we get that $Z_{T}\geq(1-q)j-q+T-j=T-qj-q$ . On the other hand we assumed that $Z_{T}\leq(1-q)T-q$ . So by combining these upper and lower bounds on $Z_{T}$ we get that $(1-q)T-q\geq T-qj-q$ and thus $j\geq T$ . This is a contradiction to the assumption that $tail(\Gamma)$ is not empty.

Consider now the case where $Z_{T}\geq(1-q)T-q+1$ and assume by contradiction that $tail(\Gamma)$ is not empty and it is filled with ones. It follows from this assumption that $O_{T}=O_{j}+(T-j)$ . By Lemma 28 we have that $O_{j}\geq qj-(1-q)$ , and by combining this inequality with the equality $O_{T}=O_{j}+(T-j)$ we get that $O_{T}\geq qj-(1-q)+T-j=T-(1-q)j-(1-q)$ . On the other hand we assumed that $O_{T}=T-Z_{T}\leq qT-(1-q)$ . So by combining these upper and lower bounds on $O_{T}$ we get that $qT-(1-q)\geq T-(1-q)j-(1-q)$ and thus $j\geq T$ . This is a contradiction to the assumption that $tail(\Gamma)$ is not empty. ∎

Now we prove that all the worst-case sequences have the largest regret and bound it.

Theorem 12 (restated).

Proof.

Let $i=p(\Gamma)+1$ . Since $\Gamma$ is not a worst-case sequence, $tail(\Gamma)$ contains both $0$ 's and $1$ 's, so there is an index $j>i$ such that $\gamma_{j}\neq\gamma_{i}$ . Let $j$ be the smallest index with this property.

Case 1 Assume $\gamma_{i}=0$ and $\gamma_{j}=1$ . Since $\gamma_{i}\notin H^{q}(\Gamma_{1:i-1})$ , we have $\frac{O_{i-1}(\Gamma)+1}{Z_{i-1}(\Gamma)+1}<\frac{q}{1-q}$ . By the definition of $j$ , $\gamma_{i}=\gamma_{i+1}=\ldots=\gamma_{j-1}=0$ , and thus $\frac{O_{j-2}(\Gamma)+1}{Z_{j-2}(\Gamma)+1}\leq\frac{O_{i-1}(\Gamma)+1}{Z_{i-1}(\Gamma)+1}<\frac{q}{1-q}$ . By Lemma 4, the sequence $Swap(\Gamma,j-1)$ has larger regret than $\Gamma$ .

Case 2 Assume $\gamma_{i}=1$ and $\gamma_{j}=0$ . Since $\gamma_{i}\notin H^{q}(\Gamma_{1:i-1})$ , we have $\frac{O_{i-1}(\Gamma)+1}{Z_{i-1}(\Gamma)+1}>\frac{q}{1-q}$ . By the definition of $j$ , $\gamma_{i}=\gamma_{i+1}=\ldots=\gamma_{j-1}=1$ , and thus $\frac{O_{j-2}(\Gamma)+1}{Z_{j-2}(\Gamma)+1}\geq\frac{O_{i-1}(\Gamma)+1}{Z_{i-1}(\Gamma)+1}>\frac{q}{1-q}$ . By Lemma 4, the sequence $Swap(\Gamma,j-1)$ has larger regret than $\Gamma$ . ∎

Theorem 12 implies that any sequence of largest regret is a worst-case sequence. Next we prove that all worst-case sequences of length $T$ with $k$ zeros have the same regret.

Lemma 13 (restated).

Proof.

Assume by contradiction that there are two worst-case sequences $\Gamma^{1},\Gamma^{2}$ of length $T$ with $k$ zeros such that $Regret^{q}_{TS(q)}(\Gamma^{1})=r_{1}$ , $Regret^{q}_{TS(q)}(\Gamma^{2})=r_{2}$ and $r_{1}\neq r_{2}$ . We assume further that $\Gamma^{1}$ and $\Gamma^{2}$ have the longest common prefix among all such pairs of worst-case sequences of length $T$ with $k$ zeros and regrets $r_{1}$ and $r_{2}$ , respectively.

Since $\Gamma^{1}$ and $\Gamma^{2}$ both have $k$ zeros, by Theorem 11 their tails are filled with the same bit. It follows that $head(\Gamma^{1})\neq head(\Gamma^{2})$ . Assume without loss of generality that $head(\Gamma^{2})$ is not shorter than $head(\Gamma^{1})$ . We claim that $head(\Gamma^{1})$ is not a prefix of $\Gamma^{2}$ ; otherwise $\Gamma^{1}$ and $\Gamma^{2}$ could not both have $k$ zeros.

It follows that there exists an index $t\leq p(\Gamma^{1})$ such that $\gamma^{1}_{t}\neq\gamma^{2}_{t}$ . Let $t$ be the smallest such index. Since $\Gamma^{1}_{1:t-1}=\Gamma^{2}_{1:t-1}$ and $t\leq p(\Gamma^{1})\leq p(\Gamma^{2})$ , both $\gamma^{1}_{t}$ and $\gamma^{2}_{t}$ belong to $H^{q}(\Gamma^{1}_{1:t-1})$ , so $H^{q}(\Gamma^{1}_{1:t-1})=\{0,1\}$ and therefore $\frac{O_{t-1}(\Gamma^{1})+1}{Z_{t-1}(\Gamma^{1})+1}=\frac{O_{t-1}(\Gamma^{2})+1}{Z_{t-1}(\Gamma^{2})+1}=\frac{q}{1-q}$ . Assume that $\gamma^{1}_{t}=0$ and $\gamma^{2}_{t}=1$ . Since both sequences have $k$ zeros, there is an index $t^{\prime}>t$ such that $\gamma^{1}_{t^{\prime}}=1$ and $\gamma^{2}_{t^{\prime}}=0$ . Since the tails of both sequences are filled with the same bit, this implies that $t^{\prime}\leq p(\Gamma^{2})$ , and therefore, since $t+1\leq t^{\prime}$ , we have $t+1\leq p(\Gamma^{2})$ .

Since $\gamma^{2}_{t}=1$ we have that $\frac{O_{t}(\Gamma^{2})+1}{Z_{t}(\Gamma^{2})+1}>\frac{O_{t-1}(\Gamma^{2})+1}{Z_{t-1}(\Gamma^{2})+1}=\frac{q}{1-q}$ , and since $t+1\leq p(\Gamma^{2})$ we must have that $\gamma^{2}_{t+1}=0$ . By Lemma 4, $Regret^{q}_{TS(q)}(\Gamma^{2})=Regret^{q}_{TS(q)}\left(Swap(\Gamma^{2},t)\right)=r_{2}$ . It is easy to check that $Swap(\Gamma^{2},t)$ is still a worst-case sequence and since it has a longer common prefix with $\Gamma^{1}$ we get a contradiction to the choice of $\Gamma^{1}$ and $\Gamma^{2}$ .

The case where $\gamma^{1}_{t}=1$ and $\gamma^{2}_{t}=0$ is analogous. ∎

Let $W_{T}^{k}=(w_{1},\ldots,w_{T})\in\{0,1\}^{T}$ be a worst-case sequence with $k$ zeros such that for all $t\leq p(W_{T}^{k})$ with $\frac{O_{t-1}+1}{Z_{t-1}+1}=\frac{q}{1-q}$ we have $w_{t}=0$ . Since by Lemma 13 all the worst-case sequences with the same number of zeros have the same regret, we can focus on bounding the regret of $W_{T}^{k}$ .

Theorem 14 (restated).

Proof.

We first consider the case that $k\leq(1-q)T-q$ . We partition the time steps $[T]$ into the following sets: (1) $A_{1}=\{t\mid t\in[p(W_{T}^{k})]\text{ and }w_{t}=0\}$ , (2) $A_{2}=\{t\mid t\in[p(W_{T}^{k})]\text{ and }w_{t}=1\}$ , and (3) $A_{3}=\{t\mid t\geq p(W_{T}^{k})+1\}$ . We bound the expected number of errors made by $TS(q)$ on each of these three subsets. Then, from these bounds we derive a bound on the loss and the regret.

The expected number of false positive errors in $A_{1}$: The only errors at times $t\in A_{1}$ are false positives, since $w_{t}=0$ for these $t$. Therefore, by Lemma 26 and Fact 18, we have

[TABLE]

By the definition of $A_{1}$ , $t\leq p(W_{T}^{k})$ , and therefore by Lemma 28, $(1-q)t\leq Z_{t}$ . Thus, $t\leq\frac{Z_{t}}{1-q}\leq\frac{Z_{t}+1-1+q}{1-q}=\frac{Z_{t}+1}{1-q}-1\leq\left\lfloor\frac{Z_{t}+1}{1-q}\right\rfloor$ . Let $m=\left\lfloor\frac{Z_{t}+1}{1-q}\right\rfloor$ and $X\sim Bin\left(m,1-q\right)$ . We can bound the right side of Eq. (F) as follows.

[TABLE]

We now bound the different probabilities in Eq. (F). Since $X$ is a binomial random variable, its median is $\left\lfloor m(1-q)\right\rfloor=Z_{t}$ or $\left\lceil m(1-q)\right\rceil=Z_{t}+1$, and therefore

[TABLE]
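The median claim can be checked exactly on small cases. The sketch below is illustrative only: the helper names are hypothetical, and it relies on the standard fact that a median of $\mathrm{Bin}(m,p)$ lies between $\lfloor mp\rfloor$ and $\lceil mp\rceil$. It sets $p=1-q$ and $m=\left\lfloor\frac{Z_t+1}{1-q}\right\rfloor$, so that $m(1-q)\in(Z_t+q,Z_t+1]$, and uses exact rational arithmetic to avoid floating-point ties.

```python
from fractions import Fraction
from math import ceil, comb, floor


def binom_cdf(x: int, m: int, p: Fraction) -> Fraction:
    """Exact P(X <= x) for X ~ Bin(m, p)."""
    return sum((comb(m, i) * p**i * (1 - p) ** (m - i) for i in range(x + 1)), Fraction(0))


def smallest_median(m: int, p: Fraction) -> int:
    """Smallest integer x with P(X <= x) >= 1/2; this x is a median of Bin(m, p)."""
    x = 0
    while binom_cdf(x, m, p) < Fraction(1, 2):
        x += 1
    return x


def check_median_claim(z: int, q: Fraction) -> bool:
    """Check the median claim for Z_t = z and a weight q in (0, 1)."""
    p = 1 - q                    # X ~ Bin(m, 1 - q)
    m = floor((z + 1) / p)       # m = floor((Z_t + 1) / (1 - q))
    mean = m * p                 # m(1 - q) lies in (z + q, z + 1]
    assert floor(mean) in (z, z + 1) and ceil(mean) == z + 1
    return smallest_median(m, p) in (z, z + 1)


qs = [Fraction(1, 2), Fraction(2, 5), Fraction(1, 3), Fraction(1, 4)]
print(all(check_median_claim(z, q) for q in qs for z in range(16)))
```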

For any $Z_{t}\geq\frac{2(1-q)}{q}-1$ , we bound $\Pr(X=Z_{t}+1)$ by Lemma 30 as follows

[TABLE]

The probability $\Pr(X=Z_{t})$ is bounded using the previous equality,

[TABLE]

Therefore by using Eq. (22) and (F) we have

[TABLE]

By substituting Eq. (F-22,24) into (F) we get that for $Z_{t}\geq\frac{2(1-q)}{q}-1$

[TABLE]

For $Z_{t}<\frac{2(1-q)}{q}-1$ we use the worst-case bound

[TABLE]

Since $k\leq(1-q)T-q$, by Corollary 11 there are no zeros in the tail, so all the zeros of $W_{T}^{k}$ are in $A_{1}$. We therefore use Eq. (25) and (26) to sum over all $t\in A_{1}$:

[TABLE]

The expected number of false negative errors in $A_{2}$: The only errors at times $t\in A_{2}$ are false negatives, since $w_{t}=1$. Therefore, by Lemma 26 and Fact 18, we have

[TABLE]

By the definition of $A_{2}$ , $t\leq p(W_{T}^{k})$ , and therefore by Lemma 28, $qt\leq O_{t}$ . Thus, $t\leq\frac{O_{t}}{q}\leq\frac{O_{t}+1-q}{q}=\frac{O_{t}+1}{q}-1\leq\left\lfloor\frac{O_{t}+1}{q}\right\rfloor$ . Let $m=\left\lfloor\frac{O_{t}+1}{q}\right\rfloor$ and $X\sim Bin\left(m,q\right)$ . We can continue and bound the right side of Equation (28) as follows.

[TABLE]

The bounds are analogous to those for $A_{1}$: substituting $O_{t}$ for $Z_{t}$ and $q$ for $1-q$ in Eq. (F,F) yields Eq. (28,F). Therefore,

[TABLE]

Since $head(W^{k}_{T})$ contains all the zeros in $W^{k}_{T}$, we have $Z_{p(W^{k}_{T})}=k$. By Lemma 28, $(1-q)p(W^{k}_{T})-q\leq Z_{p(W^{k}_{T})}$, and thus $p(W^{k}_{T})\leq\frac{k+q}{1-q}$. Therefore, $O_{p(W^{k}_{T})}=p(W^{k}_{T})-Z_{p(W^{k}_{T})}\leq\frac{k+q}{1-q}-k=\frac{q}{1-q}k+\frac{q}{1-q}\leq\frac{q}{1-q}k+1$, where the last inequality uses $q\leq\frac{1}{2}$, so that $\frac{q}{1-q}\leq1$.

Let $n=\left\lceil\frac{q}{1-q}k\right\rceil+1$. By Eq. (30), summing over all $t\in A_{2}$ gives

[TABLE]

where the second-to-last inequality follows by substituting $n=\left\lceil\frac{q}{1-q}k\right\rceil+1\leq\frac{q}{1-q}k+2$.
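The two elementary bounds used for $A_2$, namely $\frac{k+q}{1-q}-k=\frac{q}{1-q}k+\frac{q}{1-q}\le\frac{q}{1-q}k+1$ and $\left\lceil\frac{q}{1-q}k\right\rceil+1\le\frac{q}{1-q}k+2$, can be spot-checked with exact fractions. This is a minimal sketch with a hypothetical helper name; it checks instances, not the general claim.

```python
from fractions import Fraction
from math import ceil


def check_a2_bookkeeping(k: int, q: Fraction) -> bool:
    """Exact check of the two elementary bounds used for A_2."""
    ratio = q / (1 - q)                   # q / (1 - q), at most 1 when q <= 1/2
    p_upper = (k + q) / (1 - q)           # upper bound on p(W_T^k) from Lemma 28
    o_upper = p_upper - k                 # resulting bound on O_{p(W_T^k)}
    identity_ok = o_upper == ratio * k + ratio
    bound_ok = o_upper <= ratio * k + 1   # needs ratio <= 1, i.e. q <= 1/2
    n = ceil(ratio * k) + 1               # n = ceil(q k / (1 - q)) + 1
    n_ok = n <= ratio * k + 2
    return identity_ok and bound_ok and n_ok


qs = [Fraction(1, 2), Fraction(2, 5), Fraction(1, 3), Fraction(1, 10)]
print(all(check_a2_bookkeeping(k, q) for q in qs for k in range(200)))
```

With $q=\frac34$ the first bound already fails at $k=1$ (the slack $\frac{q}{1-q}$ equals 3), which is why the assumption $q\le\frac12$ matters in this step.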

The expected number of false negative errors in $A_{3}$: By Corollary 11, the only errors at times $t\in A_{3}$ are false negatives, since $w_{t}=1$. For any $t\in A_{3}$ we have $Z_{t}=k$. Therefore,

[TABLE]

From Lemma 28, $(1-q)p(W^{k}_{T})+(1-q)\geq Z_{p(W^{k}_{T})}=k$ and thus $p(W^{k}_{T})\geq\frac{k}{1-q}-1$ . From Theorem 25 we have

[TABLE]

where the inequality follows from $t-k=\left\lfloor\frac{k}{1-q}\right\rfloor-1-k\geq\frac{k}{1-q}-k-2=\frac{qk}{1-q}-2=i$ .
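The floor step above, $\left\lfloor\frac{k}{1-q}\right\rfloor-1-k\ge\frac{qk}{1-q}-2$, loses at most one unit, since $x-1<\lfloor x\rfloor\le x$. A short exact check, with a hypothetical helper name, confirms that the slack $(t-k)-i$ lies in $(0,1]$ for sampled $k$ and $q$.

```python
from fractions import Fraction
from math import floor


def check_tail_offset(k: int, q: Fraction) -> bool:
    """Check floor(k/(1-q)) - 1 - k >= q k/(1-q) - 2, with slack in (0, 1]."""
    t = floor(k / (1 - q)) - 1        # t = floor(k / (1 - q)) - 1
    i = q * k / (1 - q) - 2           # i = q k / (1 - q) - 2
    slack = (t - k) - i               # equals floor(x) - x + 1 for x = k / (1 - q)
    return 0 < slack <= 1


qs = [Fraction(1, 2), Fraction(2, 5), Fraction(1, 3), Fraction(1, 7)]
print(all(check_tail_offset(k, q) for q in qs for k in range(300)))
```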

Since $k\leq(1-q)T-q$, the loss of the best static bit predictor is

[TABLE]

Combining Eq. (F), (31), and (F), the regret, that is, the total loss minus the loss of the best static bit predictor, is

[TABLE]

We now bound the regret for $k>(1-q)T-q$. As before, we split the time steps into $A_{1}$, $A_{2}$, and $A_{3}$.

The expected number of false positive errors in $A_{1}$: At each $t\in A_{1}$ the expected number of errors is bounded as in the previous case; only the size of $A_{1}$ changes. Since $k>(1-q)T-q$, by Corollary 11 all the ones of $W_{T}^{k}$ are in $head(W_{T}^{k})$. By the definition of $A_{1}$, $t\leq p(W_{T}^{k})$, and therefore, by Lemma 28, $qt-(1-q)\leq O_{p(W_{T}^{k})}=T-k$, and thus $t\leq\frac{T-k+1}{q}$. From Lemma 28 we also conclude that $Z_{p(W_{T}^{k})}\leq(1-q)p(W_{T}^{k})+(1-q)\leq(1-q)\frac{T-k+1}{q}+(1-q)$. Hence $|A_{1}|\leq(1-q)\frac{T-k+1}{q}+(1-q)$, and the expected number of errors in $A_{1}$ is bounded by $\frac{(1-q)(T-k)}{2q}+O\left(\frac{\sqrt{(1-q)(T-k)}}{q}+\frac{1-q}{q}\right)$.

The expected number of false negative errors in $A_{2}$: At each $t\in A_{2}$ the expected number of errors is bounded as in the previous case; only the size of $A_{2}$ changes. It equals $T-k$, since by Corollary 11 all the ones of $W^{k}_{T}$ are in $head(W^{k}_{T})$. Thus the expected number of errors in $A_{2}$ is bounded by $\frac{T-k}{2}+O\left(\sqrt{\frac{T-k}{1-q}}+\frac{q}{1-q}\right)$.

The expected number of false positive errors in $A_{3}$: By Corollary 11, the only errors at times $t\in A_{3}$ are false positives, since $w_{t}=0$. For any $t\in A_{3}$ we have $O_{t}=T-k$. Therefore,

[TABLE]

From Lemma 28, $qp(W^{k}_{T})+q\geq O_{p(W^{k}_{T})}=T-k$ and thus $p(W^{k}_{T})\geq\frac{T-k}{q}-1$ . From Theorem 25, since $1-q\geq\frac{1}{2}$ , we have

[TABLE]

Since $k>(1-q)T-q$, the loss of the best static bit predictor is

[TABLE]

Hence, the regret in this case is

[TABLE]

∎

See Lemma 15.

Proof.

Fix $q\in\left[\frac{1}{2},1\right]$ and a bit sequence $\Gamma=(\gamma_{1},\ldots,\gamma_{T})$. We show that $Regret^{q}_{TS(q)}\left(\Gamma\right)=Regret^{1-q}_{TS(1-q)}\left(\bar{\Gamma}\right)$. At each step $t\in\left[T\right]$, $O_{t}\left(\Gamma\right)=Z_{t}\left(\bar{\Gamma}\right)$ and $Z_{t}\left(\Gamma\right)=O_{t}\left(\bar{\Gamma}\right)$. Therefore, by Fact 1, we have

[TABLE]

The benchmarks are also the same:

[TABLE]

We conclude that $Regret^{q}_{TS(q)}\left(\Gamma\right)=Regret^{1-q}_{TS(1-q)}\left(\bar{\Gamma}\right)$ . ∎
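The symmetry rests on the reflection property of the Beta distribution: if $\theta\sim\mathrm{Beta}(a,b)$ then $1-\theta\sim\mathrm{Beta}(b,a)$, so $I_x(a,b)=1-I_{1-x}(b,a)$ for the regularized incomplete beta function $I$. Fact 1 is not restated in this section, so reading it as this identity is an assumption. The sketch checks the identity exactly for integer parameters via the standard representation $I_x(a,b)=\Pr[\mathrm{Bin}(a+b-1,x)\ge a]$. It also assumes a hypothetical rule for $TS(q)$, inferred rather than quoted: sample $\theta\sim\mathrm{Beta}(O_{t-1}+1,Z_{t-1}+1)$ and predict 1 iff $\theta>q$.

```python
from fractions import Fraction
from math import comb


def beta_cdf(x: Fraction, a: int, b: int) -> Fraction:
    """Regularized incomplete beta I_x(a, b) for integers a, b >= 1,
    using the identity I_x(a, b) = P[Bin(a + b - 1, x) >= a]."""
    n = a + b - 1
    return sum((comb(n, j) * x**j * (1 - x) ** (n - j) for j in range(a, n + 1)), Fraction(0))


def prob_predict_one(ones: int, zeros: int, q: Fraction) -> Fraction:
    """Assumed TS(q) rule: theta ~ Beta(ones + 1, zeros + 1), predict 1 iff theta > q."""
    return 1 - beta_cdf(q, ones + 1, zeros + 1)


def check_complement_symmetry(ones: int, zeros: int, q: Fraction) -> bool:
    """P[TS(q) predicts 1 given counts (ones, zeros)] equals
    P[TS(1-q) predicts 0 given the swapped counts (zeros, ones)],
    which are the counts of the complemented prefix."""
    lhs = prob_predict_one(ones, zeros, q)
    rhs = 1 - prob_predict_one(zeros, ones, 1 - q)
    return lhs == rhs


qs = [Fraction(1, 2), Fraction(2, 3), Fraction(3, 4), Fraction(9, 10)]
print(all(check_complement_symmetry(o, z, q) for q in qs for o in range(8) for z in range(8)))
```

If, as assumed here, a false positive costs the weight and a false negative costs one minus the weight, then a false positive of $TS(q)$ on $\Gamma$ (cost $q$) corresponds to a false negative of $TS(1-q)$ on $\bar\Gamma$, whose cost is $1-(1-q)=q$, so the per-step expected losses match.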

See Theorem 16.

Proof.

Assume $q\in\left[0,\frac{1}{2}\right]$. By Theorem 12, among bit sequences with $k$ zeros, those with the largest regret are worst-case sequences. Theorem 14 shows that the regret of these sequences is

[TABLE]

Thus, the worst-case regret over all $k$ is

[TABLE]

For $q\in\left[\frac{1}{2},1\right]$, Lemma 15 combined with Theorem 14 gives the same regret bound of $O\left(\sqrt{q(1-q)T}\right)$. ∎

Appendix G Best-case regret proofs for a general $q$ (Section 6.3)

See Theorem 17.

Proof.

First we calculate the loss of $\Gamma_{1}=1^{n}0^{m}$ . For $t\leq n$ we have $\gamma_{t}=1$ . Thus, by using Lemma 26,

[TABLE]

Using Fact 19, we have

[TABLE]

This implies that the expected number of false negative errors in steps $t\leq n$ is

[TABLE]

For $t\geq n+1$ there are at most $m$ errors, so

[TABLE]

Therefore, the expected loss of $TS(q)$ on $\Gamma_{1}$ is bounded by

[TABLE]

Analogously, we bound the expected loss of $TS(q)$ on $\Gamma_{2}=0^{m}1^{n}$ by

[TABLE]

Both sequences have the same benchmark, which equals

[TABLE]

Therefore, if $\min\{qm,(1-q)n\}=qm$ then by Eq. (G)

[TABLE]

Otherwise $\min\{qm,(1-q)n\}=(1-q)n$ and by Eq. (34)

[TABLE]

∎
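To make the best-case claim concrete, the sketch below computes the expected weighted loss of $TS(q)$ exactly on sequences such as $1^n0^m$. It rests on two assumptions that are not quoted from the paper: that $TS(q)$ samples $\theta\sim\mathrm{Beta}(O_{t-1}+1,Z_{t-1}+1)$ and predicts 1 iff $\theta>q$, and that a false positive costs $q$ and a false negative costs $1-q$, which is consistent with the benchmark $\min\{qm,(1-q)n\}$ used above. The function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def beta_cdf(x: Fraction, a: int, b: int) -> Fraction:
    """I_x(a, b) for integers a, b >= 1, via P[Bin(a + b - 1, x) >= a]."""
    n = a + b - 1
    return sum((comb(n, j) * x**j * (1 - x) ** (n - j) for j in range(a, n + 1)), Fraction(0))


def expected_loss(bits: list[int], q: Fraction) -> Fraction:
    """Expected weighted loss of the (assumed) TS(q) rule on a fixed bit sequence.

    Assumed rule: theta ~ Beta(ones + 1, zeros + 1); predict 1 iff theta > q.
    Assumed weights: a false positive costs q, a false negative costs 1 - q.
    """
    ones = zeros = 0
    loss = Fraction(0)
    for bit in bits:
        p_zero = beta_cdf(q, ones + 1, zeros + 1)  # P[theta <= q] = P[predict 0]
        if bit == 1:
            loss += (1 - q) * p_zero               # false negative
            ones += 1
        else:
            loss += q * (1 - p_zero)               # false positive
            zeros += 1
    return loss


def regret(bits: list[int], q: Fraction) -> Fraction:
    """Expected loss minus the static benchmark min{q * #zeros, (1 - q) * #ones}."""
    n_ones = sum(bits)
    n_zeros = len(bits) - n_ones
    return expected_loss(bits, q) - min(q * n_zeros, (1 - q) * n_ones)


q = Fraction(1, 3)
for n, m in [(30, 5), (30, 20), (10, 30)]:
    print(f"1^{n} 0^{m}: regret = {float(regret([1] * n + [0] * m, q)):.4f}")
```

Under these assumptions, on the all-ones sequence $1^n$ the sampled $\theta\sim\mathrm{Beta}(t,1)$ falls below $q$ with probability $q^t$, so the expected loss is $(1-q)\sum_{t=1}^{n}q^t=q(1-q^n)<q$, a constant independent of $n$.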

Appendix H Binomial coefficient approximations

We use the following well-known approximation of the binomial coefficient, which follows from Stirling's approximation (see, for example, [39]).

Fact 29.

For every $m\in\mathbb{N}^{+}$ and $n\leq m$ we have

[TABLE]
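The exact statement of Fact 29 is not reproduced here. One standard Stirling-based form, from MacWilliams and Sloane, is, for $0<n<m$,
$\sqrt{\frac{m}{8n(m-n)}}\,\frac{m^m}{n^n(m-n)^{m-n}}\le\binom{m}{n}\le\sqrt{\frac{m}{2\pi n(m-n)}}\,\frac{m^m}{n^n(m-n)^{m-n}}$;
the paper's version may differ in form or constants. The sketch checks this form in log space for small $m$, with a small tolerance because the lower bound is tight at $m=2$, $n=1$.

```python
import math


def log_entropy_term(m: int, n: int) -> float:
    """log of m^m / (n^n (m - n)^(m - n)), i.e. m * H(n/m) in nats."""
    return m * math.log(m) - n * math.log(n) - (m - n) * math.log(m - n)


def stirling_bounds_hold(m: int, n: int, tol: float = 1e-9) -> bool:
    """Check the two-sided bound on C(m, n) for 0 < n < m, in log space."""
    log_c = math.log(math.comb(m, n))
    base = log_entropy_term(m, n)
    log_lower = 0.5 * math.log(m / (8 * n * (m - n))) + base
    log_upper = 0.5 * math.log(m / (2 * math.pi * n * (m - n))) + base
    return log_lower - tol <= log_c <= log_upper + tol


print(all(stirling_bounds_hold(m, n) for m in range(2, 201) for n in range(1, m)))
```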

From the fact above we conclude the following lemma.

Lemma 30.

For every constant $p\in(0,1)$ and $n\geq\frac{2p}{1-p}$ , we have

[TABLE]

Proof.

Let $m=\left\lfloor\frac{n}{p}\right\rfloor$ . We bound $\binom{m}{n}$ using Fact 29 as follows

[TABLE]

By the definition of the floor function, there exists $\omega\in[0,1)$ such that $m=\frac{n}{p}-\omega$, and therefore

[TABLE]

Since $0\leq p\omega<p$ we have

[TABLE]

Since $n\geq\frac{2p}{1-p}$ we get that $\frac{(1-p)n}{2}\geq p$ and therefore $(1-p)n-p\geq\frac{(1-p)n}{2}$ . Thus, by using Eq. (H),

[TABLE]

We bound $\left(\frac{n}{(1-p)n-p}\right)^{\frac{n}{p}-\omega-n}$ as follows:

[TABLE]

where the last inequality holds because $\left(1-\frac{p}{(1-p)n}\right)^{\frac{(1-p)n}{p}}$ is monotonically increasing in $n$; hence, over the range $n\geq\frac{2p}{1-p}$, it attains its minimum at $n=\frac{2p}{1-p}$, where it equals $\left(1-\frac{1}{2}\right)^{2}=\frac{1}{4}$.
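Writing $x=\frac{(1-p)n}{p}$, the expression above is $f(x)=\left(1-\frac1x\right)^x$, and the condition $n\ge\frac{2p}{1-p}$ is exactly $x\ge2$. Monotonicity follows from $\frac{d}{dx}\ln f(x)=\ln\left(1-\frac1x\right)+\frac{1}{x-1}=u-\ln(1+u)>0$ with $u=\frac{1}{x-1}$. The sketch below is only a numerical sanity check of that fact on a grid, not a proof; the function name is illustrative.

```python
import math


def f(x: float) -> float:
    """f(x) = (1 - 1/x)^x, the expression from the proof with x = (1 - p) n / p."""
    return (1.0 - 1.0 / x) ** x


xs = [2.0 + 0.5 * i for i in range(400)]          # grid on [2, 201.5]
values = [f(x) for x in xs]
increasing = all(a < b for a, b in zip(values, values[1:]))
print(f(2.0), increasing, values[-1] < 1 / math.e)
```

The values start at $f(2)=\frac14$ and increase toward $e^{-1}$, which they never reach.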

From Eq. (36,H) we have

[TABLE]

∎
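The statement of Lemma 30 is not reproduced in this section; from its use above (with $p=1-q$ and $n=Z_t+1$) and from $m=\left\lfloor\frac{n}{p}\right\rfloor$ in its proof, it concerns the point probability $\Pr[X=n]$ for $X\sim\mathrm{Bin}\left(\left\lfloor\frac{n}{p}\right\rfloor,p\right)$. As a hedged numerical illustration of the order of that quantity, not of the lemma's constants, the sketch compares it with the local central limit approximation $\frac{1}{\sqrt{2\pi n(1-p)}}$, computing the pmf in log space to avoid floating-point underflow. The helper names are hypothetical.

```python
import math


def log_binom_pmf(k: int, m: int, p: float) -> float:
    """log P[X = k] for X ~ Bin(m, p), via lgamma to avoid underflow."""
    return (math.lgamma(m + 1) - math.lgamma(k + 1) - math.lgamma(m - k + 1)
            + k * math.log(p) + (m - k) * math.log(1 - p))


def point_prob_ratio(n: int, p: float) -> float:
    """P[X = n] for X ~ Bin(floor(n / p), p), divided by 1 / sqrt(2 pi n (1 - p))."""
    m = math.floor(n / p)
    prob = math.exp(log_binom_pmf(n, m, p))
    return prob * math.sqrt(2 * math.pi * n * (1 - p))


for p in (0.3, 0.5, 0.7):
    print(p, [round(point_prob_ratio(n, p), 4) for n in (10, 100, 1000, 10000)])
```

The ratio approaches 1 as $n$ grows, consistent with $\Pr[X=n]=\Theta\left(\frac{1}{\sqrt{(1-p)n}}\right)$ for fixed $p$.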

Bibliography


[1] Nicolò Cesa-Bianchi and Gabor Lugosi. Prediction, learning, and games. Cambridge University Press, 2006.
[2] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
[3] Tor Lattimore and Csaba Szepesvári. Bandit algorithms. http://downloads.tor-lattimore.com/banditbook/book.pdf, 2019.
[4] Aleksandrs Slivkins. Introduction to multi-armed bandits. arXiv preprint arXiv:1904.07272, 2019.
[5] T.M. Cover. Behavior of sequential predictors of binary sequences. In Transactions of the Fourth Prague Conference on Information Theory, 1966.
[6] Alexander Rakhlin and Karthik Sridharan. Statistical learning and sequential prediction. http://www.mit.edu/ rakhlin/courses/stat 928/stat 928_notes.pdf, 2014.
[7] William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3–4):285–294, 1933.
[8] Daniel J. Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, Zheng Wen, et al. A tutorial on Thompson sampling. Foundations and Trends® in Machine Learning, 11(1):1–96, 2018.
