Best-of-Three-Worlds Linear Bandit Algorithm with Variance-Adaptive Regret Bounds
Shinji Ito, Kei Takemura

TL;DR
This paper introduces a hierarchical adaptive linear bandit algorithm that achieves optimal regret bounds across adversarial and stochastic environments, with variance-adaptive performance and data-dependent regret guarantees.
Contribution
It presents a novel hierarchical algorithm that attains best-of-three-worlds regret bounds and incorporates new techniques for high-level and low-level adaptability in linear bandits.
Findings
Achieves ${O}(\sqrt{T \log T})$ regret in adversarial settings.
Attains $O(\frac{\log T}{\Delta_{\text{min}}} + \sqrt{\frac{C \log T}{\Delta_{\text{min}}}})$ regret in stochastic environments with corruption.
Provides variance-adaptive regret bounds of $O(\frac{\sigma^2 \log T}{\Delta_{\text{min}}})$.
| Parameter | Description |
|---|---|
| $T$ | time horizon |
| $d$ | dimensionality of action set |
| $\mathcal{A}$ | action set (One may assume w.l.o.g.) |
| | parameter of a self-concordant barrier used in the algorithm (One can choose so that ) |
| $\Delta_{\min}$ | minimum suboptimality gap: |
| | asymptotic lower bound parameter: |
| $\sigma^2$ | maximum variance of loss: |
| | minimum cumulative loss: |
| | total quadratic variation in loss sequence: |
| | path-length of loss sequence: |
Best-of-Three-Worlds Linear Bandit Algorithm
with Variance-Adaptive Regret Bounds
Shinji Ito∗
Kei Takemura∗
∗NEC Corporation.
Abstract
This paper proposes a linear bandit algorithm that is adaptive to environments at two different levels of hierarchy. At the higher level, the proposed algorithm adapts to a variety of types of environments. More precisely, it achieves best-of-three-worlds regret bounds, i.e., of $O(\sqrt{T \log T})$ for adversarial environments and of $O(\frac{\log T}{\Delta_{\min}} + \sqrt{\frac{C \log T}{\Delta_{\min}}})$ for stochastic environments with adversarial corruptions, where $T$, $\Delta_{\min}$, and $C$ denote, respectively, the time horizon, the minimum sub-optimality gap, and the total amount of the corruption. Note that polynomial factors in the dimensionality are omitted here. At the lower level, in each of the adversarial and stochastic regimes, the proposed algorithm adapts to certain environmental characteristics, thereby performing better. The proposed algorithm has data-dependent regret bounds that depend on all of the cumulative loss for the optimal action, the total quadratic variation, and the path-length of the loss vector sequence. In addition, for stochastic environments, the proposed algorithm has a variance-adaptive regret bound of $O(\frac{\sigma^2 \log T}{\Delta_{\min}})$ as well, where $\sigma^2$ denotes the maximum variance of the feedback loss. The proposed algorithm is based on the SCRiBLe algorithm. By incorporating into this a new technique we call scaled-up sampling, we obtain high-level adaptability, and by incorporating the technique of optimistic online learning, we obtain low-level adaptability.
1 Introduction
This paper considers linear bandit problems. In this class of problems, a player chooses, in each round , an action from a given action set , which is a subset of a -dimensional linear space. The player then observes the incurred loss , where the (conditional) expectation of is assumed to be a linear function, i.e., is expressed as with some vector and some noise . The performance of the player is evaluated in terms of regret defined as and .
Algorithms for linear bandit problems have been proposed mainly for two different types of environments: stochastic and adversarial. In stochastic environments, are assumed to follow an unknown distribution independently for all . Consequently, we may assume that there exists such that and follows an identical distribution for all . (Footnote 1: In standard stochastic settings, it is assumed that follows a zero-mean distribution for all . Our proposed algorithm, however, works well under milder assumptions, details of which are given in Section 2.1 and Remark 3.)
In adversarial environments, the distributions of (and thus also ) are decided arbitrarily depending on the action sequence that the player has chosen so far.
What we can do in linear bandit problems varies greatly depending on the type of environment. For stochastic environments, it is known that the optimal regret is of (Lattimore and Szepesvari, 2017), ignoring the factor dependent on and . For adversarial environments, the mini-max optimal regret is (Bubeck et al., 2012), where we ignore poly-logarithmic factors with respect to and in the notation of and . A class of intermediate settings between these two types of environments is called stochastic environments with adversarial corruption (Lykouris et al., 2018), or corrupted stochastic environments. Environments in this regime are parametrized by the corruption level $C$, which measures the amount of the adversarial component. For this setting, an algorithm achieving -regret has been proposed (Lee et al., 2021).
The aim of this paper is to make possible the construction of adaptive algorithms that automatically exploit certain specific characteristics of environments. In existing studies of bandit algorithms, the concept of adaptability has been considered at two different levels, which we refer to here as high-level adaptability and low-level adaptability. Algorithms with high-level adaptability are designed to work well for different types of environments, e.g., stochastic and adversarial types. Algorithms with low-level adaptability perform better in specific individual environments by exploiting certain favorable characteristics that they possess, e.g., small cumulative loss or small variance in loss sequences.
High-level-adaptive bandit algorithms that perform (nearly) optimally for both stochastic and adversarial environments are called best-of-both-worlds (BOBW) algorithms (Bubeck and Slivkins, 2012). Among such algorithms, those that can adapt to corrupted stochastic environments are referred to as best-of-all-worlds (Erez and Koren, 2021) or best-of-three-worlds (BOTW) algorithms (Lee et al., 2021). For linear bandit problems, Lee et al. (2021) provide a best-of-three-worlds algorithm that achieves regret bounds of for stochastic environments, of for adversarial environments, and of for corrupted stochastic environments.
Various types of low-level-adaptive algorithms have been considered for adversarial bandit problems. Representative examples are algorithms with -regret, where represents the cumulative loss for the optimal action; such examples are said to have first-order regret bounds. In addition to such algorithms, Hazan and Kale (2011) proposed an algorithm with a second-order regret bound of that depends on the total quadratic variation of loss vectors. An algorithm by Ito (2021) achieves -regret, which means that the algorithm simultaneously has first-order and second-order bounds as well as a bound depending on the path-length of the loss sequence. These regret bounds, which are referred to as data-dependent regret bounds, imply that algorithm performance can be improved by exploiting certain environmental characteristics that are common in applications, such as small variations in loss sequences or sparsity of loss. For the stochastic multi-armed bandit problem, Audibert et al. (2007) proposed an algorithm with an -regret bound that depended not only on the sub-optimality gap but also on the variance of the loss. We refer to such bounds as variance-adaptive bounds, and they can be considered to represent low-level adaptability in stochastic regimes.
1.1 Contribution of this work
The main contribution of this paper is the proposal of a linear bandit algorithm that combines high-level adaptability and low-level adaptability. It is a BOTW algorithm that achieves regret bounds of in stochastic environments, in adversarial environments, and in corrupted stochastic environments, ignoring factors depending on and . Further, the algorithm achieves first-order, second-order, and path-length bounds in adversarial environments. Simultaneously, it has variance-adaptive regret bounds for (corrupted) stochastic environments.
The proposed algorithm (Algorithm 1) follows the approach of SCRiBLe (Abernethy et al., 2008a, 2012) which stands for Self-Concordant Regularization in Bandit Learning. This approach uses a class of functions known as self-concordant barriers (Nesterov and Nemirovskii, 1994) as regularizers. Self-concordant barriers are characterized with a parameter that can be assumed to satisfy , details of which are given in Section 2.3. The regret bounds of our algorithm can be expressed with parameters explained in Table 1, including the parameter , as follows:
Theorem 1 (informal).
In adversarial environments with , the regret for Algorithm 1 is bounded as . Further, if for all and , we have . In stochastic environments (i.e., if for all ), we have . In corrupted stochastic environments with the corruption level , we have .
Table 2 provides a comparison of our regret bounds with those in existing studies. For stochastic settings, the tight asymptotic regret given and can be characterized with , a definition of which can be found in, e.g., the paper by Lattimore and Szepesvari (2017). They have provided an algorithm that achieves an asymptotically optimal regret bound of . However, such asymptotically optimal algorithms are not necessarily optimal in environments with small variance . In the case of , the proposed algorithm provides a better regret bound. We would also like to emphasize the fact that our stochastic regret bound includes only a single factor, while the bound by the BOTW algorithm of Lee et al. (2021) includes a factor.
For adversarial environments, Ito (2021) has provided an algorithm with data-dependent regret bounds that depend on , , and simultaneously. In this regard, our regret bounds here have an additional factor of but are better in terms of the dependency w.r.t. . Our regret bounds can be better than those of the BOTW algorithm by Lee et al. (2021) if the loss sequence satisfies .
For corrupted stochastic environments, the BOTW algorithm by Lee et al. (2021) achieves a regret bound of , while our bound is , where satisfies . As follows from the AM-GM inequality, our algorithm also implies , which is superior to the bound by Lee et al. (2021) when . We would like to stress here that the impact of corruption on the performance of our algorithm is only of a square-root factor in , while algorithms in existing studies (Lee et al., 2021; Li et al., 2019; Bogunovic et al., 2021) include at least a linear factor in . Comparison of such results w.r.t. corrupted settings, however, requires particular care, as there are differences in the details of problem settings.
Remark 1.
In this paper, regret is defined in terms of loss including corruption, while existing studies define regret in terms of loss without corruption. As the difference between these two notions of regret is at most , our algorithm enjoys -bound even in terms of the latter definition of regret. Such a difference in models has been discussed by Gupta et al. (2019).
The main innovations for achieving high-level adaptability (i.e., the BOTW property) are with regard to the sampling method for actions. In a previous study, Abernethy et al. (2008b) compute a point in the convex hull of the action set using a follow-the-regularized-leader (FTRL) approach, and then pick from the Dikin ellipsoid that is defined from the self-concordant barrier for . Here, the action must be sampled so that its expectation matches . In addition, the larger the variance of , the better the estimator for that we can construct, i.e., the smaller the variance of . In this paper, in order to improve the variance of the loss estimator, we introduce a new technique that we refer to as scaled-up sampling (see Figure 1). In this approach, we construct a scaled-up set of the Dikin ellipsoid with a reference point , for which we let denote the scaling factor. Rather than sampling from as is done in the previous study, we pick from with probability , and otherwise set (the expectation of then matches as well). Consequently, the variance of becomes times larger and the variance of the loss-vector estimator becomes times smaller, which contributes to the improvement of the regret upper bound. In stochastic environments in particular, intuitively, approaches an extreme point (a truly optimal solution), allowing for a smaller and a larger value of , which leads to a significant improvement in regret.
In proving the high-level adaptability, we use the self-bounding technique (Zimmert and Seldin, 2021; Wei and Luo, 2018). We first show that the proposed algorithm, an FTRL method with scaled-up sampling and an adaptive learning rate, achieves a regret bound of . We further show that holds in any stochastic environment, where denotes the round-wise regret caused by choosing . Combining these two facts, we can obtain , which immediately leads to in stochastic environments. As has been done in previous analyses using the self-bounding technique, we can prove improved regret bounds for the stochastically constrained adversarial regime (Zimmert and Seldin, 2021) as well, which includes corrupted stochastic environments.
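The mechanism of the self-bounding argument can be illustrated with a short calculation. The sketch below is a simplified stand-in for the two facts combined above, not a restatement of the paper's exact inequalities: $R$ denotes the regret, $K > 0$ is a hypothetical constant absorbing $\log T$ and dimension-dependent factors, $\Delta$ stands for the minimum gap $\Delta_{\min}$, and $C \ge 0$ is the corruption level (with $C = 0$ in the purely stochastic case).

```latex
% Assumptions (illustrative only): R >= 0 is the regret, K > 0 a hypothetical
% constant, \Delta = \Delta_{\min} > 0 the minimum gap, C >= 0 the corruption level.
\begin{align*}
R \le \sqrt{\frac{K\,(R + C)}{\Delta}}
\;&\Longrightarrow\; R^2 - \frac{K}{\Delta}\,R - \frac{K C}{\Delta} \le 0 \\
\;&\Longrightarrow\; R \le \frac{K}{2\Delta} + \sqrt{\frac{K^2}{4\Delta^2} + \frac{K C}{\Delta}}
\;\le\; \frac{K}{\Delta} + \sqrt{\frac{K C}{\Delta}} .
\end{align*}
```

The last step uses $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$. With $K$ proportional to $\log T$, the resulting shape matches a bound of the form $O(\frac{\log T}{\Delta_{\min}} + \sqrt{\frac{C \log T}{\Delta_{\min}}})$, and setting $C = 0$ leaves only the logarithmic term.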
To achieve low-level adaptability (i.e., data-dependent bounds in adversarial environments and variance-adaptive bounds in stochastic environments), we employ the framework of optimistic online learning (Rakhlin and Sridharan, 2013). This framework incorporates optimistic prediction for into online learning algorithms, thereby providing regret bounds depending on rather than . The proposed algorithm determines by means of the technique of tracking the best linear predictor, which leads to the hybrid data-dependent bounds and variance-adaptive bounds. Similar approaches can be found in (Ito, 2021; Ito et al., 2022).
1.2 Limitation of this work and future work
We should note the issue of computational complexity w.r.t. the proposed algorithm. In the proof of -regret bounds for (corrupted) stochastic environments, we need the assumption that the reference point , illustrated in Figure 1, is chosen so that the scaling factor is (approximately) maximized. We have not, however, found an efficient method for computing such a point . A naive method for computing such a requires a computational time of at least , which is highly expensive, e.g., as in most examples of combinatorial bandits (Cesa-Bianchi and Lugosi, 2012). Resolving this issue of computational complexity will be important in future work.
There is still some room for improvement in terms of regret bounds as well. As can be seen from Example 4 by Lattimore and Szepesvari (2017), the gap between and can be arbitrarily large, which implies that our stochastic regret bound is much larger than the lower bound in the worst case. We also note that our regret bounds only hold in expectation while regret guarantees by Lee et al. (2021) hold with high probability. If we pursue high probability bounds as well, we cannot avoid an extra factor, as discussed in their Appendix D.
2 Preliminary
2.1 Problem setup
This section introduces the setup of the linear bandit problems dealt with in this paper. Before a game starts, the player is given the time horizon and an action set , a closed and bounded set of -dimensional vectors. Without loss of generality, we assume that is not included in any proper affine subspace of . We also assume that all points in have norm of at most , i.e., , where denotes an ball of the radius : . In each round , the environment determines a loss function , and the player then chooses an action without knowing . After that, the player observes the incurred loss . The loss function can be chosen depending on the actions selected so far. We assume that the conditional expectation of given is an affine function, i.e., there exists , which is referred to as a loss vector, and such that is expressed as
[TABLE]
This paper also assumes that . By imposing further conditions on , we can express a variety of regimes, as are discussed below:
Stochastic regime
In a stochastic regime, it is assumed that follows an unknown distribution for all independently. This assumption implies that and do not change over all rounds, i.e., there exists a true loss vector and such that and hold for all . Note here that standard stochastic settings also assume that functions of represent zero-mean noise, i.e., . This assumption is, however, not necessary in the algorithm proposed in this paper. Moreover, the proposed algorithm does not even require the assumption that follows an identical distribution, details of which will be discussed in Section 4.1.
Adversarial regime
In the adversarial regime, by way of contrast to the stochastic regime, is an arbitrary sequence. More precisely, can be chosen in an adversarial way depending on . Though adversarial environments considered in previous studies are often free from noise, i.e., is assumed, most algorithms work well as long as the noise follows bounded zero-mean distributions. The proposed algorithm in this paper does not require this assumption either.
Stochastic regime with adversarial corruption
The stochastic regime with adversarial corruption is an intermediate regime between stochastic and adversarial regimes. It is parametrized by a true loss vector and by a corruption level $C$. In this regime, the sequence of is subject to the constraint that . This can be interpreted as a situation in which an adversary adds a corruption of to the loss function defined by and the magnitude of sums up to at most $C$, i.e., . If we set the corruption level to zero, this regime coincides with the stochastic regime. On the other hand, if , then the regime is adversarial as there are no constraints on except for .
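The corrupted stochastic regime can be made concrete with a small simulation. The following is a minimal sketch under stated assumptions: the corruption is measured as the summed Euclidean norm of the deviation of each round's loss vector from the true loss vector (the exact norm used in the constraint is not recoverable from the text above), and the names `ell_star`, `corrupted_losses`, and `corruption_level` are hypothetical.

```python
import numpy as np

def corrupted_losses(ell_star, corruptions):
    """Return one loss vector per round: the true loss vector plus that round's corruption."""
    return ell_star[None, :] + corruptions

def corruption_level(ell_star, losses):
    """Total corruption, assumed here to be the sum over rounds of ||ell_t - ell_star||_2."""
    return float(np.linalg.norm(losses - ell_star[None, :], axis=1).sum())

rng = np.random.default_rng(0)
d, T = 3, 100
ell_star = np.array([0.5, -0.2, 0.1])        # hypothetical true loss vector
corruptions = np.zeros((T, d))
corruptions[:10] = 0.05 * rng.standard_normal((10, d))  # the adversary perturbs only the first 10 rounds
losses = corrupted_losses(ell_star, corruptions)
C = corruption_level(ell_star, losses)
```

With all corruptions set to zero the computed level is zero, which corresponds to the purely stochastic regime described above.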
2.2 Follow the regularized leader
In the proposed algorithm, we use the framework of (optimistic) follow-the-regularized-leader (FTRL) methods. In this framework, we choose a point in a closed convex set by solving the following optimization problem:
[TABLE]
where is the (estimated) loss vector, is an optimistic prediction, and is a regularization term, which is a differentiable convex function over . Note that the original FTRL framework here does not employ optimistic prediction, i.e., the value of is fixed to $0$. The technique of optimistic prediction has been introduced to further improve the performance of FTRL, e.g., by Rakhlin and Sridharan (2013).
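To make the role of the optimistic prediction concrete, the following sketch implements one optimistic FTRL step in a deliberately simplified setting. It assumes the unit Euclidean ball as the domain and swaps in a quadratic regularizer $\|x\|^2 / (2\eta)$ in place of the self-concordant barrier used in the paper, because this choice has a closed-form solution (a projection). The function names are hypothetical.

```python
import numpy as np

def project_to_ball(v, radius=1.0):
    """Euclidean projection onto the ball of the given radius."""
    norm = np.linalg.norm(v)
    return v if norm <= radius else v * (radius / norm)

def optimistic_ftrl_point(past_losses, prediction, eta):
    """Minimizer over the unit ball of <sum(past_losses) + prediction, x> + ||x||^2 / (2 * eta).

    Completing the square shows that this is the projection of
    -eta * (sum(past_losses) + prediction) onto the ball.
    """
    total = np.sum(past_losses, axis=0) + prediction
    return project_to_ball(-eta * total)

past = [np.array([0.3, -0.1]), np.array([0.2, 0.4])]
x_plain = optimistic_ftrl_point(past, np.zeros(2), eta=0.5)               # no optimistic prediction
x_opt = optimistic_ftrl_point(past, np.array([0.25, 0.15]), eta=0.5)      # prediction of the next loss
```

Setting the prediction to zero recovers plain FTRL; a prediction close to the upcoming loss vector moves the iterate as if that loss had already been observed.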
In the analysis of FTRL, we use the Bregman divergence associated with some differentiable convex function defined as follows:
[TABLE]
where denotes the gradient of at . We can easily see that for any and , which follows from the convexity of . The following lemma provides an upper bound of the regret for FTRL:
Lemma 1.
We assume that and hold for all and . If is given by (2), it holds for any that
[TABLE]
where is defined by .
This lemma can be shown via a standard analysis for FTRL, e.g., as in Chapter 28 of Lattimore and Szepesvári (2019). We can also refer to, e.g., the proof of Lemma 1 by Ito et al. (2022).
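The Bregman divergence that appears in this analysis can be computed directly from a function and its gradient. The sketch below uses the standard definition, the function value at the first point minus its first-order approximation around the second point, and evaluates it for the barrier $-\log(1 - \|x\|^2)$ of the unit ball. That barrier and the helper names are illustrative assumptions, not the paper's specific choice.

```python
import numpy as np

def bregman(psi, grad_psi, y, x):
    """Bregman divergence of psi between y and x: psi(y) - psi(x) - <grad psi(x), y - x>."""
    return psi(y) - psi(x) - float(np.dot(grad_psi(x), y - x))

def ball_barrier(x):
    """Illustrative convex function on the open unit ball: -log(1 - ||x||^2)."""
    return -np.log(1.0 - np.dot(x, x))

def ball_barrier_grad(x):
    """Gradient of ball_barrier."""
    return 2.0 * x / (1.0 - np.dot(x, x))

x = np.array([0.1, 0.2])
y = np.array([-0.3, 0.5])
D = bregman(ball_barrier, ball_barrier_grad, y, x)   # non-negative because ball_barrier is convex
```

For the quadratic function $\frac{1}{2}\|x\|^2$ the same routine returns $\frac{1}{2}\|y - x\|^2$, which is a quick sanity check of the definition.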
2.3 Self-concordant barriers
In our proposed algorithm, we use self-concordant barriers to define regularization terms, just as Abernethy et al. (2008b) did. Self-concordant barriers are defined as follows:
Definition 1.
A convex function of class is called a self-concordant function if (i) holds for any and , and (ii) tends to infinity along every sequence converging to a boundary point of , where denotes the value of the -th differential of at along the directions . Let be a non-negative real number. A self-concordant function is called a -self-concordant barrier for if holds for any and .
Remark 2.
For any convex set , there exists a -self-concordant barrier for (Lee and Yue, 2021). This barrier is, however, not always efficiently computable. On the other hand, for any -dimensional polytope, we can compute an -self-concordant barrier with in polynomial time (Lee and Sidford, 2014, 2019).
Given a self-concordant barrier , for any and , we assume that has full rank. Denote
[TABLE]
and define the Dikin ellipsoid of centered at of the radius as follows:
[TABLE]
The three lemmas below are used in the design and analysis of our proposed algorithm.
Lemma 2 (Theorem 2.1.2 by Nesterov and Nemirovskii (1994)).
If is a self-concordant barrier for a closed convex set , every Dikin ellipsoid of of radius is contained in , i.e., holds for any .
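The containment property of Lemma 2 can be checked numerically on a simple domain. The sketch below assumes the unit Euclidean ball with the barrier $-\log(1 - \|x\|^2)$, an illustrative choice rather than the paper's setting, and tests that boundary points of the radius-1 Dikin ellipsoid, written as $x + H(x)^{-1/2} u$ with $\|u\|_2 = 1$ and $H$ the Hessian of the barrier, stay inside the ball. The helper names are hypothetical.

```python
import numpy as np

def ball_barrier_hessian(x):
    """Hessian of psi(x) = -log(1 - ||x||^2) on the open unit ball."""
    s = 1.0 - np.dot(x, x)
    return 2.0 * np.eye(len(x)) / s + 4.0 * np.outer(x, x) / s**2

def dikin_boundary_point(x, u):
    """Return x + H(x)^{-1/2} u, a boundary point of the radius-1 Dikin ellipsoid when ||u|| = 1."""
    eigvals, eigvecs = np.linalg.eigh(ball_barrier_hessian(x))
    inv_sqrt = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T
    return x + inv_sqrt @ u

rng = np.random.default_rng(1)
max_norm = 0.0
for _ in range(1000):
    x = rng.standard_normal(4)
    x *= rng.uniform(0.0, 0.99) / np.linalg.norm(x)   # random interior point of the unit ball
    u = rng.standard_normal(4)
    u /= np.linalg.norm(u)                            # random unit direction
    max_norm = max(max_norm, np.linalg.norm(dikin_boundary_point(x, u)))
```

The ellipsoid shrinks automatically as the center approaches the boundary, which is why a sampling scheme built on it never leaves the domain.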
Let denote the Minkowski function of whose pole is at :
[TABLE]
We have an upper bound on expressed with this Minkowski function, as follows:
Lemma 3 (Proposition 2.3.2 by Nesterov and Nemirovskii (1994)).
If is a -self-concordant barrier for , it holds for any and in that .
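A numerical illustration of Lemma 3 follows. It is a sketch under stated assumptions: the Minkowski function with pole $x$ is taken in its standard form $\pi_x(y) = \inf\{t > 0 : x + (y - x)/t \in \text{domain}\}$ (the display defining it is not recoverable from this text), the domain is the unit Euclidean ball, and the barrier is $-\log(1 - \|x\|^2)$ with barrier parameter 1. The bound checked is the standard statement $\psi(y) - \psi(x) \le \vartheta \log\frac{1}{1 - \pi_x(y)}$. All function names are hypothetical.

```python
import numpy as np

def minkowski_ball(x, y):
    """Minkowski function of the unit ball with pole x (x strictly inside), evaluated at y."""
    h = y - x
    a = np.dot(h, h)
    if a == 0.0:
        return 0.0
    b = np.dot(x, h)
    c = np.dot(x, x) - 1.0                         # negative because x is an interior point
    s_max = (-b + np.sqrt(b * b - a * c)) / a      # largest s >= 0 with ||x + s * h|| <= 1
    return 1.0 / s_max

def ball_barrier(x):
    """Barrier -log(1 - ||x||^2) for the unit ball."""
    return -np.log(1.0 - np.dot(x, x))

rng = np.random.default_rng(3)
min_slack = np.inf
for _ in range(1000):
    x, y = rng.uniform(-0.5, 0.5, size=(2, 3))     # both points lie strictly inside the unit ball
    pi = minkowski_ball(x, y)
    bound = np.log(1.0 / (1.0 - pi))               # right-hand side with barrier parameter 1
    min_slack = min(min_slack, bound - (ball_barrier(y) - ball_barrier(x)))
```

The bound grows only logarithmically as $y$ approaches the boundary along the ray from $x$, which is what keeps the penalty term of FTRL under control when the comparator is slightly shrunk toward an interior point.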
If we use a self-concordant barrier , we can use the following lemma to bound the stability term in Lemma 1.
Lemma 4.
Let be a self-concordant function on and . Let and . Suppose that . We then have
3 Algorithm
Let be the convex hull of and be a -self-concordant barrier for . In the proposed algorithm, we compute by solving the optimization problem (2) with , where is a learning rate parameter satisfying . The manner of computing , , , and will be presented below.
Action and unbiased estimator for loss vector
After computing , we choose the action so that . Let and be the set of eigenvectors and eigenvalues of . Define . Note that here holds since follows from the definition of and since follows from Lemma 2. In the algorithm by Abernethy et al. (2008b), the action is chosen from uniformly at random. Unlike this existing method, our proposed algorithm chooses an action from a set scaled up from with a reference point , or chooses with some probability. More precisely, after computing and choosing a point , we set by
[TABLE]
where is defined as the largest real number such that is included in . How to choose is discussed in the next paragraph. If we denote , we can express as follows:
[TABLE]
where is the Minkowski function defined by (7). We choose so that the value of is as small as possible. Let denote the center of , i.e., define . We then set with probability and with probability . If , we choose . If , we choose from uniformly at random. In other words, we pick uniformly at random from and with probability , and set . We then output so that its expectation coincides with . After obtaining feedback of , we define by
[TABLE]
We can show that the conditional expectation of is equal to and that is an unbiased estimator of , i.e., we have and , proofs of which are given in Section D in the appendix. We note that, thanks to the scaled-up sampling, the mean square of is improved by a factor of , which plays a central role in our proof of BOTW regret bounds.
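The unbiasedness claims can be verified by exact enumeration in a small example. The following sketch is an assumption-laden illustration, not the paper's definition (10), which is not recoverable from this text: it writes `x` for the FTRL point, `z` for the reference point, `alpha >= 1` for the scaling factor, `H` for the Hessian of the barrier at `x`, and `m` for the optimistic prediction. With probability `1/alpha` it plays an endpoint of a principal axis of the ellipsoid scaled up by `alpha` around `z`, and otherwise it plays `z`. Feasibility of the scaled-up points is not checked here; in the algorithm the scaling factor is chosen so that the scaled-up set stays inside the domain.

```python
import numpy as np

def outcomes(x, z, alpha, H):
    """Enumerate (probability, action, scaled direction) for the sampling scheme sketched above.

    The scaled direction is None when the reference point z is played.
    """
    d = len(x)
    lam, E = np.linalg.eigh(H)                     # eigenvalues and eigenvectors of H
    y = z + alpha * (x - z)                        # center of the scaled-up ellipsoid
    result = [(1.0 - 1.0 / alpha, z, None)]
    for i in range(d):
        for s in (1.0, -1.0):
            a = y + alpha * s * E[:, i] / np.sqrt(lam[i])   # endpoint of the i-th scaled axis
            w = s * np.sqrt(lam[i]) * E[:, i]               # matching direction used by the estimator
            result.append((1.0 / (2 * d * alpha), a, w))
    return result

def loss_estimate(feedback, a, w, m, d):
    """Return m + d * (feedback - <m, a>) * w if an ellipsoid point was played, else m."""
    if w is None:
        return m.copy()
    return m + d * (feedback - np.dot(m, a)) * w

x = np.array([0.2, 0.1])
z = np.array([0.5, 0.3])
alpha = 2.0
H = np.array([[5.0, 1.0], [1.0, 3.0]])             # hypothetical positive definite Hessian
ell = np.array([0.3, -0.4])                        # loss vector (noise-free feedback for simplicity)
m = np.array([0.1, 0.0])                           # optimistic prediction

outs = outcomes(x, z, alpha, H)
mean_action = sum(p * a for p, a, _ in outs)
mean_estimate = sum(p * loss_estimate(np.dot(ell, a), a, w, m, len(x)) for p, a, w in outs)
```

In this sketch the estimator deviates from `m` only with probability `1/alpha`, so its mean square around `m` in the norm induced by the inverse Hessian carries a factor `1/alpha`, in line with the improvement described above. Setting `alpha = 1` recovers sampling from the unscaled ellipsoid.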
Reference point
We will see that the smaller the value of is, the smaller the variance of is, resulting in an improvement in regret. To take maximum advantage of this effect, we choose so that is as small as possible. More precisely, for a constant , we assume that satisfies
[TABLE]
for all . This assumption is used in our proof of -regret in stochastic environments.
Learning rate parameter
In the regret analysis in Section 4, we will show that the regret for the proposed algorithm is bounded as , where is defined as
[TABLE]
Intuitively, comes from the part of in (4), which is called stability terms, and comes from the part of , called penalty terms. To balance stability and penalty terms, we set by
[TABLE]
which leads to .
Optimistic prediction
To minimize the part of , we choose by using online projected gradient descent for . We set and update as follows:
[TABLE]
where is the learning rate parameter for updating .
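The idea of tracking a linear predictor with online projected gradient descent can be sketched as follows. The objective used here, the squared error between the predicted and observed loss of the played action, and the projection onto the unit ball are assumptions made for illustration; the paper's exact update (14) is not recoverable from this text. The names `update_prediction` and `step` are hypothetical.

```python
import numpy as np

def update_prediction(m, a, feedback, step):
    """One projected gradient step on 0.5 * (<m, a> - feedback)^2, then projection onto the unit ball."""
    grad = (np.dot(m, a) - feedback) * a
    m_new = m - step * grad
    norm = np.linalg.norm(m_new)
    return m_new if norm <= 1.0 else m_new / norm

rng = np.random.default_rng(2)
ell = np.array([0.4, -0.3, 0.2])    # hypothetical fixed loss vector (stochastic regime, no noise)
m = np.zeros(3)                     # start from the zero prediction
for _ in range(2000):
    a = rng.standard_normal(3)
    a /= np.linalg.norm(a)          # random unit-norm action, standing in for the algorithm's choice
    m = update_prediction(m, a, np.dot(ell, a), step=0.25)
error = np.linalg.norm(m - ell)
```

When the loss vector is fixed, the prediction approaches it and the residual that enters the loss estimator shrinks; when the loss vectors drift slowly, the same update follows them, which is the intuition behind the path-length bound.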
The proposed algorithm can be summarized as Algorithm 1 in Section B in the appendix.
Computational complexity
The procedure in each round can be performed in polynomial time in , except for the computation of . Indeed, given a self-concordant barrier for , we can solve an arbitrary linear optimization problem over (and thus also over ), with the aid of, e.g., interior point methods (Nesterov and Nemirovskii, 1994). This implies that convex optimization problems (2) can be solved in polynomial time as well. Further, for any , we can find an expression of as a convex combination of points in in polynomial time (Mirrokni et al., 2017; Schrijver, 1998, Corollary 11.4), which means that we can randomly choose so that . As for the calculation of satisfying (11), it is not clear at this point whether there is a computationally efficient way. Because we can compute the value of for any in polynomial time in , we can find minimizing this value in time, which can be exponential in .
4 Analysis
4.1 Regret bounds for the proposed algorithm
Theorem 2 (Regret bounds in the adversarial regime).
Let , and be parameters defined as in Table 1. The regret for Algorithm 1 is bounded as
[TABLE]
Further, if for any and , we have
[TABLE]
Note that the regret bounds in Theorem 2 are valid regardless of the choice of . In fact, we can demonstrate these regret bounds even if we sample from , as is similarly done with the algorithm by Abernethy et al. (2008b), which corresponds to . By way of contrast, to show -regret bounds for stochastic environments, we need the assumption of (11). Under this assumption, we have the following regret bounds:
Theorem 3 (Regret bounds in the corrupted stochastic regime).
Let and denote . Define and . We have , where we define . Further, if exists uniquely, under the assumption of (11), we have
[TABLE]
where .
Remark 3.
In standard settings of the stochastic regime, it is assumed that follows an identical distribution for different rounds and for all . Such assumptions are not, however, needed in Theorem 3. In other words, even when is non-zero and changes depending on , we still have the -regret bounds given in Theorem 3.
4.2 Proof sketch
Regret bounds in Theorems 2 and 3 are derived from the following lemma:
Lemma 5.
The regret for Algorithm 1 is bounded as follows:
[TABLE]
In proving this lemma, we use Lemmas 1, 3 and 4. From Lemma 4, the stability term in Lemma 1 is bounded by . From Lemma 3, we can bound the penalty term in Lemma 1 as . Combining these bounds, we obtain . From this and the definition of given in (13), we have the regret bound in Lemma 5. A complete proof of this lemma is given in Section D in the appendix.
From the result of tracking linear experts (Herbster and Warmuth, 2001), we obtain the following upper bound on .
Lemma 6.
If is given by (14), it holds for any sequence that
[TABLE]
This lemma is a special case of Theorem 11.4 by Cesa-Bianchi and Lugosi (2006).
Proof sketch of Theorem 2
By substituting for all in (19), we obtain . Similarly, by substituting for all in (19), we obtain . Combining these with Lemma 5, we obtain (15) in Theorem 2. Further, if , by substituting , we obtain , which leads to a regret bound of . This implies that (16) in Theorem 2 holds.
Proof sketch of Theorem 3
By setting for all in (19), we obtain . As we have , from this bound and Lemma 5, we have . We also have the following regret bound:
[TABLE]
From the assumption of (11), is bounded as
[TABLE]
where the second inequality follows from . The following lemma provides an upper bound on the right-hand side of this:
Lemma 7.
Suppose uniquely exists. It holds for any that
[TABLE]
By combining this lemma with (20) and (21), we obtain a bound depending on as follows: On the other hand, regret is bounded from below as . By combining these two bounds on , we obtain
[TABLE]
As implies , we have
[TABLE]
which means that (17) holds. A complete proof is given in Section G of the appendix.
Appendix A Related Work
Best-of-Both-Worlds Bandit Algorithms
Best-of-both-worlds algorithms have been developed for various settings of multi-armed bandit (MAB) problems, including the standard MAB problem [Bubeck and Slivkins, 2012, Seldin and Slivkins, 2014, Zimmert and Seldin, 2021, Ito et al., 2022, Honda et al., 2023], combinatorial semi-bandits [Zimmert et al., 2019, Ito, 2021, Tsuchiya et al., 2023b], partial monitoring problems [Tsuchiya et al., 2023a], episodic Markov decision processes [Jin and Luo, 2020, Jin et al., 2021], and linear bandits [Lee et al., 2021]. While most of these studies focus only on high-level adaptability, the algorithms by Ito et al. [2022], Tsuchiya et al. [2023a] for the MAB problem and combinatorial semi-bandit problems have low-level adaptability as well, similarly to our proposed algorithm. In fact, their algorithms are best-of-three-worlds algorithms with multiple data-dependent regret bounds as well as variance-adaptive regret bounds. Their algorithms are also similar to ours in that they are based on the optimistic follow-the-regularized-leader approach with an adaptive learning rate. As the class of linear bandit problems includes the multi-armed bandit problem, the results in this paper can be interpreted as an extension of their results. Regret bounds by Ito et al. [2022] are, however, better than ours in terms of the dependency on the dimensionality of the action set (or the number of arms) and in that they depend on arm-wise sub-optimality gaps.
Adversarial Corruption
There are several studies on the stochastic environment with adversarial corruption in the linear bandit problem [Li et al., 2019, Bogunovic et al., 2021, Lee et al., 2021] and the sibling problems such as the multi-armed bandits [Lykouris et al., 2018, Gupta et al., 2019, Zimmert and Seldin, 2021, Yang et al., 2020] and the linear Markov decision processes [Lykouris et al., 2021]. These studies and this paper differ in their assumptions and in their definitions of regret. This paper and Lee et al. [2021] assume that corruption depends only on information in the past rounds and is an affine function of the chosen action. On the other hand, Li et al. [2019], Bogunovic et al. [2021] allow corruption to be any (possibly non-linear) function. Furthermore, Bogunovic et al. [2021] consider the corruption that depends on the action chosen in that round. We also note that the definitions of the corruption level in these studies are slightly different. While this paper includes the corruption in regret, Li et al. [2019], Bogunovic et al. [2021], Lee et al. [2021] do not. It is known that we can convert one to the other by an additional -regret. Moreover, the regret bounds in these existing studies have linear terms with respect to $C$. Thus, our regret bound for the corrupted stochastic regime has the same dependence on the corruption as in these studies, but not vice versa.
Misspecified Linear Contextual Bandits
The corrupted stochastic regime is a special case of the misspecified linear contextual bandits without knowledge of the misspecification [Lattimore et al., 2020, Foster et al., 2020, Pacchiano et al., 2020, Takemura et al., 2021, Krishnamurthy et al., 2021]. (Footnote 2: Note that some studies assume oblivious adversary [Lattimore et al., 2020, Foster et al., 2020, Krishnamurthy et al., 2021], i.e., the approximation errors do not depend on the actions chosen in the past.) This problem assumes that the expected loss functions can be approximated by a linear function. While the approximation error can be any function of the information in the past and the current rounds in general, the corrupted stochastic regime assumes that the approximation error is an affine function of the action chosen in the current round. It is an open question whether the proposed algorithm can obtain a regret upper bound similar to the known regret bounds for this problem when the approximation error can be non-linear.
Appendix B Pseudocode of the proposed algorithm
Appendix C Proof of Lemma 4
For the convex function and , denote the Newton decrement at point by , i.e., .
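As a point of reference, the standard Newton decrement of a twice-differentiable convex function at a point is the square root of the gradient's quadratic form under the inverse Hessian. The sketch below computes it for a log-barrier objective on the positive orthant; the objective, the function names, and the vector `c` are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def newton_decrement(grad, hess):
    """Return sqrt(grad^T hess^{-1} grad), using a linear solve instead of an inverse."""
    step = np.linalg.solve(hess, grad)
    return float(np.sqrt(grad @ step))

def barrier_grad_hess(x, c):
    """Gradient and Hessian of f(x) = <c, x> - sum_i log(x_i) for x > 0."""
    grad = c - 1.0 / x
    hess = np.diag(1.0 / x**2)
    return grad, hess

c = np.array([1.0, 2.0])

# At the minimizer x_i = 1 / c_i the gradient vanishes, so the decrement is zero.
at_minimizer = newton_decrement(*barrier_grad_hess(np.array([1.0, 0.5]), c))

# Away from the minimizer the decrement is positive.
away = newton_decrement(*barrier_grad_hess(np.array([2.0, 0.5]), c))
```

For this objective the squared decrement equals the sum of `(c_i * x_i - 1) ** 2`, which makes it easy to check by hand that a small decrement corresponds to a point close to the minimizer.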
Lemma 8 (Theorem 2.2.1 by Nesterov and Nemirovskii [1994]).
Let be an open non-empty convex subset of a finite-dimensional real vector space. Let be a self-concordant function on and . Then, for each such that , we have
[TABLE]
Lemma 9 ((2.21) by Nemirovski [2004]).
Let be a self-concordant function on . If , we have
[TABLE]
where .
Lemma 10.
Let be a self-concordant function on and . Suppose that . Then, we have
[TABLE]
for all and .
Proof.
Using the Cauchy-Schwarz inequality and the AM-GM inequality, we have
[TABLE]
Thus, it is sufficient to show .
By Taylor’s theorem, we have for some where . It follows from Lemma 8 that
[TABLE]
∎
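For readers who want the first step of the proof of Lemma 10 spelled out, the Cauchy-Schwarz and AM-GM inequalities combine in the following generic way. Here $H$ is an arbitrary positive definite matrix, $g$ and $u$ are arbitrary vectors, $c > 0$ is an arbitrary constant, and $\|u\|_H = \sqrt{u^\top H u}$; these symbols are chosen for illustration and need not match the norms and constants in the paper's display.

```latex
\langle g, u \rangle
  \le \|g\|_{H^{-1}} \, \|u\|_{H}
  \le \frac{1}{2c} \|g\|_{H^{-1}}^{2} + \frac{c}{2} \|u\|_{H}^{2}
```

The first inequality is Cauchy-Schwarz applied to $H^{-1/2} g$ and $H^{1/2} u$, and the second is AM-GM applied to $\|g\|_{H^{-1}} / \sqrt{c}$ and $\sqrt{c} \, \|u\|_{H}$.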
Proof of Lemma 4
Let . Since is self-concordant, there exists such that . If we have , by Lemma 9, we obtain
[TABLE]
Thus, we obtain
[TABLE]
where the first inequality holds due to and the second inequality is derived from Lemma 10. Hence, it suffices to show . By the definition of , we have . Thus, we obtain
[TABLE]
where the inequality is obtained by the assumption. ∎
Appendix D Proof of Lemma 5
We first show that . The expectation of is
[TABLE]
where the fourth equality follows from .
Let us next show that defined by (10) is an unbiased estimator of . We have
[TABLE]
where we used , , and the fact that and are independent in the fifth equality.
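Unbiasedness arguments of this kind can be checked numerically on a small instance. The sketch below does so for the classical ellipsoidal-sampling estimator in the style of Abernethy et al. [2008a], not for the estimator defined in (10): it plays one of the $2d$ points at the ends of the principal axes of an ellipsoid around a center `x`, observes a single scalar loss, and forms a one-point estimate of the loss vector. All names and the choice of matrix are illustrative assumptions.

```python
import numpy as np

def sampling_points(x, hess):
    """The 2d actions x +/- lam_i^{-1/2} v_i, where (lam_i, v_i) are eigenpairs of hess."""
    eigvals, eigvecs = np.linalg.eigh(hess)
    points = []
    for lam, v in zip(eigvals, eigvecs.T):
        for sign in (1.0, -1.0):
            points.append((x + sign * v / np.sqrt(lam), sign, lam, v))
    return points

def expected_estimate(x, hess, loss_vec):
    """Exact expectation of the one-point estimate under uniform sampling of the 2d points."""
    d = len(x)
    points = sampling_points(x, hess)
    total = np.zeros(d)
    for action, sign, lam, v in points:
        observed = loss_vec @ action                  # bandit feedback: one scalar
        estimate = d * observed * sign * np.sqrt(lam) * v
        total += estimate / len(points)               # each point has probability 1/(2d)
    return total

hess = np.array([[2.0, 0.5], [0.5, 1.0]])             # any positive definite matrix
x = np.array([0.1, -0.2])
loss_vec = np.array([0.3, -0.7])
mean_estimate = expected_estimate(x, hess, loss_vec)
```

Enumerating all $2d$ outcomes avoids Monte Carlo error: the terms involving the center `x` cancel between the two signs, and the remaining terms sum to the projection of the loss vector onto an orthonormal basis, which recovers the loss vector itself.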
Suppose that holds without loss of generality. Let . Given , define by
[TABLE]
From this, (23) and (24), we have
[TABLE]
Then, as we have , we have . Hence, from Lemma 3, we have
[TABLE]
From this, (25) and Lemma 1, we have
[TABLE]
The part of can be bounded by using Lemma 4. From the definition (10), we have
[TABLE]
Hence, if , we have , and, consequently, we can apply Lemma 4 to bound the stability term as follows:
[TABLE]
where is defined in (12). Then, from this and (26), we have
[TABLE]
If is given by (13), we then have
[TABLE]
which yields
[TABLE]
We also have
[TABLE]
from the definition (13) of . Combining this with (28) and (29), we obtain
[TABLE]
which completes the proof.
Appendix E Proof of Theorem 2
Fix arbitrarily. By substituting for all in (19), we obtain
[TABLE]
Similarly, by substituting , we obtain
[TABLE]
By combining these with Lemma 5 and applying Jensen’s inequality, we obtain (15). Further, if , by substituting , we obtain
[TABLE]
By combining this with Lemma 5, we obtain
[TABLE]
which implies that (16) holds. ∎
Appendix F Proof of Lemma 7
As is the convex hull of , any point can be expressed as a convex combination of and a point in , which means that there exists and such that . For such , we have
[TABLE]
In fact, we have
[TABLE]
which means that (30) holds. We further have
[TABLE]
where the last inequality follows from the fact that and the definition of . Combining this with (30), we obtain
[TABLE]
We next show
[TABLE]
As is an ellipsoid centered at , it holds that
[TABLE]
We hence have
[TABLE]
where the inequality follows from the fact that . Combining (31) and (32), we obtain
[TABLE]
Appendix G Proof of Theorem 3
From Lemma 6 with , we have
[TABLE]
From this and Lemma 5, we have
[TABLE]
Under the assumption of (11), we have
[TABLE]
where the second inequality follows from and the last inequality follows from Lemma 7. From this, (33) and , we have
[TABLE]
On the other hand, is bounded from below as follows:
[TABLE]
Combining this with (34), we obtain
[TABLE]
As implies , we have
[TABLE]
References
- Abernethy et al. [2008a] J. Abernethy, E. E. Hazan, and A. Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. In 21st Annual Conference on Learning Theory, COLT 2008, pages 263–273, 2008a.
- Abernethy et al. [2008b] J. D. Abernethy, E. Hazan, and A. Rakhlin. An efficient algorithm for bandit linear optimization. In 21st Annual Conference on Learning Theory, 2008b.
- Abernethy et al. [2012] J. D. Abernethy, E. Hazan, and A. Rakhlin. Interior-point methods for full-information and bandit online learning. IEEE Transactions on Information Theory, 58(7):4164–4175, 2012.
- Audibert et al. [2007] J.-Y. Audibert, R. Munos, and C. Szepesvári. Tuning bandit algorithms in stochastic environments. In Algorithmic Learning Theory: 18th International Conference, ALT 2007, Sendai, Japan, October 1-4, 2007. Proceedings 18, pages 150–165. Springer, 2007.
- Bogunovic et al. [2021] I. Bogunovic, A. Losalka, A. Krause, and J. Scarlett. Stochastic linear bandits robust to adversarial attacks. In International Conference on Artificial Intelligence and Statistics, pages 991–999. PMLR, 2021.
- Bubeck and Slivkins [2012] S. Bubeck and A. Slivkins. The best of both worlds: Stochastic and adversarial bandits. In Conference on Learning Theory, pages 42–1. JMLR Workshop and Conference Proceedings, 2012.
- Bubeck et al. [2012] S. Bubeck, N. Cesa-Bianchi, and S. Kakade. Towards minimax policies for online linear optimization with bandit feedback. In Conference on Learning Theory, volume 23, pages 41.1–41.14, 2012.
- Cesa-Bianchi and Lugosi [2006] N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006.
