No-Regret Linear Bandits beyond Realizability
Chong Liu, Ming Yin, Yu-Xiang Wang

Chong Liu
Department of Computer Science
University of California
Santa Barbara, CA 93106, USA
Ming Yin
Department of Computer Science
University of California
Santa Barbara, CA 93106, USA
Yu-Xiang Wang
Department of Computer Science
University of California
Santa Barbara, CA 93106, USA
Abstract
We study linear bandits when the underlying reward function is not linear. Existing work relies on a uniform misspecification parameter $\epsilon$ that measures the sup-norm error of the best linear approximation. This results in an unavoidable linear regret whenever $\epsilon > 0$. We describe a more natural model of misspecification which only requires the approximation error at each input $x$ to be proportional to the suboptimality gap at $x$. It captures the intuition that, for optimization problems, near-optimal regions should matter more and we can tolerate larger approximation errors in suboptimal regions. Quite surprisingly, we show that the classical LinUCB algorithm, designed for the realizable case, is automatically robust against such gap-adjusted misspecification. It achieves a near-optimal $\sqrt{T}$ regret for problems where the best-known regret is almost linear in the time horizon $T$. Technically, our proof relies on a novel self-bounding argument that bounds the part of the regret due to misspecification by the regret itself.
1 Introduction
Stochastic linear bandit is a classical problem of online learning and decision-making with many influential applications, e.g., A/B testing [Claeys et al., 2021], recommendation systems [Chu et al., 2011], advertisement placements [Wang et al., 2021], clinical trials [Moradipari et al., 2020], hyperparameter tuning [Alieva et al., 2021], and new material discovery [Katz-Samuels et al., 2020].
More formally, a stochastic bandit is a sequential game between an agent who chooses a sequence of actions $x_t \in \mathcal{X}$ and nature who decides on a sequence of noisy observations (rewards) $y_t$ according to $y_t = f_0(x_t) + \text{noise}$ for some underlying function $f_0$. The goal of the learner is to minimize the cumulative regret the agent experiences over $T$ rounds relative to an oracle who knows the best action to choose ahead of time, i.e.,
$$R_T = \sum_{t} r_t = \sum_{t} \Big( \max_{x \in \mathcal{X}} f_0(x) - f_0(x_t) \Big),$$
where the sums run over the $T$ rounds and $r_t$ is called the instantaneous regret.
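As a small illustration of the regret definition above, the following sketch sums instantaneous regrets for a sequence of played actions. The reward values and the helper name `cumulative_regret` are hypothetical and chosen only for this example.

```python
def cumulative_regret(f0_values_played, f0_max):
    """Return (R_T, [r_t]) where r_t = f0_max - f0(x_t)."""
    instantaneous = [f0_max - v for v in f0_values_played]
    return sum(instantaneous), instantaneous


# Hypothetical run: the best achievable expected reward is 1.0, and the
# agent's expected rewards improve over four rounds.
total, per_round = cumulative_regret([0.2, 0.7, 1.0, 1.0], f0_max=1.0)
```

Once the agent plays a maximizer, the instantaneous regret is zero and the cumulative regret stops growing, which is the behaviour a no-regret algorithm approaches on average.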
Despite being highly successful in the wild, existing theory for stochastic linear bandits (or more generally learning-oracle based bandit problems [Foster et al., 2018, Foster and Rakhlin, 2020]) relies on a realizability assumption, i.e., the learner is given access to a function class $\mathcal{F}$ such that the true expected reward $f_0$ satisfies $f_0 \in \mathcal{F}$. Realizability is considered one of the strongest and most restrictive assumptions in the standard statistical learning setting, but in linear bandits, all known attempts to deviate from the realizability assumption result in a regret that grows linearly with $T$ [Ghosh et al., 2017, Lattimore et al., 2020, Zanette et al., 2020, Neu and Olkhovskaya, 2020, Bogunovic and Krause, 2021, Krishnamurthy et al., 2021].
In practical applications, it is often observed that feature-based representation of the actions with function approximations in estimating the reward can result in very strong policies even if the estimated reward functions are far from being correct [Foster et al., 2018].
So what went wrong? The critical intuition we rely on is the following:
It should be sufficient for the estimated reward function to clearly differentiate good actions from bad ones, rather than requiring it to perfectly estimate the rewards numerically.
Contributions. In this paper, we formalize this intuition by defining a new family of misspecified bandit problems based on a condition that adjusts the need for an accurate approximation pointwise at every $x$ according to the suboptimality gap at $x$. Unlike the existing misspecified linear bandit problems with a linear regret, our problem admits a nearly optimal $\tilde{O}(\sqrt{T})$ regret despite being heavily misspecified. Specifically:
- We define $\rho$-gap-adjusted misspecified ($\rho$-GAM) function approximations and characterize how they preserve important properties of the true function that are relevant for optimization.
- We show that the classical LinUCB algorithm [Abbasi-yadkori et al., 2011] can be used as is (up to some mild hyperparameters) to achieve an $\tilde{O}(\sqrt{T})$ regret under a moderate level of gap-adjusted misspecification ($\rho \le O(1/\sqrt{\log T})$). In comparison, the regret bound one can obtain under the corresponding uniform misspecification setting is only $\tilde{O}(T/\sqrt{\log T})$. This represents an exponential improvement in the average regret metric $R_T/T$.
To the best of our knowledge, the suboptimality gap-adjusted misspecification problem was not studied before and we are the first to obtain $\sqrt{T}$-style regrets without a realizability assumption.
Technical novelty. Due to misspecification, we have technical challenges that appear in bounding the instantaneous regret and parameter uncertainty region. We tackle the challenges by a self-bounding trick, i.e., bounding the instantaneous regret by the instantaneous regret itself, which can be of independent interest in more settings, e.g., Gaussian process bandit optimization and reinforcement learning.
2 Related Work
The problem of linear bandits was first introduced in Abe and Long [1999]. Then Auer et al. [2002] proposed the upper confidence bound to study linear bandits where the number of actions is finite. Based on it, Dani et al. [2008] proposed an algorithm based on confidence ellipsoids and then Abbasi-yadkori et al. [2011] simplified the proof with a novel self-normalized martingale bound. Later Chu et al. [2011] proposed a simpler and more robust linear bandit algorithm and showed that $\tilde{O}(\sqrt{dT})$ regret cannot be improved beyond a polylog factor. Li et al. [2019] further improved the regret upper and lower bounds, which characterized the minimax regret up to an iterated logarithmic factor. See Lattimore and Szepesvári [2020] for a detailed survey of linear bandits.
In terms of misspecification, Ghosh et al. [2017] first studied the misspecified linear bandit with a fixed action set. They found that LinUCB [Abbasi-yadkori et al., 2011] is not robust when misspecification is large. They showed that in a favourable case when one can test the linearity of the reward function, their RLB algorithm is able to switch between the linear bandit algorithm and the finite-armed bandit algorithm to address the misspecification issue and achieve $\tilde{O}(\min\{\sqrt{K}, d\}\sqrt{T})$ regret, where $K$ is the number of arms.
The most studied setting of model misspecification is uniform misspecification, where the $\ell_\infty$ distance between the best-in-class function and the true function is always upper bounded by some parameter $\epsilon$, i.e.,
Definition 1 ($\epsilon$-uniform misspecification).
We say function class $\mathcal{F}$ is an $\epsilon$-uniform misspecified approximation of $f_0$ if there exists $f \in \mathcal{F}$ such that $\sup_{x \in \mathcal{X}} |f(x) - f_0(x)| \le \epsilon$.
Under this definition, Lattimore et al. [2020] proposed the optimal design-based phased elimination algorithm for misspecified linear bandits and achieved $\tilde{O}(d\sqrt{T} + \epsilon\sqrt{d}\,T)$ regret when the number of actions is infinite. They also found that with a modified confidence band, LinUCB is able to achieve the same regret. With the same misspecification model, Foster and Rakhlin [2020] studied contextual bandits with a regression oracle, Neu and Olkhovskaya [2020] studied multi-armed linear contextual bandits, and Zanette et al. [2020] studied misspecified contextual linear bandits after reduction of the algorithm. All of their results suffer from linear regret. Later Bogunovic and Krause [2021] studied the misspecified Gaussian process bandit optimization problem and achieved $\tilde{O}(d\sqrt{T} + \epsilon\sqrt{d}\,T)$ regret when a linear kernel is used in the Gaussian process. Moreover, their lower bound shows that the $\Omega(\epsilon T)$ term is unavoidable in this setting.
Besides uniform misspecification, some works consider different definitions of misspecification. Krishnamurthy et al. [2021] define misspecification error as an expected squared error between the true function and the best-in-class function, where the expectation is taken over the distribution of the context space and action space. Foster et al. [2020] considered average misspecification, which is weaker than uniform misspecification and allows a tighter regret bound. However, they also have linear regret. Our work differs from all related work mentioned above because we work under a newly defined misspecification condition and show that LinUCB is a no-regret algorithm in this case.
Model misspecification is naturally addressed in the related agnostic contextual bandits setting [Agarwal et al., 2014], but these approaches typically require the action space to be finite, thus not directly applicable to our problem. In addition, empirical evidence [Foster et al., 2018] suggests that the regression oracle approach works better in practice than the agnostic approach even if realizability cannot be verified.
3 Preliminaries
3.1 Notations
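Let $[T]$ denote the integer set $\{1, 2, \ldots, T\}$. The algorithm runs in $T$ rounds in total. Let $f_0$ denote the true function, so the maximum function value is defined as $f_0^* = \max_{x \in \mathcal{X}} f_0(x)$ and the maximum point is defined as $x_* = \arg\max_{x \in \mathcal{X}} f_0(x)$. Let $\mathcal{X} \subset \mathbb{R}^d$ and $\mathcal{Y} \subset \mathbb{R}$ denote the domain and range of $f_0$. We use $\mathcal{W}$ to denote the parameter class of a family of linear functions $\mathcal{F} := \{f_w : \mathcal{X} \to \mathcal{Y} \mid w \in \mathcal{W}\}$ where $f_w(x) = w^\top x$. Define $w_*$ as the parameter of the best linear approximation function. and . For a vector $x$, its $\ell_p$ norm is denoted by $\|x\|_p$ and for a matrix $A$ its operator norm is denoted by $\|A\|_{\mathrm{op}}$. For a vector $x$ and a square matrix $A$, define $\|x\|_A^2 = x^\top A x$.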
3.2 Problem Setup
We consider the following optimization problem:
$$\max_{x \in \mathcal{X}} f_0(x),$$
where $f_0$ is the true function which might not be linear in $x$. We want to use a linear function $f_w(x) = w^\top x$ with $w \in \mathcal{W}$ to approximate $f_0$ and maximize $f_0$. At time $t$, after querying a data point $x_t$, we will receive a noisy feedback:
$$y_t = f_0(x_t) + \eta_t, \tag{1}$$
where $\eta_t$ is independent, zero-mean, and $\sigma$-sub-Gaussian noise.
The major highlight of our study is that we do not rely on the popular realizability assumption (i.e. $f_0 \in \mathcal{F}$) that is frequently assumed in the existing function approximation literature. Instead, we propose the following gap-adjusted misspecification condition.
Definition 2 ($\rho$-gap-adjusted misspecification).
We say a function $f$ is a $\rho$-gap-adjusted misspecified (or $\rho$-GAM in short) approximation of $f_0$ if for parameter $0 \le \rho < 1$,
$$\sup_{x \in \mathcal{X}} \left| \frac{f(x) - f_0(x)}{f_0^* - f_0(x)} \right| \le \rho.$$
We say function class $\mathcal{F}$ satisfies $\rho$-GAM for $f_0$, if there exists $f \in \mathcal{F}$ such that $f$ is a $\rho$-GAM approximation of $f_0$.
Observe that when $\rho = 0$, this recovers the standard realizability assumption, but when $\rho > 0$ it could cover many misspecified function classes.
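To make the condition concrete, the sketch below estimates the smallest $\rho$ for which a linear function is a $\rho$-GAM approximation of a nonlinear reward on a finite grid. The two functions, the grid, and the helper name `gam_ratio` are illustrative assumptions, not taken from the paper.

```python
import math


def gam_ratio(f, f0, grid):
    """Smallest rho with |f(x) - f0(x)| <= rho * (f0_max - f0(x)) on the grid.

    Returns math.inf if f and f0 disagree at a grid maximizer of f0.
    """
    f0_max = max(f0(x) for x in grid)
    rho = 0.0
    for x in grid:
        gap = f0_max - f0(x)
        err = abs(f(x) - f0(x))
        if gap <= 1e-12:
            if err > 1e-12:
                return math.inf
            continue
        rho = max(rho, err / gap)
    return rho


def f_lin(x):
    return x  # linear approximation, maximized at x = 1 on [0, 1]


def f_true(x):
    # Hypothetical nonlinear reward: equals f_lin at x = 1, wiggles elsewhere.
    return 1.0 - (1.0 - x) * (1.0 + 0.3 * math.sin(6.0 * x))


grid = [i / 1000 for i in range(1001)]
rho_hat = gam_ratio(f_lin, f_true, grid)
sup_error = max(abs(f_lin(x) - f_true(x)) for x in grid)
```

In this toy case the sup-norm error stays bounded away from zero, so the uniform misspecification level is positive, while the gap-adjusted ratio is strictly below one because the error shrinks as $x$ approaches the maximizer.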
Figure 1(a) shows a 1-dimensional example with and piece-wise linear function that satisfies local misspecification. With Definition 2, we have the following proposition.
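Proposition 3.
Let $f$ be a $\rho$-GAM approximation of $f_0$ (Definition 2). Then it holds:
- (Preservation of maximizers)
$$\arg\max_{x \in \mathcal{X}} f(x) = \arg\max_{x \in \mathcal{X}} f_0(x).$$
- (Preservation of max value)
$$\max_{x \in \mathcal{X}} f(x) = f_0^*.$$
- (Self-bounding property)
$$|f(x) - f_0(x)| \le \rho \left( f_0^* - f_0(x) \right) \quad \text{for all } x \in \mathcal{X}.$$
This tells us that $f$ and $f_0$ share the same global maximum points and the same global maximum value if Definition 2 is satisfied, while allowing $f$ and $f_0$ to be different (potentially by a large amount) at other locations. Therefore, Definition 2 is a “local” assumption that does not require $f$ to be uniformly close to $f_0$ (e.g. the “uniform” misspecification assumes $\sup_{x \in \mathcal{X}} |f(x) - f_0(x)| \le \epsilon$). The proof of Proposition 3 is given in Appendix A.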
In addition, we can modify Definition 2 with a slightly weaker condition that only requires $\arg\max_{x} f(x) = \arg\max_{x} f_0(x)$ but not necessarily $\max_{x} f(x) = \max_{x} f_0(x)$.
Definition 4 (Weak $\rho$-gap-adjusted misspecification).
Denote $f^* = \max_{x \in \mathcal{X}} f(x)$. Then we say $f$ is a (weak) $\rho$-gap-adjusted misspecified approximation of $f_0$ for a parameter $0 \le \rho < 1$ if:
$$\sup_{x \in \mathcal{X}} \left| \frac{f(x) - f^* + f_0^* - f_0(x)}{f_0^* - f_0(x)} \right| \le \rho.$$
See Figure 1(b) for an example satisfying Definition 4, in which there is a constant gap between $f$ and $f_0$. The idea of this weaker assumption is that we can always extend the function class by adding a single offset parameter w.l.o.g. to learn the constant gap. In the linear case, this amounts to homogenizing the feature vector by appending $1$. For this reason, we stick to Definition 2 and linear function approximation for conciseness and clarity in the main paper. See Appendix B for formal statements and Appendix C for proofs of the regret bound of linear bandits under Definition 4.
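The homogenization step mentioned above can be sketched as follows: appending a constant $1$ to each feature vector turns an affine model into a purely linear one in the lifted space. The parameter values and helper names are hypothetical.

```python
def homogenize(x):
    """Append a constant 1 so that a linear model can absorb an offset."""
    return list(x) + [1.0]


def linear(w, x):
    """Inner product w^T x for plain Python lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


w = [2.0, -1.0]   # hypothetical slope parameters
offset = 0.5      # hypothetical constant gap to be learned
x = [0.3, 0.4]

# w^T x + offset equals the purely linear form [w, offset]^T [x, 1].
affine_value = linear(w, x) + offset
lifted_value = linear(w + [offset], homogenize(x))
```

The lifted parameter has one extra coordinate, so the dimension entering any regret bound grows from $d$ to $d + 1$.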
Note that both Definition 2 and Definition 4 are defined generically and do not require any assumptions on the parametric form of $f$. While we focus on the linear bandit setting in this paper, this notion can be considered for arbitrary function approximation learning problems.
3.3 Assumptions
Assumption 5 (Boundedness).
For any $x \in \mathcal{X}$, $\|x\|_2 \le C_b$. For any $w \in \mathcal{W}$, $\|w\|_2 \le C_w$. Moreover, for any $x, \tilde{x} \in \mathcal{X}$, the true expected reward function satisfies $|f_0(x) - f_0(\tilde{x})| \le F$.
These are mild assumptions that we make for convenience. Relaxations of these are possible but not the focus of this paper. Note that the additional assumption on $f_0$ is not required when $f_0$ is realizable.
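Assumption 6.
Suppose $\mathcal{X} \subset \mathbb{R}^d$ is a compact set, and all the global maximizers of $f_0$ live on a $(d-1)$-dimensional hyperplane, i.e., $\exists a \in \mathbb{R}^d, b \in \mathbb{R}$, s.t.
$$\arg\max_{x \in \mathcal{X}} f_0(x) \subset \{x \in \mathbb{R}^d : x^\top a = b\}.$$
For instance, when $d = 1$, the above reduces to $f_0$ having a unique maximizer. This is a compatibility assumption for Definition 2, since any linear function that violates Assumption 6 will not satisfy Definition 2.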
In addition, to obtain an $\tilde{O}(\sqrt{T})$ regret, for any finite sample size $T$, we require the following condition.
Assumption 7 (Low misspecification).
The linear function class is a $\rho$-GAM approximation of $f_0$ with
[TABLE]
The condition $\rho = O(1/\sqrt{\log T})$ is required for technical reasons. Relaxing this condition for LinUCB may require fundamental breakthroughs that knock out logarithmic factors from its regret analysis. This will be further clarified in the proof. In general, however, we conjecture that this condition is not needed and there are algorithms that can achieve $\tilde{O}(\sqrt{T})$ regret for any $\rho < 1$, but a new algorithm needs to be designed.
While this assumption may suggest that we still require realizability in a truly asymptotic world, handling an $O(1/\sqrt{\log T})$ level of misspecification is highly non-trivial in the finite sample setting. For instance, if $T$ is a trillion, $1/\sqrt{\log(10^{12})} \approx 0.19$. This means that for most practical cases, LinUCB is able to tolerate a constant level of misspecification under the GAM model.
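The arithmetic behind this example is easy to reproduce. The sketch below evaluates $1/\sqrt{\log T}$ with the natural logarithm and with all problem-dependent constants dropped, so it indicates only the order of the tolerated misspecification, not the exact threshold in Assumption 7.

```python
import math


def misspecification_scale(T):
    """Evaluate 1 / sqrt(log T); constants from the assumption are omitted."""
    return 1.0 / math.sqrt(math.log(T))


scale_trillion = misspecification_scale(1e12)   # roughly 0.19
scale_million = misspecification_scale(1e6)     # larger, since T is smaller
```

The scale decays very slowly: raising the horizon from a million to a trillion rounds shrinks it by a factor of only $\sqrt{2}$.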
3.4 LinUCB Algorithm
We will focus on analyzing the classical Linear Upper Confidence Bound (LinUCB) algorithm due to [Dani et al., 2008, Abbasi-yadkori et al., 2011], shown below.
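For orientation, here is a minimal sketch of a LinUCB-style loop. It makes several assumptions that are not from the paper: a finite action set, a user-supplied confidence radius `beta_fn`, Gaussian noise, and a toy reward. It uses the standard fact that maximizing $w^\top x$ over the ellipsoid $\{w : \|w - \hat{w}\|_{\Sigma}^2 \le \beta\}$ gives $\hat{w}^\top x + \sqrt{\beta}\,\|x\|_{\Sigma^{-1}}$.

```python
import numpy as np


def linucb(actions, reward_fn, T, beta_fn, lam=1.0, noise_std=0.1, seed=0):
    """Run a LinUCB-style loop on a finite action set; return played indices.

    actions:   (K, d) array of feature vectors (finite set is an assumption).
    reward_fn: maps an action index to its expected reward f0(x).
    beta_fn:   maps round t to the confidence radius beta_t (user supplied).
    """
    rng = np.random.default_rng(seed)
    d = actions.shape[1]
    Sigma = lam * np.eye(d)          # regularized Gram matrix
    b = np.zeros(d)                  # running sum of y_i * x_i
    played = []
    for t in range(T):
        Sigma_inv = np.linalg.inv(Sigma)
        w_hat = Sigma_inv @ b        # ridge regression estimate
        # Optimistic value over {w : ||w - w_hat||_Sigma^2 <= beta_t}.
        widths = np.sqrt(np.einsum("kd,de,ke->k", actions, Sigma_inv, actions))
        ucb = actions @ w_hat + np.sqrt(beta_fn(t)) * widths
        k = int(np.argmax(ucb))
        y = reward_fn(k) + noise_std * rng.standard_normal()
        Sigma += np.outer(actions[k], actions[k])
        b += y * actions[k]
        played.append(k)
    return played


# Toy problem: rewards are nonlinear in x but share the maximizer of a line.
xs = np.linspace(0.0, 1.0, 21)
actions = np.stack([xs, np.ones_like(xs)], axis=1)   # features [x, 1]
f0 = 1.0 - (1.0 - xs) * (1.0 + 0.3 * np.sin(6.0 * xs))
played = linucb(actions, lambda k: f0[k], T=500, beta_fn=lambda t: 1.0)
regret = float(np.sum(f0.max() - f0[played]))
```

The constant `beta_fn` is a placeholder. The analysis in this paper relies on a specific, growing choice of the confidence radius, which this sketch does not reproduce.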
4 Main Results
In this section, we show that the classical LinUCB algorithm [Abbasi-yadkori et al., 2011] works in $\rho$-gap-adjusted misspecified linear bandits and achieves cumulative regret of order $\tilde{O}(\sqrt{T})$. The following theorem states the cumulative regret bound.
Theorem 8.
Suppose Assumptions 5, 6, and 7 hold. Set
[TABLE]
Then Algorithm 1 guarantees w.p. $> 1 - \delta$ simultaneously for all $T = 1, 2, \ldots$
[TABLE]
Remark 9.
The result shows that LinUCB achieves an $\tilde{O}(\sqrt{T})$ cumulative regret bound and thus it is a no-regret algorithm in $\rho$-gap-adjusted misspecified linear bandits. In contrast, LinUCB can only achieve $\tilde{O}(\sqrt{T} + \epsilon T)$ regret in uniformly misspecified linear bandits. Even if $\epsilon = \tilde{O}(1/\sqrt{\log T})$, the resulting regret $\tilde{O}(T/\sqrt{\log T})$ is still exponentially worse than ours.
Proof.
By the definition of cumulative regret, the function range absolute bound $F$, and the Cauchy-Schwarz inequality,
[TABLE]
Observe that the choice of $\beta_t$ is monotonically increasing in $t$. Also by Lemma 14, we get that with probability $> 1 - \delta$, $w_* \in \mathrm{Ball}_t$ for all $t = 1, 2, \ldots$, which verifies the condition to apply Lemma 12 simultaneously for all $T$, thereby completing the proof. ∎
4.1 Regret Analysis
The proof follows the LinUCB analysis closely. The main innovation is a self-bounding argument that controls the regret due to misspecification by the regret itself. This appears in Lemma 11 and then again in the proof of Lemma 14.
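Before we proceed, let $\Delta_t$ denote the deviation term of our linear function from the true function at $x_t$, formally,
$$\Delta_t = f_0(x_t) - w_*^\top x_t. \tag{6}$$
And our observation model (eq. (1)) becomes
$$y_t = f_0(x_t) + \eta_t = w_*^\top x_t + \Delta_t + \eta_t. \tag{7}$$
Moreover, we have the following lemma showing the property of the deviation term $\Delta_t$.
Lemma 10 (Bound of deviation term).
For all $t$,
$$|\Delta_t| \le \frac{\rho}{1-\rho} \left| w_*^\top x_* - w_*^\top x_t \right|.$$
Proof.
Recall the definition of the deviation term $\Delta_t$ in eq. (6):
$$\Delta_t = f_0(x_t) - w_*^\top x_t.$$
By Definition 2, for all $t$,
$$\begin{aligned}
-\rho \left(f_0^* - f_0(x_t)\right) &\le \Delta_t \le \rho \left(f_0^* - f_0(x_t)\right), \\
-\rho \left(f_0^* - w_*^\top x_t - \Delta_t\right) &\le \Delta_t \le \rho \left(f_0^* - w_*^\top x_t - \Delta_t\right), \\
-\rho \left(w_*^\top x_* - w_*^\top x_t - \Delta_t\right) &\le \Delta_t \le \rho \left(w_*^\top x_* - w_*^\top x_t - \Delta_t\right), \\
-\frac{\rho}{1-\rho} \left(w_*^\top x_* - w_*^\top x_t\right) &\le \Delta_t \le \frac{\rho}{1+\rho} \left(w_*^\top x_* - w_*^\top x_t\right),
\end{aligned}$$
where the third line is by Proposition 3 and the proof completes by taking the absolute value of the lower and upper bounds. ∎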
Next, we prove instantaneous regret bound and its sum of squared regret version in the following two lemmas:
Lemma 11 (Instantaneous regret bound).
Define , assume then for each
[TABLE]
Proof.
By definition of instantaneous regret,
[TABLE]
where the inequality is by Definition 2. Therefore, by rearranging the inequality we have
[TABLE]
where the last inequality is by Lemma 13. ∎
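The rearrangement used in this proof can be spelled out as a short worked derivation. This is a sketch under assumed notation: $r_t = f_0^* - f_0(x_t)$ is the instantaneous regret, $w_*$ is the parameter of the $\rho$-GAM linear approximation, $x_*$ is a maximizer of $f_0$, and $\Delta_t = f_0(x_t) - w_*^\top x_t$ is the deviation term.

```latex
\begin{aligned}
r_t &= f_0^* - f_0(x_t) \\
    &= w_*^\top x_* - w_*^\top x_t - \Delta_t
       && \text{(Proposition 3: } f_0^* = w_*^\top x_* \text{)} \\
    &\le w_*^\top x_* - w_*^\top x_t + \rho\, r_t
       && \text{(self-bounding property: } |\Delta_t| \le \rho\, r_t \text{)} \\
\Longrightarrow\quad (1-\rho)\, r_t &\le w_*^\top x_* - w_*^\top x_t .
\end{aligned}
```

Since $\rho < 1$, dividing by $1 - \rho$ bounds the regret by the gap of the linear surrogate, which is then controlled by a confidence-width argument. The regret appears on both sides of the inequality, which is the self-bounding step.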
Lemma 12.
Assume is monotonically nondecreasing and for all , then
[TABLE]
Proof.
By definition and Lemma 11,
[TABLE]
where the second inequality is by the monotonic increasing property of and the last inequality uses the elliptical potential lemma (Lemma 16). ∎
The previous two lemmas rely on the following lemma, which bounds the gap between $f_0^*$ and the linear function value at $x_t$.
Lemma 13.
Define and assume is chosen such that . Then
[TABLE]
Proof.
Let denote the parameter that achieves , by the optimality of ,
[TABLE]
where the second inequality applies Holder’s inequality; the last line uses the definition of (note that both ∎
4.2 Confidence Analysis
All analysis in the previous section requires $w_* \in \mathrm{Ball}_t$. In this section, we show that our choice of $\beta_t$ in (5) is valid and $w_*$ is trapped in the uncertainty set $\mathrm{Ball}_t$ with high probability.
Lemma 14 (Feasibility of $\mathrm{Ball}_t$).
Suppose Assumptions 5, 6, and 7 hold. Set $\beta_t$ as in eq. (5). Then, w.p. $> 1 - \delta$,
[TABLE]
Proof.
By setting the gradient of the objective function in eq. (4) to be $0$, we obtain the closed form solution of eq. (4):
[TABLE]
Therefore,
[TABLE]
where the second equation is by eq. (7) and the first two terms of eq. (8) can be further simplified as
[TABLE]
where the second equation is by definition of (eq. (3)). Therefore, eq. (8) can be rewritten as
[TABLE]
Multiply both sides by and we have
[TABLE]
Take a square of both sides and apply generalized triangle inequality, we have
[TABLE]
The remaining task is to bound these three terms separately. The first term of eq. (9) is bounded as
[TABLE]
where the first inequality is by definition of and and the second inequality is by choice of .
The second term of eq. (9) can be bounded by Lemma 17 and Lemma 20:
[TABLE]
where is chosen as so that the total failure probabilities over rounds can always be bounded by :
[TABLE]
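A per-round failure budget that decays like $1/t^2$ keeps the total failure probability bounded no matter how many rounds are played. The sketch below checks this numerically for a hypothetical schedule $\delta_t = 3\delta/(\pi^2 t^2)$, which sums to $\delta/2$ because $\sum_{t \ge 1} 1/t^2 = \pi^2/6$. The exact schedule used in the proof is not reproduced here.

```python
import math


def delta_t(t, delta):
    """Hypothetical per-round failure budget, decaying like 1 / t^2."""
    return 3.0 * delta / (math.pi ** 2 * t ** 2)


delta = 0.05
partial_sum = sum(delta_t(t, delta) for t in range(1, 100001))
limit = delta / 2.0   # value of the infinite sum, via sum 1/t^2 = pi^2 / 6
```

Every finite partial sum stays strictly below the limit, so a union bound over any number of rounds costs at most that much failure probability.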
And the third term of eq. (9) can be bounded as
[TABLE]
where the last line is by taking the absolute value and Cauchy-Schwarz inequality. Continue the proof and we have
[TABLE]
where the first inequality is due to Cauchy-Schwarz inequality and the second uses the self-bounding properties from Proposition 3 and Lemma 15.
To put things together, we have shown that w.p. , for any ,
[TABLE]
where we condition on (10) for the rest of the proof.
Observe that this implies that the feasibility of $w_*$ in $\mathrm{Ball}_t$ can be enforced if we choose $\beta_t$ to be larger than (10). The feasibility of $w_*$ in turn allows us to apply Lemma 11 to bound the RHS with . We will use induction to prove that our choice
[TABLE]
is valid, where short hand
[TABLE]
For the base case , by eq. (10) and the definition of we directly have . Assume our choice of is feasible for , then we can write
[TABLE]
where the second line is due to non-decreasing property of . Then by Lemma 16 and Assumption 7, we have
[TABLE]
The critical difference from the standard LinUCB analysis here is that if appears on the LHS of the bound and if its coefficient is larger, any valid bound for will have to grow exponentially in . This is where Assumption 7 helps us. Assumption 7 ensures that the coefficient of is smaller than , so we can take and move to the right-hand side. ∎
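The closed-form solution obtained at the start of this proof is the usual ridge regression estimator. As a sanity check, the sketch below assumes the objective $\lambda \|w\|_2^2 + \sum_i (w^\top x_i - y_i)^2$ (an assumption about the form of eq. (4)) and verifies on synthetic data that its gradient vanishes at $\hat{w} = (\lambda I + \sum_i x_i x_i^\top)^{-1} \sum_i y_i x_i$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 3, 1.0
X = rng.standard_normal((n, d))              # rows are queried points x_i
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(n)

# Closed-form minimizer of lam * ||w||^2 + sum_i (w^T x_i - y_i)^2.
Sigma = lam * np.eye(d) + X.T @ X
w_hat = np.linalg.solve(Sigma, X.T @ y)

# Gradient of the objective at w_hat; it should vanish up to rounding.
grad = 2.0 * lam * w_hat + 2.0 * X.T @ (X @ w_hat - y)
```

Using `np.linalg.solve` avoids forming the inverse explicitly, which is the more stable choice when only the estimate is needed.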
The proof of the previous lemma needs the following two lemmas.
Lemma 15** (Upper bound of ).**
[TABLE]
Proof.
Recall that .
[TABLE]
The last line follows from the fact that is positive semidefinite. ∎
Lemma 16** (Upper bound of (adapted from Abbasi-yadkori et al. [2011])).**
[TABLE]
Proof.
First we prove that . Recall the definition of and we know is a positive semidefinite matrix and thus . To prove , we need to decompose and write
[TABLE]
Let and it becomes
[TABLE]
By Sherman-Morrison lemma (Lemma 18), we have
[TABLE]
Next we use the fact that and we have
[TABLE]
where the last two lines are by Lemma 19 and Lemma 20. ∎
5 Technical Lemmas
Lemma 17 (Self-normalized bound for vector-valued martingales (Lemma A.9 of Agarwal et al. [2021])).
Let be a real-valued stochastic process with corresponding filtration such that is measurable, , and is conditionally -sub-Gaussian with . Let be a stochastic process with (some Hilbert space) and being measurable. Assume that a linear operator is positive definite, i.e., for any . For any , define the linear operator (here denotes outer-product in ). With probability at least , we have for all :
[TABLE]
Lemma 18 (Sherman-Morrison lemma [Sherman and Morrison, 1950]).
Let denote a matrix and denote two vectors. Then
[TABLE]
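The Sherman-Morrison identity in its standard form reads $(A + uv^\top)^{-1} = A^{-1} - \frac{A^{-1} u v^\top A^{-1}}{1 + v^\top A^{-1} u}$. The sketch below checks it numerically for the symmetric rank-one update $A + xx^\top$ that arises when a Gram matrix is updated with a new feature vector. The matrix and vector are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
M = rng.standard_normal((d, d))
A = M @ M.T + np.eye(d)          # symmetric positive definite, hence invertible
x = rng.standard_normal(d)

A_inv = np.linalg.inv(A)
denom = 1.0 + x @ A_inv @ x      # at least 1 because A_inv is positive definite
rank_one_update = A_inv - np.outer(A_inv @ x, x @ A_inv) / denom
direct = np.linalg.inv(A + np.outer(x, x))
```

The update costs $O(d^2)$ operations instead of the $O(d^3)$ needed for a fresh inverse.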
Lemma 19 (Lemma 6.10 of Agarwal et al. [2021]).
Define and we have
[TABLE]
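The determinant identity behind this lemma, in its standard form, states that with $\Sigma_{t+1} = \Sigma_t + x_t x_t^\top$ one has $\det \Sigma_T = \det \Sigma_0 \prod_{t} (1 + \|x_t\|_{\Sigma_t^{-1}}^2)$. The sketch below checks it on random vectors, assuming $\Sigma_0 = \lambda I$. The dimensions and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
d, T, lam = 3, 20, 1.0
Sigma = lam * np.eye(d)
product = 1.0
for _ in range(T):
    x = rng.standard_normal(d)
    # 1 + ||x||^2 measured in the Sigma^{-1} norm, using the current Sigma.
    product *= 1.0 + x @ np.linalg.solve(Sigma, x)
    Sigma = Sigma + np.outer(x, x)

det_ratio = np.linalg.det(Sigma) / lam ** d   # det(Sigma_T) / det(Sigma_0)
```

Each factor follows from the matrix determinant lemma applied to a single rank-one update.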
Lemma 20 (Potential function bound (Lemma 6.11 of Agarwal et al. [2021])).
For any sequence such that for , we have
[TABLE]
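In its standard form, the potential function bound says that if $\|x_t\|_2 \le C_b$ for all $t$ and $\Sigma_0 = \lambda I$, then $\log(\det \Sigma_T / \det \Sigma_0) \le d \log(1 + T C_b^2 / (d\lambda))$. The sketch below checks this on random vectors whose norms are clipped to a hypothetical $C_b$. The dimensions, horizon, and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
d, T, lam, C_b = 3, 200, 1.0, 2.0
X = rng.standard_normal((T, d))
norms = np.linalg.norm(X, axis=1, keepdims=True)
X = X * np.minimum(1.0, C_b / norms)       # rescale so every row has norm <= C_b

Sigma_T = lam * np.eye(d) + X.T @ X
_, logdet_T = np.linalg.slogdet(Sigma_T)
log_ratio = logdet_T - d * np.log(lam)     # log(det Sigma_T / det Sigma_0)
bound = d * np.log(1.0 + T * C_b ** 2 / (d * lam))
```

The bound follows from the arithmetic-geometric mean inequality applied to the eigenvalues of $\Sigma_T / \lambda$, whose trace is at most $d + T C_b^2 / \lambda$.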
6 Conclusion
We study linear bandits with the underlying reward function being non-linear, which falls into the misspecified bandit framework. Existing work on misspecified bandits usually assumes uniform misspecification, where the $\ell_\infty$ distance between the best-in-class function and the true function is upper bounded by the misspecification parameter $\epsilon$. An existing lower bound shows that the $\Omega(\epsilon T)$ term is unavoidable, where $T$ is the time horizon, thus the regret bound is always linear. However, in solving optimization problems, one only cares about the approximation error near the global optimal point, and the approximation error is allowed to be large in highly suboptimal regions. In this paper, we capture this intuition and define a natural model of misspecification, called $\rho$-gap-adjusted misspecification, which only requires the approximation error at each input $x$ to be proportional to the suboptimality gap at $x$, with $\rho$ being the proportion parameter.
Previous work found that the classical LinUCB algorithm is not robust in $\epsilon$-uniform misspecified linear bandits when $\epsilon$ is large. However, we show that LinUCB is automatically robust against such gap-adjusted misspecification. Under mild conditions, e.g., $\rho \le O(1/\sqrt{\log T})$, we prove that it achieves the near-optimal $\tilde{O}(\sqrt{T})$ regret for problems where the best-known regret is almost linear. Also, LinUCB doesn’t need the knowledge of $\rho$ to run. However, if an upper bound of $\rho$ is revealed to LinUCB, the $\beta_t$ term can be carefully chosen according to eq. (11). Our technical novelty lies in a new self-bounding argument that bounds part of the regret due to misspecification by the regret itself, which can be of independent interest in more settings.
We believe our analysis for LinUCB is tight and the requirement that $\rho = O(1/\sqrt{\log T})$ is essential, but we conjecture that there is a different algorithm that could handle constant $\rho$ or even the case when $\rho$ approaches $1$ at a rate of $O(1/\sqrt{T})$. We leave the resolution of this conjecture as future work. For completeness, we include a simulation section in Appendix D.
More broadly, our paper opens a new direction for research in model misspecification, including misspecified linear bandits, misspecified kernelized bandits, and even reinforcement learning with misspecified function approximation. Moreover, we hope our paper prompts a rethinking of the relationship between function optimization and function approximation. In the future, much more can be done. For example, one can design a new no-regret algorithm that works under the gap-adjusted misspecification framework where $\rho$ is a constant, and study $\rho$-gap-adjusted misspecified Gaussian process bandit optimization.
Acknowledgments
The work was partially supported by NSF Awards #2007117 and #2003257. We thank Ilija Bogunovic for the discussion at the early stage of this paper. Finally, we thank UAI reviewers and the area chair for their valuable input that led to improvements to the paper.
Appendix A Proof of Proposition 3
Equivalently, $\rho$-gap-adjusted misspecification (Definition 2) satisfies
$$|f(x) - f_0(x)| \le \rho \left(f_0^* - f_0(x)\right), \quad \forall x \in \mathcal{X}. \tag{12}$$
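Proof of preservation of max value: $\max_{x \in \mathcal{X}} f(x) = f_0^*$.
Let $f^* = \max_{x \in \mathcal{X}} f(x)$. We first prove $f^* \le f_0^*$ by contradiction. Suppose $f^* > f_0^*$. Since $\mathcal{X}$ is compact, there exists $\tilde{x} \in \mathcal{X}$ such that $f(\tilde{x}) = f^* > f_0^*$. Then by eq. (12) this implies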
[TABLE]
Contradiction! Therefore, $f^* \le f_0^*$. On the other hand, choose $x = x_*$, then by (12) $f(x_*) = f_0(x_*) = f_0^*$. This implies $f^* \ge f_0^*$. Combining both results, we obtain $f^* = f_0^*$. ∎
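Proof of preservation of maximizers: $\arg\max_{x} f(x) = \arg\max_{x} f_0(x)$.
Using that $f^* = f_0^*$ and that $f(x_*) = f_0^*$ for every maximizer $x_*$ of $f_0$, it is easy to verify $\arg\max_{x} f_0(x) \subset \arg\max_{x} f(x)$. On the other hand, if $x \in \arg\max_{x} f(x)$, then by eq. (12) $f_0^* - f_0(x) = |f(x) - f_0(x)| \le \rho \left(f_0^* - f_0(x)\right)$, and since $\rho < 1$ this means $f_0(x) = f_0^*$, i.e., $x \in \arg\max_{x} f_0(x)$. ∎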
Proof of self-bounding property.
This directly comes from the definition. ∎
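Appendix B Property of Weak $\rho$-Gap-Adjusted Misspecification
First we recall Definition 4.
Definition 21 (Restatement of Weak $\rho$-gap-adjusted misspecification).
Denote $f^* = \max_{x \in \mathcal{X}} f(x)$. Then we say $f$ is a (weak) $\rho$-gap-adjusted misspecified approximation of $f_0$ for a parameter $0 \le \rho < 1$ if:
$$\sup_{x \in \mathcal{X}} \left| \frac{f(x) - f^* + f_0^* - f_0(x)}{f_0^* - f_0(x)} \right| \le \rho.$$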
Under the weak $\rho$-gap-adjusted misspecification condition, it no longer holds that $f^* = f_0^*$. However, the condition still preserves the maximizers.
Proposition 22.
Under the weak $\rho$-gap-adjusted misspecification condition, it holds
[TABLE]
Proof.
Suppose , then by definition
[TABLE]
On the other hand, if , then
[TABLE]
∎
The next proposition shows that the weak $\rho$-gap-adjusted misspecification condition characterizes the suboptimality gap between $f$ and $f_0$.
Proposition 23.
Denote , , then the weak -gap-adjusted misspecification condition implies:
[TABLE]
This can be proved directly by the triangular inequality. This reveals the weak -gap-adjusted misspecification condition requires to live in the band , and the concrete maximum values and can be arbitrarily different.
Appendix C Linear Bandits under the Weak $\rho$-Gap-Adjusted Misspecification
We need to slightly modify LinUCB [Abbasi-yadkori et al., 2011] and work with the following LinUCBw algorithm.
Theorem 24.
Suppose Assumptions 5, 6, and 7 hold. W.l.o.g., assuming . Set
[TABLE]
Then Algorithm 2 guarantees w.p. simultaneously for all
[TABLE]
Remark 25.
The result again shows that the LinUCBw algorithm achieves $\tilde{O}(\sqrt{T})$ cumulative regret and thus it is also a no-regret algorithm under the weaker condition (Definition 4). Note that Definition 4 is quite weak: it does not even require the true function to sit within the approximation function class.
Proof.
The analysis is similar to the -gap-adjusted case but includes . For instance, let denote the deviation term of our linear function from the true function at , then
[TABLE]
And our observation model (eq. (1)) becomes
[TABLE]
Then similar to Lemma 10, we have the following lemma, whose proof is nearly identical to Lemma 10.
Lemma 26 (Bound of deviation term).
,
[TABLE]
We also provide the following lemma, which is the counterpart of Lemma 13.
Lemma 27.
Define and assume is chosen such that . Then
[TABLE]
Proof.
Let denote the parameter that achieves , by the optimality of ,
[TABLE]
where the second inequality applies Holder’s inequality; the last line uses the definition of (note that both ∎
The rest of the analysis follows the analysis of Theorem 8. ∎
Appendix D Simulation
In this section, we run a simulation on a -dimensional test function shown in Figure 2(a). Here we run the first iterations with uniform sampling and the remaining iterations are using LinUCB algorithm. In Figure 2(b) we can see that cumulative regret is increasing with uniform sampling but it doesn’t increase when running LinUCB. The reason behind it is that under the gap-adjusted misspecification, LinUCB is able to quickly find the optimal point .
References
- Abbasi-yadkori et al. [2011] Yasin Abbasi-yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. Advances in Neural Information Processing Systems, 24, 2011.
- Abe and Long [1999] Naoki Abe and Philip M. Long. Associative reinforcement learning using linear probabilistic concepts. In International Conference on Machine Learning, 1999.
- Agarwal et al. [2014] Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, 2014.
- Agarwal et al. [2021] Alekh Agarwal, Nan Jiang, Sham M. Kakade, and Wen Sun. Reinforcement learning: Theory and algorithms, 2021.
- Alieva et al. [2021] Ayya Alieva, Ashok Cutkosky, and Abhimanyu Das. Robust pure exploration in linear bandits with limited budget. In International Conference on Machine Learning, 2021.
- Auer et al. [2002] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235–256, 2002.
- Bogunovic and Krause [2021] Ilija Bogunovic and Andreas Krause. Misspecified Gaussian process bandit optimization. Advances in Neural Information Processing Systems, 34, 2021.
- Chu et al. [2011] Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In International Conference on Artificial Intelligence and Statistics, 2011.
