Best-of-Three-Worlds Linear Bandit Algorithm with Variance-Adaptive Regret Bounds

Shinji Ito; Kei Takemura

arXiv:2302.12370·cs.LG·February 27, 2023

TL;DR

This paper introduces a hierarchical adaptive linear bandit algorithm that achieves optimal regret bounds across adversarial and stochastic environments, with variance-adaptive performance and data-dependent regret guarantees.

Contribution

It presents a novel hierarchical algorithm that attains best-of-three-worlds regret bounds and incorporates new techniques for high-level and low-level adaptability in linear bandits.

Findings

01

Achieves ${O}(\sqrt{T \log T})$ regret in adversarial settings.

02

Attains $O(\frac{\log T}{\Delta_{\min}} + \sqrt{\frac{C \log T}{\Delta_{\min}}})$ regret in stochastic environments with corruption.

03

Provides variance-adaptive regret bounds of $O(\frac{\sigma^2 \log T}{\Delta_{\min}})$.

Abstract

This paper proposes a linear bandit algorithm that is adaptive to environments at two different levels of hierarchy. At the higher level, the proposed algorithm adapts to a variety of types of environments. More precisely, it achieves best-of-three-worlds regret bounds, i.e., of $O(\sqrt{T\log T})$ for adversarial environments and of $O(\frac{\log T}{\Delta_{\min}}+\sqrt{\frac{C\log T}{\Delta_{\min}}})$ for stochastic environments with adversarial corruptions, where $T$, $\Delta_{\min}$, and $C$ denote, respectively, the time horizon, the minimum sub-optimality gap, and the total amount of the corruption. Note that polynomial factors in the dimensionality are omitted here. At the lower level, in each of the adversarial and stochastic regimes, the proposed algorithm adapts to certain environmental characteristics, thereby performing better. The proposed algorithm has data-dependent…

Tables

Table 1: List of parameters in regret bounds.

Parameter	Description
$T \in \mathbb{N}$	time horizon
$d \in \mathbb{N}$	dimensionality of action set
$\mathcal{A} \subseteq \mathbb{R}^{d}$	action set (one may assume $\log(|\mathcal{A}|) = O(d \log T)$ w.l.o.g.)
$\vartheta \geq 1$	parameter of a self-concordant barrier $\psi$ used in the algorithm (one can choose $\psi$ so that $\vartheta = O(d)$)
$\Delta_{\min} > 0$	minimum suboptimality gap: $\Delta_{\min} = \min_{a \in \mathcal{A} \setminus \{a^{*}\}} \langle \ell^{*}, a - a^{*} \rangle$
$c^{*} > 0$	asymptotic lower bound parameter: $c^{*} = c(\mathcal{A}, \ell^{*}) = O(d / \Delta_{\min})$
$\sigma^{2} \geq 0$	maximum variance of loss: $\sigma^{2} = \max_{a \in \mathcal{A}, t} \mathbf{E}[(f_{t}(a) - \langle \ell^{*}, a \rangle)^{2}]$
$L^{*} \geq 0$	minimum cumulative loss: $L^{*} = \min_{a^{*} \in \mathcal{A}} \mathbf{E}[\sum_{t=1}^{T} f_{t}(a^{*})]$
$Q \geq 0$	total quadratic variation in loss sequence: $Q = \min_{\bar{\ell} \in \mathbb{R}^{d}} \mathbf{E}[\sum_{t=1}^{T} \|\ell_{t} - \bar{\ell}\|_{2}^{2}]$
$P \geq 0$	path-length of loss sequence: $P = \mathbf{E}[\sum_{t=1}^{T-1} \|\ell_{t} - \ell_{t+1}\|_{2}]$

Table 2: Regret bounds for stochastic and adversarial linear bandits. Bounds depending on $L^{*}$ are applicable when $f_{t}(a)\geq 0$. Bounds depending on $Q$ or $P$ are applicable when $\varepsilon_{t}(a)=0$.

Algorithm	Stochastic	Adversarial
Bubeck et al. (2012)	-	$O(\sqrt{d T \log(|\mathcal{A}|)})$
Abernethy et al. (2008a)	-	$O(d \sqrt{\vartheta T \log T})$
Ito (2021)	-	$O(d \sqrt{\min\{T, L^{*}, Q, P\}} (\log T)^{2})$
Lattimore and Szepesvari (2017)	$c^{*} \log T + o(\log T)$	-
Lee et al. (2021)	$O(c^{*} \log(T |\mathcal{A}|) \log T)$	$O(\sqrt{d T} \log(T |\mathcal{A}| \log T))$
[This work]	$O((\frac{d \sigma^{2}}{\Delta_{\min}} + 1) d \vartheta \log T)$	$O(d \sqrt{\vartheta \min\{T, L^{*}, Q, P\} \log T})$

Equations

$f_{t}(a) = \langle \ell_{t}, a \rangle + \varepsilon_{t}(a), \quad \text{where} \quad \mathbf{E}[\varepsilon_{t}(a) \mid (a_{s})_{s=1}^{t-1}] = \xi_{t} \quad \text{for all } a \in \mathcal{A}.$

$x_{t} \in \operatorname*{arg\,min}_{x \in \mathcal{X}} \left\{ \left\langle m_{t} + \sum_{s=1}^{t-1} \hat{\ell}_{s}, x \right\rangle + \psi_{t}(x) \right\},$

$D_{\psi}(x, y) = \psi(x) - \psi(y) - \langle \nabla\psi(y), x - y \rangle,$

$\sum_{t=1}^{T} \langle \hat{\ell}_{t}, x_{t} - x^{*} \rangle \leq \sum_{t=1}^{T} \left( \langle \hat{\ell}_{t} - m_{t}, x_{t} - \tilde{x}_{t+1} \rangle - D_{\psi_{t}}(\tilde{x}_{t+1}, x_{t}) \right) + \psi_{T+1}(x^{*}),$

$\|h\|_{x,\psi} = \sqrt{h^{\top} \nabla^{2}\psi(x) h}, \quad \|h\|_{x,\psi}^{*} = \sqrt{h^{\top} (\nabla^{2}\psi(x))^{-1} h}$

$W_{r}(x) = \{ y \in \mathbb{R}^{d} \mid \|y - x\|_{x,\psi} \leq r \}.$

$\pi_{z,\mathcal{X}}(x) = \inf \{ r > 0 \mid z + r^{-1}(x - z) \in \mathcal{X} \}.$

$\mathcal{E}_{t}' = \{ z_{t} + \alpha_{t}(x - z_{t}) \mid x \in \mathcal{E}_{t} \},$

$r_{t} = \inf \{ r > 0 \mid z_{t} + r^{-1}(x - z_{t}) \in \mathcal{X} \text{ for all } x \in \mathcal{E}_{t} \} = \max_{x \in \mathcal{E}_{t}} \pi_{z_{t},\mathcal{X}}(x),$

$\hat{\ell}_{t} = m_{t} + d\, b_{t} v_{t} \lambda_{i_{t}}^{1/2} \left( f_{t}(a_{t}) - \langle m_{t}, a_{t} \rangle \right) e_{i_{t}}.$

$r_{t} = \max_{x \in \mathcal{E}_{t}} \pi_{z_{t},\mathcal{X}}(x) \leq \kappa \cdot \min_{z \in \mathcal{A}} \max_{x \in \mathcal{E}_{t}} \pi_{z,\mathcal{X}}(x)$

$g_{t}(m) = b_{t} \cdot \left( \langle a_{t}, m \rangle - f_{t}(a_{t}) \right)^{2}.$

$\beta_{t} = 6d + 2d \sqrt{\frac{\sum_{s=1}^{t-1} g_{s}(m_{s})}{\vartheta \log T}},$

$m_{t+1}' = m_{t} - \eta b_{t} \cdot \left( \langle a_{t}, m_{t} \rangle - f_{t}(a_{t}) \right) a_{t}, \quad m_{t+1} = \min\left\{ 1, \frac{1}{\|m_{t+1}'\|_{2}} \right\} m_{t+1}',$

$R_{T} = O\left( d \sqrt{\vartheta \log T \left( \min\{Q, P\} + \mathbf{E}\left[ \sum_{t=1}^{T} (\varepsilon_{t}(a_{t}))^{2} \right] \right)} + d \vartheta \log T \right).$

$R_{T} = O\left( d \sqrt{\vartheta L^{*} \log T} + d \vartheta \log T \right).$

$R_{T}(a^{*}) = O\left( \left( \frac{\kappa d \sigma^{2}}{\Delta_{\min}} + 1 \right) d \vartheta \log T + \sqrt{\left( \frac{\kappa \sigma^{2}}{\Delta_{\min}} + 1 \right) C d^{2} \vartheta \log T} \right),$

$R_{T} = O\left( d \cdot \mathbf{E}\left[ \sqrt{\vartheta \log T \sum_{t=1}^{T} g_{t}(m_{t})} \right] + d \vartheta \log T \right).$

$\sum_{t=1}^{T} g_{t}(m_{t}) \leq \frac{1}{1 - 2\eta} \sum_{t=1}^{T} g_{t}(u_{t}) + \frac{1}{\eta(1 - 2\eta)} \left( 2 \sum_{t=1}^{T} \|u_{t+1} - u_{t}\|_{2} + \frac{1}{2} \|u_{T+1}\|_{2}^{2} \right).$

$R_{T} = O\left( d \sqrt{\left( C + \sigma^{2} \mathbf{E}\left[ \sum_{t=1}^{T} r_{t} \right] \right) \vartheta \log T} + d \vartheta \log T \right).$

$r_{t} \leq \kappa \cdot \min_{z \in \mathcal{A}} \left\{ \max_{x \in \mathcal{E}_{t}} \pi_{z,\mathcal{X}}(x) \right\} \leq \kappa \cdot \min_{z \in \mathcal{A}} \left\{ \max_{x \in W_{1}(x_{t})} \pi_{z,\mathcal{X}}(x) \right\},$

$\max_{x \in W_{1}(y)} \pi_{a^{*},\mathcal{X}}(x) \leq 2 \frac{\Delta(y)}{\Delta_{\min}}, \quad \text{where} \quad \Delta(y) = \langle \ell^{*}, y - a^{*} \rangle, \quad \Delta_{\min} = \min_{a \in \mathcal{A} \setminus \{a^{*}\}} \Delta(a).$

$R_{T}(a^{*}) = O\left( d \sqrt{\frac{\vartheta \kappa \sigma^{2} \log T}{\Delta_{\min}} R_{T}(a^{*})} + d \sqrt{\left( \frac{\kappa \sigma^{2}}{\Delta_{\min}} + 1 \right) C \vartheta \log T} + d \vartheta \log T \right).$

$R_{T}(a^{*}) = O\left( \left( \frac{d \kappa \sigma^{2}}{\Delta_{\min}} + 1 \right) d \vartheta \log T + d \sqrt{\left( \frac{\kappa \sigma^{2}}{\Delta_{\min}} + 1 \right) C \vartheta \log T} \right),$

$(1 - \|x - y\|_{x,\psi})^{2} \nabla^{2}\psi(y) \preceq \nabla^{2}\psi(x) \preceq (1 - \|x - y\|_{x,\psi})^{-2} \nabla^{2}\psi(y)$

$\|x - x^{*}\|_{x,\psi} \leq \frac{\lambda(x,\psi)}{1 - \lambda(x,\psi)},$

$\langle \ell, x - y \rangle - \beta D_{\psi}(y, x) \leq \frac{2}{\beta} \|\ell\|_{x,\psi}^{*2}$

$\langle \ell, x - y \rangle \leq \|\ell\|_{x,\psi}^{*} \|x - y\|_{x,\psi} \leq \frac{2}{\beta} \|\ell\|_{x,\psi}^{*2} + \frac{\beta}{8} \|x - y\|_{x,\psi}^{2}.$

$\|x - y\|_{\xi,\psi}^{2} \geq (1 - \|\xi - x\|_{x,\psi})^{2} \|x - y\|_{x,\psi}^{2} = (1 - \alpha \|x - y\|_{x,\psi})^{2} \|x - y\|_{x,\psi}^{2} \geq \frac{1}{4} \|x - y\|_{x,\psi}^{2}.$

Taxonomy

Topics: Advanced Bandit Algorithms Research · Machine Learning and Algorithms · Data Stream Mining Techniques

Full text

Best-of-Three-Worlds Linear Bandit Algorithm

with Variance-Adaptive Regret Bounds

Shinji Ito NEC Corporation. Email: [email protected], [email protected].

Kei Takemura∗

Abstract

This paper proposes a linear bandit algorithm that is adaptive to environments at two different levels of hierarchy. At the higher level, the proposed algorithm adapts to a variety of types of environments. More precisely, it achieves best-of-three-worlds regret bounds, i.e., of ${O}(\sqrt{T\log T})$ for adversarial environments and of $O(\frac{\log T}{\Delta_{\min}}+\sqrt{\frac{C\log T}{\Delta_{\min}}})$ for stochastic environments with adversarial corruptions, where $T$ , $\Delta_{\min}$ , and $C$ denote, respectively, the time horizon, the minimum sub-optimality gap, and the total amount of the corruption. Note that polynomial factors in the dimensionality are omitted here. At the lower level, in each of the adversarial and stochastic regimes, the proposed algorithm adapts to certain environmental characteristics, thereby performing better. The proposed algorithm has data-dependent regret bounds that depend on all of the cumulative loss for the optimal action, the total quadratic variation, and the path-length of the loss vector sequence. In addition, for stochastic environments, the proposed algorithm has a variance-adaptive regret bound of $O(\frac{\sigma^{2}\log T}{\Delta_{\min}})$ as well, where $\sigma^{2}$ denotes the maximum variance of the feedback loss. The proposed algorithm is based on the SCRiBLe algorithm. By incorporating into this a new technique we call scaled-up sampling, we obtain high-level adaptability, and by incorporating the technique of optimistic online learning, we obtain low-level adaptability.

1 Introduction

This paper considers linear bandit problems. In this class of problems, a player chooses, in each round $t$, an action $a_{t}$ from a given action set $\mathcal{A}$, which is a subset of a $d$-dimensional linear space. The player then observes the incurred loss $f_{t}(a_{t})\in[-1,1]$, where the (conditional) expectation of $f_{t}$ is assumed to be a linear function, i.e., $f_{t}$ is expressed as $f_{t}(a)=\left\langle\ell_{t},a\right\rangle+\varepsilon_{t}(a)$ with some vector $\ell_{t}\in\mathbb{R}^{d}$ and some noise $\varepsilon_{t}$. The performance of the player is evaluated in terms of the regret $R_{T}$, defined by $R_{T}(a^{*})=\operatorname*{\mathbf{E}}\left[\sum_{t=1}^{T}f_{t}(a_{t})-\sum_{t=1}^{T}f_{t}(a^{*})\right]$ and $R_{T}=\max_{a^{*}\in\mathcal{A}}R_{T}(a^{*})$.

Algorithms for linear bandit problems have been proposed mainly for two different types of environments: stochastic and adversarial. In stochastic environments, $\{f_{t}\}$ are assumed to follow an unknown distribution $\mathcal{D}$ independently for all $t$. Consequently, we may assume that there exists $\ell^{*}\in\mathbb{R}^{d}$ such that $\ell_{t}=\ell^{*}$ and $\varepsilon_{t}(a)$ follows an identical distribution for all $t$.[1]

[1] In standard stochastic settings, it is assumed that $\varepsilon_{t}(a)$ follows a zero-mean distribution for all $a\in\mathcal{A}$. Our proposed algorithm, however, works well under milder assumptions, details of which are given in Section 2.1 and Remark 3.

In adversarial environments, the distributions of $f_{t}$ (and thus also $\ell_{t}$ ) are decided arbitrarily depending on the action sequence $(a_{s})_{s=1}^{t-1}$ that the player has chosen so far.

What we can do in linear bandit problems varies greatly depending on the type of environment. For stochastic environments, it is known that the optimal regret is of $\Theta(\log T)$ (Lattimore and Szepesvari, 2017), ignoring the factor dependent on $d,\mathcal{A}$ and $\ell^{*}$. For adversarial environments, the minimax optimal regret is $\tilde{\Theta}(d\sqrt{T})$ (Bubeck et al., 2012), where we ignore poly-logarithmic factors with respect to $d$ and $T$ in the notation of $\tilde{O},\tilde{\Omega}$ and $\tilde{\Theta}$. A class of intermediate settings between these two types of environments is called stochastic environments with adversarial corruption (Lykouris et al., 2018), or corrupted stochastic environments. Environments in this regime are parametrized by corruption level $C$, which measures the amount of the adversarial component. For this setting, an algorithm achieving $O((\log T)^{2}+C)$-regret has been proposed (Lee et al., 2021).

The aim of this paper is to make possible the construction of adaptive algorithms that automatically exploit certain specific characteristics of environments. In existing studies of bandit algorithms, the concept of adaptability has been considered at two different levels, which we refer to here as high-level adaptability and low-level adaptability. Algorithms with high-level adaptability are designed to work well for different types of environments, e.g., stochastic and adversarial types. Algorithms with low-level adaptability perform better in specific individual environments by exploiting certain favorable characteristics that they possess, e.g., small cumulative loss or small variance in loss sequences.

High-level-adaptive bandit algorithms that perform (nearly) optimally for both stochastic and adversarial environments are called best-of-both-worlds (BOBW) algorithms (Bubeck and Slivkins, 2012). Among such algorithms, those that can adapt to corrupted stochastic environments are referred to as best-of-all-worlds (Erez and Koren, 2021) or best-of-three-worlds (BOTW) algorithms (Lee et al., 2021). For linear bandit problems, Lee et al. (2021) provide a best-of-three-worlds algorithm that achieves regret bounds of $O((\log T)^{2})$ for stochastic environments, of $\tilde{O}(\sqrt{T})$ for adversarial environments, and of ${O}((\log T)^{2}+C)$ for corrupted stochastic environments.

Various types of low-level-adaptive algorithms have been considered for adversarial bandit problems. Representative examples are algorithms with $\tilde{O}(\sqrt{L^{*}})$-regret, where $L^{*}$ represents the cumulative loss for the optimal action; such examples are said to have first-order regret bounds. In addition to such algorithms, Hazan and Kale (2011) proposed an algorithm with a second-order regret bound of $\tilde{O}(\sqrt{Q})$ that depends on the total quadratic variation $Q$ of loss vectors. An algorithm by Ito (2021) achieves $\tilde{O}(\sqrt{\min\{L^{*},Q,P\}})$-regret, which means that the algorithm simultaneously has first-order and second-order bounds as well as a bound depending on the path-length $P$ of the loss sequence. These regret bounds, which are referred to as data-dependent regret bounds, imply that algorithm performance can be improved by exploiting certain environmental characteristics that are common in applications, such as small variations in loss sequences or sparsity of loss. For the stochastic multi-armed bandit problem, Audibert et al. (2007) proposed an algorithm with an $O(\sum_{i}(\frac{\sigma_{i}^{2}}{\Delta_{i}}+1)\log T)$-regret bound that depends not only on the sub-optimality gap $\Delta_{i}$ but also on the variance $\sigma_{i}^{2}$ of the loss. We refer to such bounds as variance-adaptive bounds, and they can be considered to represent low-level adaptability in stochastic regimes.

1.1 Contribution of this work

The main contribution of this paper is the proposal of a linear bandit algorithm that combines high-level adaptability and low-level adaptability. It is a BOTW algorithm that achieves regret bounds of $O(\log T)$ in stochastic environments, $\tilde{O}(\sqrt{T})$ in adversarial environments, and $O(\log T+\sqrt{C\log T})$ in corrupted stochastic environments, ignoring factors depending on $d,\mathcal{A}$ and $\ell^{*}$ . Further, the algorithm achieves first-order, second-order, and path-length bounds in adversarial environments. Simultaneously, it has variance-adaptive regret bounds for (corrupted) stochastic environments.

The proposed algorithm (Algorithm 1) follows the approach of SCRiBLe (Abernethy et al., 2008a, 2012) which stands for Self-Concordant Regularization in Bandit Learning. This approach uses a class of functions known as self-concordant barriers (Nesterov and Nemirovskii, 1994) as regularizers. Self-concordant barriers are characterized with a parameter $\vartheta\geq 1$ that can be assumed to satisfy $\vartheta=O(d)$ , details of which are given in Section 2.3. The regret bounds of our algorithm can be expressed with parameters explained in Table 1, including the parameter $\vartheta$ , as follows:

Theorem 1 (informal).

In adversarial environments with $\varepsilon_{t}(a)=0$ , the regret for Algorithm 1 is bounded as $R_{T}=O\left(d\sqrt{\vartheta\min\{T,Q,P\}\log T}\right)$ . Further, if $f_{t}(a)\geq 0$ for all $a\in\mathcal{A}$ and $t\in[T]$ , we have $R_{T}=O\left(d\sqrt{\vartheta L^{*}\log T}\right)$ . In stochastic environments (i.e., if $\ell_{t}=\ell^{*}$ for all $t$ ), we have $R_{T}=O\left((\frac{d\sigma^{2}}{\Delta_{\min}}+1)d\vartheta\log T\right)$ . In corrupted stochastic environments with the corruption level $C=\sum_{t=1}^{T}\|\ell_{t}-\ell^{*}\|_{2}$ , we have $R_{T}=O\left((\frac{d\sigma^{2}}{\Delta_{\min}}+1)d\vartheta\log T+\sqrt{C\cdot(\frac{\sigma^{2}}{\Delta_{\min}}+1)d^{2}\vartheta\log T}\right)$ .

Table 2 provides a comparison of our regret bounds with those in existing studies. For stochastic settings, the tight asymptotic regret given $\mathcal{A}$ and $\ell^{*}$ can be characterized with $c^{*}=c(\mathcal{A},\ell^{*})$ , a definition of which can be found in, e.g., the paper by Lattimore and Szepesvari (2017). They have provided an algorithm that achieves an asymptotically optimal regret bound of $R_{T}=c^{*}\log T+o(\log T)$ . However, such asymptotically optimal algorithms are not necessarily optimal in environments with small variance $\sigma^{2}$ . In the case of $c^{*}=\Omega\left((\frac{d\sigma^{2}}{\Delta_{\min}}+1)d\vartheta\right)$ , the proposed algorithm provides a better regret bound. We would also like to emphasize the fact that our stochastic regret bound includes only a single $\log T$ factor, while the bound by the BOTW algorithm of Lee et al. (2021) includes a $(\log T)^{2}$ factor.

For adversarial environments, Ito (2021) has provided an algorithm with data-dependent regret bounds that depend on $L^{*}$, $Q$, and $P$ simultaneously. In this regard, our regret bounds here have an additional factor of $\sqrt{d\vartheta}$ but are better in terms of the dependency w.r.t. $\log T$. Our regret bounds can be better than those of the BOTW algorithm by Lee et al. (2021) if the loss sequence satisfies $\min\{L^{*},Q,P\}=O\left(T\log(T|\mathcal{A}|)/\sqrt{d\vartheta\log T}\right)$.

For corrupted stochastic environments, the BOTW algorithm by Lee et al. (2021) achieves a regret bound of $O\left(\frac{d\log(T|\mathcal{A}|)\log T}{\Delta_{\min}}+C\right)$ , while our bound is $O\left(\mathcal{R}^{\mathrm{sto}}+\sqrt{\mathcal{R}^{\mathrm{sto}}C}\right)$ , where $\mathcal{R}^{\mathrm{sto}}$ satisfies $\mathcal{R}^{\mathrm{sto}}\leq O\left((\frac{\sigma^{2}}{\Delta_{\min}}+1)d^{2}\vartheta\log T\right)$ . As $\sqrt{\mathcal{R}^{\mathrm{sto}}C}\leq\frac{1}{2}(\mathcal{R}^{\mathrm{sto}}+C)$ follows from the AM-GM inequality, our algorithm also implies $R_{T}=O\left(\mathcal{R}^{\mathrm{sto}}+C\right)$ , which is superior to the bound by Lee et al. (2021) when $\sigma^{2}+\Delta_{\min}=O\left(\frac{\log(T|\mathcal{A}|)}{d\vartheta}\right)$ . We would like to stress here that the impact of corruption on the performance of our algorithm is only of a square-root factor in $C$ , while algorithms in existing studies (Lee et al., 2021; Li et al., 2019; Bogunovic et al., 2021) include at least a linear factor in $C$ . Comparison of such results w.r.t. corrupted settings, however, requires particular care, as there are differences in the details of problem settings.

Remark 1.

In this paper, regret is defined in terms of loss including corruption, while existing studies define regret in terms of loss without corruption. As the difference between these two notions of regret is at most $O(C)$ , our algorithm enjoys $O\left(\mathcal{R}^{\mathrm{sto}}+C\right)$ -bound even in terms of the latter definition of regret. Such a difference in models has been discussed by Gupta et al. (2019).

The main innovations for achieving high-level adaptability (i.e., the BOTW property) concern the sampling method for actions. In a previous study, Abernethy et al. (2008b) compute a point $x_{t}$ in the convex hull $\mathrm{conv}(\mathcal{A})$ of the action set $\mathcal{A}$ using a follow-the-regularized-leader (FTRL) approach, and then pick $a_{t}$ from the Dikin ellipsoid $W_{1}(x_{t})\subseteq\mathrm{conv}(\mathcal{A})$ that is defined from the self-concordant barrier for $\mathrm{conv}(\mathcal{A})$. Here, the action $a_{t}$ must be sampled so that its expectation matches $x_{t}$. In addition, the larger the variance of $a_{t}$, the better the estimator $\hat{\ell}_{t}$ for $\ell_{t}$ that we can construct, i.e., the smaller the variance of $\hat{\ell}_{t}$. In this paper, in order to improve the variance of the loss estimator, we introduce a new technique that we refer to as scaled-up sampling (see Figure 1). In this approach, we construct a scaled-up set $W^{\prime}\subseteq\mathrm{conv}(\mathcal{A})$ of the Dikin ellipsoid $W_{1}(x_{t})$ with a reference point $z_{t}\in\mathcal{A}$, for which we let $\alpha_{t}\geq 1$ denote the scaling factor. Rather than sampling from $W_{1}(x_{t})$ as is done in the previous study, we pick $a_{t}$ from $W^{\prime}$ with probability $\alpha_{t}^{-1}$, and otherwise set $a_{t}=z_{t}$ (the expectation of $a_{t}$ then matches $x_{t}$ as well). Consequently, the variance of $a_{t}$ becomes $\alpha_{t}$ times larger and the variance of the loss-vector estimator becomes $\alpha_{t}^{-1}$ times smaller, which contributes to the improvement of the regret upper bound. In stochastic environments in particular, intuitively, $x_{t}$ approaches an extreme point (a truly optimal solution), allowing for a smaller $W_{1}(x_{t})$ and a larger value of $\alpha_{t}$, which leads to a significant improvement in regret.
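The effect of scaled-up sampling on the mean and variance of $a_{t}$ can be checked numerically. The following is a minimal sketch, not the paper's Algorithm 1: it assumes a finite base sampler (four points around $x_{t}$ standing in for the points drawn from the Dikin ellipsoid), and the names `scaled_up_support`, `mean_and_cov`, and all numeric values are hypothetical. Feasibility of the scaled-up points in $\mathrm{conv}(\mathcal{A})$ is not checked here.

```python
import numpy as np


def scaled_up_support(points, probs, z, alpha):
    """Support and probabilities of a scaled-up sampler (illustrative only).

    points: (n, d) support of the base sampler, with mean x_t = probs @ points
    probs:  (n,) base probabilities
    z:      (d,) reference point z_t
    alpha:  scaling factor alpha_t >= 1
    """
    points = np.asarray(points, dtype=float)
    probs = np.asarray(probs, dtype=float)
    z = np.asarray(z, dtype=float)
    # With probability 1/alpha, play a point of the scaled-up set z + alpha (x - z).
    scaled = z + alpha * (points - z)
    # With the remaining probability 1 - 1/alpha, play the reference point z.
    support = np.vstack([scaled, z[None, :]])
    weights = np.append(probs / alpha, 1.0 - 1.0 / alpha)
    return support, weights


def mean_and_cov(support, weights):
    """Exact mean and covariance of a finite distribution."""
    mean = weights @ support
    centered = support - mean
    cov = (centered * weights[:, None]).T @ centered
    return mean, cov


# Hypothetical toy instance in d = 2.
x_t = np.array([0.6, 0.2])
r = 0.1
base_points = x_t + r * np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
base_probs = np.full(4, 0.25)
z_t = np.array([1.0, 0.0])
alpha_t = 3.0

_, base_cov = mean_and_cov(base_points, base_probs)
support, weights = scaled_up_support(base_points, base_probs, z_t, alpha_t)
new_mean, new_cov = mean_and_cov(support, weights)
print(new_mean, np.trace(base_cov), np.trace(new_cov))
```

In this toy instance the mean of the scaled-up sampler equals $x_{t}$, and the total variance grows by at least the factor $\alpha_{t}$, which matches the intuition given in the paragraph above.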

In proving the high-level adaptability, we use the self-bounding technique (Zimmert and Seldin, 2021; Wei and Luo, 2018). We first show that the proposed algorithm, an FTRL method with scaled-up sampling and an adaptive learning rate, achieves a regret bound of $R_{T}=O\left(d\sqrt{\vartheta\sum_{t=1}^{T}\alpha_{t}^{-1}\log T}\right)$ . We further show that $\alpha_{t}^{-1}=O(\Delta(x_{t})/\Delta_{\min})$ holds in any stochastic environment, where $\Delta(x_{t})$ denotes the round-wise regret caused by choosing $x_{t}$ . Combining these two facts, we can obtain $R_{T}=O\left(d\sqrt{\vartheta\sum_{t=1}^{T}\Delta(x_{t})\Delta_{\min}^{-1}\log T}\right)=O\left(d\sqrt{\vartheta R_{T}\Delta_{\min}^{-1}\log T}\right)$ , which immediately leads to $R_{T}=O(d^{2}\vartheta\Delta_{\min}^{-1}\log T)$ in stochastic environments. As has been done in previous analyses using the self-bounding technique, we can prove improved regret bounds for the stochastically constrained adversarial regime (Zimmert and Seldin, 2021) as well, which includes corrupted stochastic environments.
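The final implication in the self-bounding argument can be spelled out in one line. In the sketch below, $c>0$ is a hypothetical constant standing in for the one hidden in the $O(\cdot)$ notation, and $R_{T}>0$ is assumed (otherwise the bound is trivial):

```latex
R_T \le c\, d \sqrt{\vartheta\, R_T\, \Delta_{\min}^{-1} \log T}
\;\Longrightarrow\;
R_T^{2} \le c^{2} d^{2}\, \vartheta\, R_T\, \Delta_{\min}^{-1} \log T
\;\Longrightarrow\;
R_T \le c^{2} d^{2}\, \vartheta\, \Delta_{\min}^{-1} \log T .
```

Squaring both sides and dividing by $R_{T}$ is what turns the square-root bound into the logarithmic bound $R_{T}=O(d^{2}\vartheta\Delta_{\min}^{-1}\log T)$.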

To achieve low-level adaptability (i.e., data-dependent bounds in adversarial environments and variance-adaptive bounds in stochastic environments), we employ the framework of optimistic online learning (Rakhlin and Sridharan, 2013). This framework incorporates optimistic prediction $m_{t}$ for $\ell_{t}$ into online learning algorithms, thereby providing regret bounds depending on $(\left\langle\ell_{t}-m_{t},a_{t}\right\rangle)^{2}$ rather than $(\left\langle\ell_{t},a_{t}\right\rangle)^{2}$. The proposed algorithm determines $m_{t}$ by means of the technique of tracking the best linear predictor, which leads to the hybrid data-dependent bounds and variance-adaptive bounds. Similar approaches can be found in (Ito, 2021; Ito et al., 2022).
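The update of the optimistic prediction listed among the paper's equations, $m_{t+1}'=m_{t}-\eta b_{t}(\langle a_{t},m_{t}\rangle-f_{t}(a_{t}))a_{t}$ followed by rescaling onto the unit ball, can be sketched as below. This is an illustration under assumptions: the function name `update_prediction`, the choice $b_{t}=1$, the step size, and the noiseless toy environment are all hypothetical and are not taken from the paper.

```python
import numpy as np


def update_prediction(m, a, loss, b, eta):
    """One step of the tracking update for the optimistic prediction m_t.

    m:    (d,) current prediction m_t
    a:    (d,) action a_t that was played
    loss: observed loss f_t(a_t)
    b:    weight b_t from the paper's update (treated here as a given number)
    eta:  step size
    """
    m = np.asarray(m, dtype=float)
    a = np.asarray(a, dtype=float)
    residual = float(a @ m) - loss
    m_prime = m - eta * b * residual * a
    # Rescale by min{1, 1/||m'||_2} so the prediction stays in the unit ball.
    norm = np.linalg.norm(m_prime)
    scale = 1.0 if norm <= 1.0 else 1.0 / norm
    return scale * m_prime


# Hypothetical noiseless stochastic environment with a fixed loss vector.
ell_star = np.array([0.3, -0.4])
actions = np.eye(2)
m = np.zeros(2)
for t in range(200):
    a = actions[t % 2]
    m = update_prediction(m, a, float(a @ ell_star), b=1.0, eta=0.25)
print(m)
```

In this toy run the prediction tracks the fixed loss vector, so the quantity $\langle\ell_{t}-m_{t},a_{t}\rangle$ that drives the optimistic regret bound shrinks over time.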

1.2 Limitation of this work and future work

We should note the issue of computational complexity w.r.t. the proposed algorithm. In the proof of $O(\log T)$ -regret bounds for (corrupted) stochastic environments, we need the assumption that the reference point $z_{t}$ , illustrated in Figure 1, is chosen so that the scaling factor $\alpha_{t}$ is (approximately) maximized. We have not, however, found an efficient method for computing such a point $z_{t}$ . A naive method for computing such a $z_{t}$ requires a computational time of at least $\Omega(|\mathcal{A}|)$ , which is highly expensive, e.g., as in most examples of combinatorial bandits (Cesa-Bianchi and Lugosi, 2012). Resolving this issue of computational complexity will be important in future work.

There is still some room for improvement in terms of regret bounds as well. As can be seen from Example 4 by Lattimore and Szepesvari (2017), the gap between $c^{*}$ and $d/\Delta_{\min}$ can be arbitrarily large, which implies that our stochastic regret bound is much larger than the lower bound in the worst case. We also note that our regret bounds only hold in expectation while regret guarantees by Lee et al. (2021) hold with high probability. If we pursue high probability bounds as well, we cannot avoid an extra $O(\log T)$ factor, as discussed in their Appendix D.

2 Preliminary

2.1 Problem setup

This section introduces the setup of the linear bandit problems dealt with in this paper. Before a game starts, the player is given the time horizon $T$ and an action set $\mathcal{A}\subseteq\mathbb{R}^{d}$ , a closed and bounded set of $d$ -dimensional vectors. Without loss of generality, we assume that $\mathcal{A}$ is not included in any proper affine subspace of $\mathbb{R}^{d}$ . We also assume that all points in $\mathcal{A}$ have $L_{2}$ norm of at most $1$ , i.e., $\mathcal{A}\subseteq B_{2}^{d}(1)$ , where $B_{2}^{d}(r)$ denotes an $L_{2}$ ball of the radius $r$ : $B_{2}^{d}(r)=\{x\in\mathbb{R}^{d}\mid\|x\|_{2}\leq r\}$ . In each round $t=1,2,\ldots,T$ , the environment determines a loss function $f_{t}:\mathcal{A}\rightarrow[-1,1]$ , and the player then chooses an action $a_{t}\in\mathcal{A}$ without knowing $f_{t}$ . After that, the player observes the incurred loss $f_{t}(a_{t})$ . The loss function $f_{t}$ can be chosen depending on the actions $(a_{s})_{s=1}^{t-1}$ selected so far. We assume that the conditional expectation of $f_{t}$ given $(a_{s})_{s=1}^{t-1}$ is an affine function, i.e., there exists $\ell_{t}\in\mathbb{R}^{d}$ , which is referred to as a loss vector, and $\xi_{t}\in\mathbb{R}$ such that $f_{t}$ is expressed as

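$$f_{t}(a)=\left\langle\ell_{t},a\right\rangle+\varepsilon_{t}(a),\quad\text{where}\quad\operatorname*{\mathbf{E}}\left[\varepsilon_{t}(a)\mid(a_{s})_{s=1}^{t-1}\right]=\xi_{t}\quad\text{for all}\quad a\in\mathcal{A}.\tag{1}$$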

This paper also assumes that $\ell_{t}\in B_{2}^{d}(1)$. By imposing further conditions on $f_{t}$, we can express a variety of regimes, as discussed below:

Stochastic regime

In a stochastic regime, it is assumed that $f_{t}$ follows an unknown distribution $\mathcal{D}$ for all $t\in[T]$ independently. This assumption implies that $\ell_{t}$ and $\xi_{t}$ do not change over all rounds, i.e., there exist a true loss vector $\ell^{*}\in\mathbb{R}^{d}$ and $\xi^{*}$ such that $\ell_{t}=\ell^{*}$ and $\xi_{t}=\xi^{*}$ hold for all $t\in[T]$. Note here that standard stochastic settings also assume that functions of $\varepsilon_{t}$ represent zero-mean noise, i.e., $\xi^{*}=0$. This assumption is, however, not necessary in the algorithm proposed in this paper. Moreover, the proposed algorithm does not even require the assumption that $\varepsilon_{t}$ follows an identical distribution, details of which will be discussed in Section 4.1.

Adversarial regime

In the adversarial regime, in contrast to the stochastic regime, $(\ell_{t})_{t=1}^{T}$ is an arbitrary sequence. More precisely, $\ell_{t}$ can be chosen in an adversarial way depending on $(a_{s})_{s=1}^{t-1}$. Though adversarial environments considered in previous studies are often free from noise, i.e., $\varepsilon_{t}(a)=0$ is assumed, most algorithms work well as long as the noise follows bounded zero-mean distributions. The proposed algorithm in this paper does not require this assumption either.

Stochastic regime with adversarial corruption

The stochastic regime with adversarial corruption is an intermediate regime between stochastic and adversarial regimes. It is parametrized by a true loss vector $\ell^{*}\in B_{2}^{d}(1)$ and by a corruption level $C\geq 0$. In this regime, the sequence of $(\ell_{t})_{t=1}^{T}$ is subject to the constraint that $\sum_{t=1}^{T}\|\ell_{t}-\ell^{*}\|_{2}\leq C$. This can be interpreted as a situation in which an adversary adds a corruption of $c_{t}=\ell_{t}-\ell^{*}$ to the loss function defined by $\ell^{*}$ and the magnitude of $c_{t}$ sums up to $C$ at most, i.e., $\sum_{t=1}^{T}\|c_{t}\|_{2}\leq C$. If we set the corruption level $C$ to zero, this regime coincides with the stochastic regime. On the other hand, if $C=\Omega(T)$, then the regime is adversarial as there are no constraints on $\ell_{t}$ except for $\|\ell_{t}\|_{2}\leq 1$.
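As a small illustration of the corruption constraint, the amount of corruption in a given loss sequence, $\sum_{t=1}^{T}\|\ell_{t}-\ell^{*}\|_{2}$, can be computed directly. The sketch below uses a hypothetical helper `corruption_level` and a made-up loss sequence in which only the first five rounds are corrupted.

```python
import numpy as np


def corruption_level(loss_vectors, ell_star):
    """Sum over rounds of ||l_t - l*||_2 for a (T, d) array of loss vectors."""
    loss_vectors = np.asarray(loss_vectors, dtype=float)
    ell_star = np.asarray(ell_star, dtype=float)
    return float(np.linalg.norm(loss_vectors - ell_star, axis=1).sum())


# Hypothetical example: T = 100 rounds, true loss vector l* = (0.3, -0.4).
ell_star = np.array([0.3, -0.4])
T = 100
clean = np.tile(ell_star, (T, 1))
corrupted = clean.copy()
# The adversary shifts the first five loss vectors by (0.3, 0.4), norm 0.5 each;
# the shifted vectors (0.6, 0.0) still satisfy ||l_t||_2 <= 1.
corrupted[:5] += np.array([0.3, 0.4])

C_clean = corruption_level(clean, ell_star)
C_corrupted = corruption_level(corrupted, ell_star)
print(C_clean, C_corrupted)
```

A sequence with no corruption gives zero and falls in the stochastic regime, while the corrupted sequence here satisfies the constraint for any $C$ at least as large as the computed sum.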

2.2 Follow the regularized leader

In the proposed algorithm, we use the framework of (optimistic) follow-the-regularized-leader (FTRL) methods. In this framework, we choose a point $x_{t}$ in a closed convex set $\mathcal{X}\subseteq\mathbb{R}^{d}$ by solving the following optimization problem:

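$$x_{t}\in\operatorname*{arg\,min}_{x\in\mathcal{X}}\left\{\left\langle m_{t}+\sum_{s=1}^{t-1}\hat{\ell}_{s},x\right\rangle+\psi_{t}(x)\right\},\tag{2}$$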

where $\hat{\ell}_{s}$ is the (estimated) loss vector, $m_{t}\in\mathbb{R}^{d}$ is an optimistic prediction, and $\psi_{t}(x)$ is a regularization term, which is a differentiable convex function over $\mathcal{X}$. Note that the original FTRL framework here does not employ optimistic prediction, i.e., the value of $m_{t}$ is fixed to $0$. The technique of optimistic prediction $m_{t}$ has been introduced to further improve the performance of FTRL, e.g., by Rakhlin and Sridharan (2013).
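To make the optimistic FTRL step concrete, here is a minimal sketch for one special case where the minimizer has a closed form. It assumes, purely for illustration, that $\mathcal{X}=[-1,1]^{d}$ and $\psi_{t}(x)=-\beta\sum_{i}\log(1-x_{i}^{2})$ with a fixed $\beta>0$; neither this set nor this barrier is claimed to be the paper's choice, and the function name `optimistic_ftrl_step` is hypothetical.

```python
import numpy as np


def optimistic_ftrl_step(cumulative_loss, m, beta):
    """Minimize <m + sum_s l_s, x> - beta * sum_i log(1 - x_i^2) over (-1, 1)^d.

    cumulative_loss: (d,) sum of the estimated loss vectors so far
    m:               (d,) optimistic prediction (all zeros gives plain FTRL)
    beta:            positive weight of the barrier regularizer
    """
    g = np.asarray(cumulative_loss, dtype=float) + np.asarray(m, dtype=float)
    # Per coordinate, the stationarity condition g + 2 beta x / (1 - x^2) = 0
    # has the root x = -g / (beta + sqrt(beta^2 + g^2)) inside (-1, 1).
    return -g / (beta + np.sqrt(beta**2 + g**2))


cumulative = np.array([2.0, -0.5, 0.0])
prediction = np.array([0.5, 0.0, 0.0])
beta = 3.0
x = optimistic_ftrl_step(cumulative, prediction, beta)

# Gradient of the objective at x; it should vanish at the minimizer.
grad = cumulative + prediction + 2.0 * beta * x / (1.0 - x**2)
print(x, grad)
```

The objective is separable and strictly convex on the open box, so a vanishing gradient identifies the unique minimizer; the barrier keeps every coordinate strictly inside $(-1,1)$.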

In the analysis of FTRL, we use the Bregman divergence $D_{\psi}$ associated with some differentiable convex function $\psi$ defined as follows:

$$D_{\psi}(x,y)=\psi(x)-\psi(y)-\left\langle\nabla\psi(y),x-y\right\rangle,$$

where $\nabla\psi(y)$ denotes the gradient of $\psi$ at $y$ . We can easily see that $D_{\psi}(x,y)\geq 0$ for any $x$ and $y$ , which follows from the convexity of $\psi$ . The following lemma provides an upper bound of the regret for FTRL:

Lemma 1.

We assume that $\psi_{1}(x)\geq 0$ and $\psi_{t+1}(x)\geq\psi_{t}(x)$ hold for all $x$ and $t$ . If $x_{t}$ is given by (2), it holds for any $x^{*}\in\mathrm{int}(\mathcal{X})$ that

[TABLE]

where $\tilde{x}_{t}$ is defined by $\tilde{x}_{t}\in\operatorname*{arg\,min}_{x\in\mathcal{X}}\left\{\left\langle\sum_{s=1}^{t-1}\hat{\ell}_{s},x\right\rangle+\psi_{t}(x)\right\}$ .

This lemma can be shown via a standard analysis for FTRL, e.g., as in Chapter 28 of Lattimore and Szepesvári (2019). We can also refer to, e.g., the proof of Lemma 1 by Ito et al. (2022).
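The Bregman divergence that appears in the analysis is straightforward to evaluate numerically. The sketch below is an illustration only: it assumes the log-barrier of the box $[-1,1]^{d}$ as the convex function $\psi$ (a choice made here for concreteness, not one prescribed by the paper) and computes $D_{\psi}(x,y)=\psi(x)-\psi(y)-\langle\nabla\psi(y),x-y\rangle$.

```python
import math
import random


def psi(x):
    """Log-barrier of the box [-1, 1]^d (an illustrative choice of psi)."""
    return -sum(math.log(1.0 - c) + math.log(1.0 + c) for c in x)


def grad_psi(x):
    """Gradient of the log-barrier, coordinate by coordinate."""
    return [1.0 / (1.0 - c) - 1.0 / (1.0 + c) for c in x]


def bregman(x, y):
    """D_psi(x, y) = psi(x) - psi(y) - <grad psi(y), x - y>."""
    inner = sum(g * (a - b) for g, a, b in zip(grad_psi(y), x, y))
    return psi(x) - psi(y) - inner
```

Evaluating `bregman` on random interior points gives non-negative values, as convexity of $\psi$ requires, and `bregman(x, x)` is zero.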

2.3 Self-concordant barriers

In our proposed algorithm, we use self-concordant barriers to define regularization terms, just as Abernethy et al. (2008b) did. Self-concordant barriers are defined as follows:

Definition 1.

A convex function $\psi:\mathrm{int}(\mathcal{X})\rightarrow\mathbb{R}$ of class $C^{3}$ is called a self-concordant function if (i) $|D^{3}\psi(x)[h,h,h]|\leq 2(D^{2}\psi(x)[h,h])^{3/2}$ holds for any $x\in\mathrm{int}(\mathcal{X})$ and $h\in\mathbb{R}^{d}$, and (ii) $\psi(x_{i})$ tends to infinity along every sequence $x_{1},x_{2},\ldots\in\mathrm{int}(\mathcal{X})$ converging to a boundary point of $\mathrm{int}(\mathcal{X})$, where $D^{k}\psi(x)[h_{1},\ldots,h_{k}]$ denotes the value of the $k$-th differential of $\psi$ at $x$ along the directions $h_{1},\ldots,h_{k}$. Let $\vartheta\geq 0$ be a non-negative real number. A self-concordant function $\psi:\mathrm{int}(\mathcal{X})\rightarrow\mathbb{R}$ is called a $\vartheta$-self-concordant barrier for $\mathcal{X}$ if $|D\psi(x)[h]|\leq\vartheta^{1/2}(D^{2}\psi(x)[h,h])^{1/2}$ holds for any $x\in\mathrm{int}(\mathcal{X})$ and $h\in\mathbb{R}^{d}$.
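The two inequalities in Definition 1 can be checked numerically on a concrete barrier. The sketch below is an illustration under assumptions made here: it takes $\mathcal{X}=[-1,1]^{d}$ with the log-barrier $\psi(x)=-\sum_{i}\left(\log(1-x_{i})+\log(1+x_{i})\right)$, which is a sum of $2d$ one-dimensional log-barriers and hence a $2d$-self-concordant barrier. Only inequality (i) and the barrier inequality are tested; the blow-up condition (ii) is not.

```python
import math
import random


def directional_derivatives(x, h):
    """Return (D psi[h], D^2 psi[h,h], D^3 psi[h,h,h]) for the box log-barrier."""
    d1 = sum(hi * (1.0 / (1.0 - xi) - 1.0 / (1.0 + xi)) for xi, hi in zip(x, h))
    d2 = sum(hi ** 2 * ((1.0 - xi) ** -2 + (1.0 + xi) ** -2) for xi, hi in zip(x, h))
    d3 = sum(hi ** 3 * (2.0 * (1.0 - xi) ** -3 - 2.0 * (1.0 + xi) ** -3)
             for xi, hi in zip(x, h))
    return d1, d2, d3


def satisfies_definition(x, h, theta, tol=1e-9):
    """Check inequality (i) and the theta-barrier inequality at the pair (x, h)."""
    d1, d2, d3 = directional_derivatives(x, h)
    self_concordant = abs(d3) <= 2.0 * d2 ** 1.5 * (1.0 + tol)
    barrier = abs(d1) <= math.sqrt(theta * d2) * (1.0 + tol)
    return self_concordant and barrier
```

For instance, `satisfies_definition(x, h, theta=2 * len(x))` holds at random interior points, whereas a much smaller `theta` fails at points close to the boundary.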

Remark 2.

For any convex set $\mathcal{X}\subseteq\mathbb{R}^{d}$, there exists a $d$-self-concordant barrier for $\mathcal{X}$ (Lee and Yue, 2021). This barrier is, however, not always efficiently computable. On the other hand, for any $d$-dimensional polytope, we can compute a $\vartheta$-self-concordant barrier with $\vartheta={O}(d)$ in polynomial time (Lee and Sidford, 2014, 2019).

Given a self-concordant barrier $\psi:\mathrm{int}(\mathcal{X})\rightarrow\mathbb{R}$ , for any $x\in\mathrm{int}(\mathcal{X})$ and $h\in\mathbb{R}^{d}$ , we assume that $\nabla^{2}\psi(x)$ has full rank. Denote

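$$\|h\|_{x,\psi}=\sqrt{h^{\top}\nabla^{2}\psi(x)h},\qquad\|h\|_{x,\psi}^{*}=\sqrt{h^{\top}\left(\nabla^{2}\psi(x)\right)^{-1}h},\tag{5}$$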

and define the Dikin’s ellipsoid $W_{r}(x)\subseteq\mathbb{R}^{d}$ of $\psi$ centered at $x$ of the radius $r>0$ as follows:

$$W_{r}(x)=\left\{y\in\mathbb{R}^{d}\mid\|y-x\|_{x,\psi}\leq r\right\}.\tag{6}$$

The three lemmas below are used in the design and analysis of our proposed algorithm.

Lemma 2 (Theorem 2.1.2 by Nesterov and Nemirovskii (1994)).

If $\psi$ is a self-concordant barrier for a closed convex set $\mathcal{X}$ , every Dikin’s ellipsoid of $\psi$ of radius $1$ is contained in $\mathcal{X}$ , i.e., $W_{1}(x)\subseteq\mathcal{X}$ holds for any $x\in\mathrm{int}(\mathcal{X})$ .
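Lemma 2 can be illustrated on a simple example. The sketch below assumes, for illustration only, the box $\mathcal{X}=[-1,1]^{d}$ with its log-barrier, whose Hessian is diagonal so that the eigenvectors are the standard basis vectors. It computes the endpoints $x\pm\lambda_{i}^{-1/2}e_{i}$ of the principal axes of $W_{1}(x)$, which have local norm $1$ and lie inside the box.

```python
import math
import random


def hessian_diag(x):
    """Diagonal of the Hessian of the box log-barrier at x (off-diagonals are zero)."""
    return [(1.0 - c) ** -2 + (1.0 + c) ** -2 for c in x]


def dikin_axis_points(x):
    """Endpoints x +/- lambda_i^{-1/2} e_i of the principal axes of W_1(x)."""
    points = []
    for i, lam in enumerate(hessian_diag(x)):
        for sign in (1.0, -1.0):
            p = list(x)
            p[i] += sign / math.sqrt(lam)  # e_i is the i-th standard basis vector
            points.append(p)
    return points


def local_norm(x, h):
    """||h||_{x,psi} = sqrt(h^T Hessian(x) h) for the diagonal Hessian above."""
    return math.sqrt(sum(lam * c * c for lam, c in zip(hessian_diag(x), h)))
```

These $2d$ endpoints are exactly the kind of set that the sampling step of the algorithm in Section 3 starts from.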

Let $\pi_{z,\mathcal{X}}(x)$ denote the Minkowski function of $\mathcal{X}$ whose pole is at $z$:

$$\pi_{z,\mathcal{X}}(x)=\inf\left\{r>0\mid z+r^{-1}(x-z)\in\mathcal{X}\right\}.\tag{7}$$

We have an upper bound on $\psi$ expressed with this Minkowski function, as follows:

Lemma 3 (Proposition 2.3.2 by Nesterov and Nemirovskii (1994)).

If $\psi$ is a $\vartheta$ -self-concordant barrier for $\mathcal{X}$ , it holds for any $x$ and $y$ in $\mathrm{int}(\mathcal{X})$ that $\psi(x)\leq\psi(y)+\vartheta\log\frac{1}{1-\pi_{y,\mathcal{X}}(x)}$ .
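For a box, the Minkowski function has a closed form, which makes Lemma 3 easy to test. The following sketch assumes $\mathcal{X}=[-1,1]^{d}$ with its log-barrier (so $\vartheta=2d$); both choices are illustrative assumptions rather than part of the paper. It computes $\pi_{z,\mathcal{X}}(x)$ as the reciprocal of the largest step $s$ for which $z+s(x-z)$ stays in the box.

```python
import math
import random


def psi(x):
    """Log-barrier of the box [-1, 1]^d."""
    return -sum(math.log(1.0 - c) + math.log(1.0 + c) for c in x)


def minkowski_box(z, x):
    """pi_{z,X}(x) = inf{r > 0 : z + (x - z) / r in X} for X = [-1, 1]^d."""
    max_step = math.inf
    for zi, xi in zip(z, x):
        ui = xi - zi
        if ui > 0:
            max_step = min(max_step, (1.0 - zi) / ui)
        elif ui < 0:
            max_step = min(max_step, (-1.0 - zi) / ui)
    # If x == z, every r > 0 is feasible and the infimum is 0.
    return 0.0 if math.isinf(max_step) else 1.0 / max_step


def lemma3_gap(x, y, theta):
    """Right-hand side minus left-hand side of the bound in Lemma 3."""
    pi = minkowski_box(y, x)
    return psi(y) + theta * math.log(1.0 / (1.0 - pi)) - psi(x)
```

For interior points $x$ and $y$, `lemma3_gap(x, y, 2 * len(x))` is non-negative, in line with the lemma.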

If we use a self-concordant barrier $\psi$, we can use the following lemma to bound the stability term $\left(\left\langle\hat{\ell}_{t}-m_{t},x_{t}-\tilde{x}_{t+1}\right\rangle-D_{\psi_{t}}(\tilde{x}_{t+1},x_{t})\right)$ in Lemma 1.

Lemma 4.

Let $\psi$ be a self-concordant function on $\mathcal{X}$ and $x,y\in\mathrm{int}(\mathcal{X})$ . Let $\beta>0$ and $\ell\in\mathbb{R}^{d}$ . Suppose that $\|\ell\|_{x,\psi}^{*}\leq\beta/3$ . We then have $\langle\ell,x-y\rangle-\beta D_{\psi}(y,x)\leq\frac{2}{\beta}\|\ell\|_{x,\psi}^{*2}.$
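The inequality in Lemma 4 can be sanity-checked numerically. The sketch below is an illustration under the assumption that $\psi$ is the log-barrier of the box $[-1,1]^{d}$; it rescales a random $\ell$ so that $\|\ell\|_{x,\psi}^{*}=\beta/3$ (the largest value the lemma allows) and evaluates both sides.

```python
import math
import random


def psi(x):
    """Log-barrier of the box [-1, 1]^d (illustrative self-concordant function)."""
    return -sum(math.log(1.0 - c) + math.log(1.0 + c) for c in x)


def grad_psi(x):
    return [1.0 / (1.0 - c) - 1.0 / (1.0 + c) for c in x]


def hessian_diag(x):
    return [(1.0 - c) ** -2 + (1.0 + c) ** -2 for c in x]


def bregman(a, b):
    """D_psi(a, b) = psi(a) - psi(b) - <grad psi(b), a - b>."""
    return psi(a) - psi(b) - sum(g * (p - q) for g, p, q in zip(grad_psi(b), a, b))


def dual_norm(x, ell):
    """||ell||*_{x,psi} = sqrt(ell^T Hessian(x)^{-1} ell) for a diagonal Hessian."""
    return math.sqrt(sum(c * c / lam for lam, c in zip(hessian_diag(x), ell)))


def lemma4_gap(x, y, ell, beta):
    """(2 / beta) ||ell||*^2 minus (<ell, x - y> - beta * D_psi(y, x))."""
    lhs = sum(l * (a - b) for l, a, b in zip(ell, x, y)) - beta * bregman(y, x)
    return (2.0 / beta) * dual_norm(x, ell) ** 2 - lhs
```

A non-negative `lemma4_gap` at every sampled pair $(x,y)$ is what the lemma predicts.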

3 Algorithm

Let $\mathcal{X}$ be the convex hull of $\mathcal{A}$ and $\psi$ be a $\vartheta$ -self-concordant barrier for $\mathcal{X}$ . In the proposed algorithm, we compute $x_{t}$ by solving the optimization problem (2) with $\psi_{t}(x)=\beta_{t}\psi(x)$ , where $\beta_{t}$ is a learning rate parameter satisfying $6d\leq\beta_{1}\leq\beta_{2}\leq\cdots$ . The manner of computing $a_{t}$ , $\hat{\ell}_{t}$ , $m_{t}$ , and $\beta_{t}$ will be presented below.

Action $a_{t}$ and unbiased estimator $\hat{\ell}_{t}$ for loss vector

After computing $x_{t}$ , we choose the action $a_{t}\in\mathcal{A}$ so that $\operatorname*{\mathbf{E}}[a_{t}|x_{t}]=x_{t}$ . Let $\{e_{1},\ldots,e_{d}\}$ and $\{\lambda_{1},\ldots,\lambda_{d}\}$ be the set of eigenvectors and eigenvalues of $\nabla^{2}\psi(x_{t})$ . Define $\mathcal{E}_{t}:=\{x_{t}+\lambda_{i}^{-1/2}e_{i}\mid i\in[d]\}\cup\{x_{t}-\lambda_{i}^{-1/2}e_{i}\mid i\in[d]\}$ . Note that here $\mathcal{E}_{t}\subseteq\mathcal{X}$ holds since $\mathcal{E}_{t}\subseteq W_{1}(x_{t})$ follows from the definition of $\mathcal{E}_{t}$ and since $W_{1}(x)\subseteq\mathcal{X}$ follows from Lemma 2. In the algorithm by Abernethy et al. (2008b), the action $a_{t}$ is chosen from $\mathcal{E}_{t}$ uniformly at random. Unlike this existing method, our proposed algorithm chooses an action from a set $\mathcal{E}^{\prime}_{t}$ scaled up from $\mathcal{E}_{t}$ with a reference point $z_{t}\in\mathcal{A}$ , or chooses $a_{t}=z_{t}$ with some probability. More precisely, after computing $\mathcal{E}_{t}$ and choosing a point $z_{t}\in\mathcal{A}$ , we set $\mathcal{E}^{\prime}_{t}$ by

$$\mathcal{E}^{\prime}_{t}=\left\{z_{t}+\alpha_{t}(x-z_{t})\mid x\in\mathcal{E}_{t}\right\},\tag{8}$$

where $\alpha_{t}\geq 1$ is defined as the largest real number such that $\mathcal{E}^{\prime}_{t}$ is included in $\mathcal{X}$. How to choose $z_{t}$ is discussed in the next paragraph. If we denote $r_{t}=\alpha_{t}^{-1}\in(0,1]$, we can express $r_{t}$ as follows:

$$r_{t}=\max_{x\in\mathcal{E}_{t}}\pi_{z_{t},\mathcal{X}}(x),\tag{9}$$

where $\pi$ is the Minkowski function defined by (7). We choose $z_{t}\in\mathcal{A}$ so that the value of $r_{t}$ is as small as possible. Let $x^{\prime}_{t}$ denote the center of $\mathcal{E}^{\prime}_{t}$, i.e., define $x^{\prime}_{t}=z_{t}+r_{t}^{-1}(x_{t}-z_{t})$. We then set $b_{t}=1$ with probability $r_{t}$ and $b_{t}=0$ with probability $1-r_{t}$. If $b_{t}=0$, we choose $a^{\prime}_{t}=z_{t}$. If $b_{t}=1$, we choose $a^{\prime}_{t}$ from $\mathcal{E}^{\prime}_{t}$ uniformly at random. In other words, we pick $i_{t}$ uniformly at random from $[d]$ and $v_{t}\in\{-1,1\}$ with probability $1/2$ each, and set $a^{\prime}_{t}=z_{t}+r_{t}^{-1}(x_{t}+v_{t}\lambda_{i_{t}}^{-1/2}e_{i_{t}}-z_{t})$. We then output $a_{t}\in\mathcal{A}$ at random so that its conditional expectation coincides with $a^{\prime}_{t}\in\mathcal{X}=\mathrm{conv}(\mathcal{A})$. After obtaining the feedback $f_{t}(a_{t})$, we define $\hat{\ell}_{t}$ by

[TABLE]

We can show that the conditional expectation of $a_{t}$ is equal to $x_{t}$ and that $\hat{\ell}_{t}$ is an unbiased estimator of $\ell_{t}$ , i.e., we have $\operatorname*{\mathbf{E}}\left[a_{t}|x_{t}\right]=x_{t}$ and $\operatorname*{\mathbf{E}}[\hat{\ell}_{t}|x_{t}]=\ell_{t}$ , proofs of which are given in Section D in the appendix. We note that, thanks to the scaled-up sampling, the mean square of $\hat{\ell}_{t}-m_{t}$ is improved by a factor of $1/\alpha_{t}$ , which plays a central role in our proof of BOTW regret bounds.
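The identity $\operatorname*{\mathbf{E}}[a^{\prime}_{t}|x_{t}]=x_{t}$ behind the scaled-up sampling can be verified by enumerating all outcomes. The sketch below is an illustration under assumptions made here: $\mathcal{X}=[-1,1]^{d}$ with its log-barrier (diagonal Hessian, standard basis as eigenvectors) and a vertex of the box as the reference point $z_{t}$. It covers only the intermediate point $a^{\prime}_{t}$; the final randomization from $a^{\prime}_{t}$ to $a_{t}\in\mathcal{A}$ and the estimator $\hat{\ell}_{t}$ are omitted.

```python
import math


def hessian_diag(x):
    """Eigenvalues lambda_i of the (diagonal) Hessian of the box log-barrier."""
    return [(1.0 - c) ** -2 + (1.0 + c) ** -2 for c in x]


def minkowski_box(z, x):
    """pi_{z,X}(x) for X = [-1, 1]^d."""
    max_step = math.inf
    for zi, xi in zip(z, x):
        ui = xi - zi
        if ui > 0:
            max_step = min(max_step, (1.0 - zi) / ui)
        elif ui < 0:
            max_step = min(max_step, (-1.0 - zi) / ui)
    return 0.0 if math.isinf(max_step) else 1.0 / max_step


def sampling_distribution(x_t, z_t):
    """Return r_t and the list of (probability, point) outcomes for a'_t."""
    d = len(x_t)
    axis_points = []  # the set E_t
    for i, lam in enumerate(hessian_diag(x_t)):
        for v in (1.0, -1.0):
            p = list(x_t)
            p[i] += v / math.sqrt(lam)
            axis_points.append(p)
    # r_t = 1 / alpha_t is the largest Minkowski value over E_t.
    r_t = max(minkowski_box(z_t, p) for p in axis_points)
    outcomes = [(1.0 - r_t, list(z_t))]  # b_t = 0: play the reference point
    for p in axis_points:  # b_t = 1: uniform over the 2d scaled-up points
        scaled = [zi + (pi - zi) / r_t for zi, pi in zip(z_t, p)]
        outcomes.append((r_t / (2 * d), scaled))
    return r_t, outcomes


def expected_point(outcomes):
    """Exact expectation of a'_t under the outcome distribution."""
    d = len(outcomes[0][1])
    return [sum(prob * pt[i] for prob, pt in outcomes) for i in range(d)]
```

For example, with `x_t = [0.3, -0.5, 0.8]` and `z_t = [1.0, -1.0, 1.0]`, `expected_point` recovers `x_t` up to rounding, and every scaled point stays inside the box.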

Reference point $z_{t}$

We will see that the smaller the value of $r_{t}$ is, the smaller the variance of $\hat{\ell}_{t}-m_{t}$ is, resulting in an improvement in regret. To take maximum advantage of this effect, we choose $z_{t}$ so that $r_{t}$ is as small as possible. More precisely, for a constant $\kappa\geq 1$, we assume that $z_{t}$ satisfies

[TABLE]

for all $t\in[T]$ . This assumption is used in our proof of $O(\log T)$ -regret in stochastic environments.

Learning rate parameter $\beta_{t}$

In the regret analysis in Section 4, we will show that the regret for the proposed algorithm is bounded as $R_{T}=O\left(\operatorname*{\mathbf{E}}\left[d^{2}\sum_{t=1}^{T}\frac{g_{t}(m_{t})}{\beta_{t}}+\beta_{T+1}\vartheta\log T\right]\right)$ , where $g_{t}(m)$ is defined as

[TABLE]

Intuitively, $g_{t}(m_{t})/\beta_{t}$ comes from the terms $\left\langle\hat{\ell}_{t}-m_{t},x_{t}-\tilde{x}_{t+1}\right\rangle-D_{\psi_{t}}(\tilde{x}_{t+1},x_{t})$ in (4), which are called stability terms, and $\beta_{T+1}\vartheta\log T$ comes from $\psi_{T+1}(x^{*})$, which is called the penalty term. To balance the stability and penalty terms, we set $\beta_{t}$ by

[TABLE]

which leads to $R_{T}=O\left(d\operatorname*{\mathbf{E}}\left[\sqrt{\vartheta\log T\cdot\sum_{t=1}^{T}g_{t}(m_{t})}\right]+d\log T\right)$ .

Optimistic prediction $m_{t}$

To minimize the part of $\sum_{t=1}^{T}g_{t}(m_{t})$ , we choose $m_{t}$ by using online projected gradient descent for $g_{t}$ . We set $m_{1}=0$ and update $m_{t}$ as follows:

[TABLE]

where $\eta\in(0,1/4)$ is the learning rate parameter for updating $m_{t}$ .
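To make the update concrete, here is a minimal sketch of one projected gradient step. It rests on assumptions that should be checked against (12) and (14): the loss is taken to be $g_{t}(m)=b_{t}\left(f_{t}(a_{t})-\langle m,a_{t}\rangle\right)^{2}$, and the projection is onto the unit ball $B_{2}^{d}(1)$. The function names are hypothetical.

```python
import math
import random


def norm2(v):
    return math.sqrt(sum(c * c for c in v))


def project_unit_ball(v):
    """Euclidean projection onto the unit ball B_2^d(1)."""
    n = norm2(v)
    return list(v) if n <= 1.0 else [c / n for c in v]


def ogd_step(m, a, f_value, b, eta):
    """One projected gradient step on g(m) = b * (f_value - <m, a>)^2 (assumed form)."""
    residual = f_value - sum(mi * ai for mi, ai in zip(m, a))
    # grad g(m) = -2 * b * residual * a, so descending adds 2 * eta * b * residual * a.
    moved = [mi + 2.0 * eta * b * residual * ai for mi, ai in zip(m, a)]
    return project_unit_ball(moved)
```

With noiseless feedback $f_{t}(a_{t})=\langle\ell^{*},a_{t}\rangle$ and unit-norm actions, repeated calls move $m_{t}$ toward $\ell^{*}$, and rounds with $b_{t}=0$ leave $m_{t}$ unchanged.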

The proposed algorithm can be summarized as Algorithm 1 in Section B in the appendix.

Computational complexity

The procedure in each round can be performed in polynomial time in $d$, except for the computation of $z_{t}$. Indeed, given a self-concordant barrier for $\mathcal{X}$, we can solve an arbitrary linear optimization problem over $\mathcal{X}$ (and thus also over $\mathcal{A}$), with the aid of, e.g., interior point methods (Nesterov and Nemirovskii, 1994). This implies that convex optimization problems (2) can be solved in polynomial time as well. Further, for any $a^{\prime}_{t}\in\mathcal{X}$, we can find an expression of $a^{\prime}_{t}$ as a convex combination of points in $\mathcal{A}$ in polynomial time (Mirrokni et al., 2017; Schrijver, 1998, Corollary 11.4), which means that we can randomly choose $a_{t}\in\mathcal{A}$ so that $\operatorname*{\mathbf{E}}[a_{t}|a^{\prime}_{t}]=a^{\prime}_{t}$. As for the calculation of $z_{t}$ satisfying (11), it is not clear at this point whether there is a computationally efficient way. Because we can compute the value of $\max_{x\in\mathcal{E}_{t}}\pi_{z,\mathcal{X}}(x)$ for any $z\in\mathcal{A}$ in polynomial time in $d$, we can find $z_{t}$ minimizing this value in $O(\mathrm{poly}(d)|\mathcal{A}|)$ time, which can be exponential in $d$.
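The brute-force search for $z_{t}$ described above can be sketched as follows. This is an illustration under assumptions made here: $\mathcal{A}=\{-1,1\}^{d}$ (so $\mathcal{X}=[-1,1]^{d}$ and $|\mathcal{A}|=2^{d}$), with the log-barrier of the box as $\psi$. The enumeration over all vertices is what makes the running time exponential in $d$.

```python
import itertools
import math


def hessian_diag(x):
    return [(1.0 - c) ** -2 + (1.0 + c) ** -2 for c in x]


def minkowski_box(z, x):
    """pi_{z,X}(x) for X = [-1, 1]^d."""
    max_step = math.inf
    for zi, xi in zip(z, x):
        ui = xi - zi
        if ui > 0:
            max_step = min(max_step, (1.0 - zi) / ui)
        elif ui < 0:
            max_step = min(max_step, (-1.0 - zi) / ui)
    return 0.0 if math.isinf(max_step) else 1.0 / max_step


def dikin_axis_points(x):
    """The set E_t = {x +/- lambda_i^{-1/2} e_i}."""
    points = []
    for i, lam in enumerate(hessian_diag(x)):
        for sign in (1.0, -1.0):
            p = list(x)
            p[i] += sign / math.sqrt(lam)
            points.append(p)
    return points


def best_reference_point(x_t):
    """Enumerate all 2^d vertices and return (z_t, r_t) with the smallest r_t."""
    points = dikin_axis_points(x_t)
    best = None
    for z in itertools.product((-1.0, 1.0), repeat=len(x_t)):
        r = max(minkowski_box(z, p) for p in points)
        if best is None or r < best[1]:
            best = (z, r)
    return best
```

In this example, when $x_{t}$ is close to a vertex, that vertex is selected and $r_{t}$ is small; at the center of the box, $r_{t}$ is much closer to $1$.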

4 Analysis

4.1 Regret bounds for the proposed algorithm

Theorem 2 (Regret bounds in the adversarial regime).

Let $L^{*}$ , $Q$ and $P$ be parameters defined as in Table 1. The regret for Algorithm 1 is bounded as

[TABLE]

Further, if $f_{t}(a)\geq 0$ for any $a\in\mathcal{A}$ and $t\in[T]$ , we have

[TABLE]

Note that the regret bounds in Theorem 2 are valid regardless of the choice of $z_{t}$ . In fact, we can demonstrate these regret bounds even if we sample $a_{t}$ from $\mathcal{E}_{t}$ , as is similarly done with the algorithm by Abernethy et al. (2008b), which corresponds to $r_{t}=1$ . By way of contrast, to show $O(\log T)$ -regret bounds for stochastic environments, we need the assumption of (11). Under this assumption, we have the following regret bounds:

Theorem 3 (Regret bounds in the corrupted stochastic regime).

Let $\ell^{*}\in\mathbb{R}^{d}$ and denote $C=\sum_{t=1}^{T}\|\ell_{t}-\ell^{*}\|_{2}$ . Define $a^{*}\in\operatorname*{arg\,min}_{a\in\mathcal{A}}\left\langle\ell^{*},a\right\rangle$ and $\Delta_{\min}=\min_{a\in\mathcal{A}\setminus\{a^{*}\}}\left\langle\ell^{*},a-a^{*}\right\rangle$ . We have $R_{T}=O\left(d\sqrt{\left(C+\sum_{t=1}^{T}\sigma_{t}^{2}\right)\vartheta\log T}+d\vartheta\log T\right)$ , where we define $\sigma_{t}^{2}=\max_{a\in\mathcal{A}}\operatorname*{\mathbf{E}}[(\varepsilon_{t}(a))^{2}]$ . Further, if $a^{*}\in\operatorname*{arg\,min}_{a\in\mathcal{A}}\left\langle\ell^{*},a\right\rangle$ exists uniquely, under the assumption of (11), we have

[TABLE]

where $\sigma^{2}=\max_{t\in[T]}\sigma_{t}^{2}$ .

Remark 3.

In standard settings of the stochastic regime, it is assumed that $f_{t}$ follows an identical distribution for different rounds and $\operatorname*{\mathbf{E}}[\varepsilon_{t}(a)|(a_{s})_{s=1}^{t-1}]=0$ for all $a$ . Such assumptions are not, however, needed in Theorem 3. In other words, even when $\xi_{t}=\operatorname*{\mathbf{E}}[\varepsilon_{t}(a)|(a_{s})_{s=1}^{t-1}]$ is non-zero and changes depending on $t$ , we still have the $O(\log T)$ -regret bounds given in Theorem 3.

4.2 Proof sketch

Regret bounds in Theorems 2 and 3 are derived from the following lemma:

Lemma 5.

The regret for Algorithm 1 is bounded as follows:

[TABLE]

In proving this lemma, we use Lemmas 1, 3 and 4. From Lemma 4, the stability term $\left\langle\hat{\ell}_{t}-m_{t},x_{t}-\tilde{x}_{t+1}\right\rangle-D_{\psi_{t}}(\tilde{x}_{t+1},x_{t})$ in Lemma 1 is bounded by $\frac{2}{\beta_{t}}\|\hat{\ell}_{t}-m_{t}\|_{x_{t},\psi}^{*2}=\frac{2}{\beta_{t}}d^{2}g_{t}(m_{t})$. From Lemma 3, we can bound the penalty term $\psi_{T+1}(x^{*})$ in Lemma 1 as $\psi_{T+1}(x^{*})\leq\beta_{T+1}\vartheta\log T$. Combining these bounds, we obtain $R_{T}=O\left(\operatorname*{\mathbf{E}}\left[d^{2}\sum_{t=1}^{T}\frac{g_{t}(m_{t})}{\beta_{t}}+\beta_{T+1}\vartheta\log T\right]\right)$. From this and the definition of $\beta_{t}$ given in (13), we have the regret bound in Lemma 5. A complete proof of this lemma is given in Section D in the appendix.

From the result of tracking linear experts (Herbster and Warmuth, 2001), we obtain the following upper bound on $\sum_{t=1}^{T}g_{t}(m_{t})$ .

Lemma 6.

If $m_{t}$ is given by (14), it holds for any sequence $(u_{t})_{t=1}^{T+1}\in(B_{2}^{d}(1))^{T+1}$ that

[TABLE]

This lemma is a special case of Theorem 11.4 by Cesa-Bianchi and Lugosi (2006).

Proof sketch of Theorem 2

By substituting $u_{t}=\bar{\ell}\in\operatorname*{arg\,min}_{\ell}\sum_{t=1}^{T}\|\ell_{t}-{\ell}\|_{2}^{2}$ for all $t$ in (19), we obtain $\operatorname*{\mathbf{E}}\left[\sum_{t=1}^{T}g_{t}(m_{t})\right]=O\left(Q+\operatorname*{\mathbf{E}}\left[\sum_{t=1}^{T}(\varepsilon_{t}(a_{t}))^{2}\right]+1\right)$ . Similarly, by substituting $u_{t}=\ell_{t}$ for all $t$ in (19), we obtain $\operatorname*{\mathbf{E}}\left[\sum_{t=1}^{T}g_{t}(m_{t})\right]=O\left(P+\operatorname*{\mathbf{E}}\left[\sum_{t=1}^{T}(\varepsilon_{t}(a_{t}))^{2}\right]+1\right)$ . Combining these with Lemma 5, we obtain (15) in Theorem 2. Further, if $f_{t}(a)\geq 0$ , by substituting $u_{t}=0$ , we obtain $\operatorname*{\mathbf{E}}\left[\sum_{t=1}^{T}g_{t}(m_{t})\right]=O\left(L^{*}+R_{T}+1\right)$ , which leads to a regret bound of $R_{T}=O\left(d\sqrt{\vartheta\log T\left(L^{*}+R_{T}\right)}+d\vartheta\log T\right)$ . This implies that (16) in Theorem 2 holds.

Proof sketch of Theorem 3

By setting $u_{t}=\ell^{*}$ for all $t$ in (19), we obtain $\operatorname*{\mathbf{E}}\left[\sum_{t=1}^{T}g_{t}(m_{t})\right]=O\left(\operatorname*{\mathbf{E}}\left[C+\sum_{t=1}^{T}r_{t}\sigma_{t}^{2}+1\right]\right)$ . As we have $r_{t}\leq 1$ , from this bound and Lemma 5, we have $R_{T}=O\left(d\sqrt{(C+\sum_{t=1}^{T}\sigma_{t}^{2})\vartheta\log T}+d\vartheta\log T\right)$ . We also have the following regret bound:

[TABLE]

From the assumption of (11), $r_{t}$ is bounded as

[TABLE]

where the second inequality follows from $\mathcal{E}_{t}\subseteq W_{1}(x_{t})$ . The following lemma provides an upper bound on the right-hand side of this:

Lemma 7.

Suppose $a^{*}\in\operatorname*{arg\,min}_{a\in\mathcal{A}}\left\langle\ell^{*},a\right\rangle$ uniquely exists. It holds for any $y\in\mathrm{int}(\mathcal{X})$ that

[TABLE]

By combining this lemma with (20) and (21), we obtain a bound depending on $\sum_{t=1}^{T}\Delta(x_{t})$ as follows: $R_{T}=O\left(d\sqrt{\left(C+\frac{\kappa\sigma^{2}}{\Delta_{\min}}\operatorname*{\mathbf{E}}\left[\sum_{t=1}^{T}\Delta(x_{t})\right]\right)\vartheta\log T}+d\vartheta\log T\right).$ On the other hand, regret is bounded from below as $R_{T}(a^{*})\geq\operatorname*{\mathbf{E}}\left[\sum_{t=1}^{T}\Delta(x_{t})\right]-2C$ . By combining these two bounds on $R_{T}$ , we obtain

[TABLE]

As $X=O(\sqrt{AX}+B)$ implies $X=O(A+B)$ , we have

[TABLE]

which means that (17) holds. A complete proof is given in Section G of the appendix.
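The implication from $X=O(\sqrt{AX}+B)$ to $X=O(A+B)$ used in the last step follows from the AM-GM inequality. As a short worked derivation, assuming $X,A,B\geq 0$ and writing $c>0$ for the constant hidden in the $O(\cdot)$ notation (a symbol introduced here for illustration):

$$
\begin{aligned}
X\leq c\left(\sqrt{AX}+B\right)
&\;\Longrightarrow\; X\leq\tfrac{1}{2}X+\tfrac{c^{2}}{2}A+cB
&&\text{since } c\sqrt{AX}=\sqrt{c^{2}A\cdot X}\leq\tfrac{1}{2}\left(c^{2}A+X\right)\\
&\;\Longrightarrow\; X\leq c^{2}A+2cB=O(A+B).
\end{aligned}
$$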

Appendix A Related Work

Best-of-Both-Worlds Bandit Algorithms

Best-of-both-worlds algorithms have been developed for various settings of multi-armed bandit (MAB) problems, including the standard MAB problem [Bubeck and Slivkins, 2012, Seldin and Slivkins, 2014, Zimmert and Seldin, 2021, Ito et al., 2022, Honda et al., 2023], combinatorial semi-bandits [Zimmert et al., 2019, Ito, 2021, Tsuchiya et al., 2023b], partial monitoring problems [Tsuchiya et al., 2023a], episodic Markov decision processes [Jin and Luo, 2020, Jin et al., 2021], and linear bandits [Lee et al., 2021]. While most of these studies focus only on high-level adaptability, the algorithms by Ito et al. [2022], Tsuchiya et al. [2023a] for the MAB problem and combinatorial semi-bandit problems have low-level adaptability as well, similarly to our proposed algorithm. In fact, their algorithms are best-of-three-worlds algorithms with multiple data-dependent regret bounds as well as variance-adaptive regret bounds. Their algorithms are also similar to ours in that they are based on the optimistic follow-the-regularized-leader approach with an adaptive learning rate. As the class of linear bandit problems includes the multi-armed bandit problem, the results in this paper can be interpreted as an extension of their results. Regret bounds by Ito et al. [2022] are, however, better than ours in terms of the dependency on the dimensionality of the action set (or the number of arms) and in that they depend on arm-wise sub-optimality gaps.

Adversarial Corruption

There are several studies on the stochastic environment with adversarial corruption in the linear bandit problem [Li et al., 2019, Bogunovic et al., 2021, Lee et al., 2021] and in sibling problems such as multi-armed bandits [Lykouris et al., 2018, Gupta et al., 2019, Zimmert and Seldin, 2021, Yang et al., 2020] and linear Markov decision processes [Lykouris et al., 2021]. These studies and this paper differ in their assumptions and in their definitions of regret. This paper and Lee et al. [2021] assume that the corruption depends only on information in the past rounds and is an affine function of the chosen action. On the other hand, Li et al. [2019], Bogunovic et al. [2021] allow the corruption to be any (possibly non-linear) function. Furthermore, Bogunovic et al. [2021] consider corruption that depends on the action chosen in that round. We also note that the definitions of the corruption level in these studies are slightly different. While this paper includes the corruption in the regret, Li et al. [2019], Bogunovic et al. [2021], Lee et al. [2021] do not. It is known that we can convert one to the other at the cost of an additional $O(C)$ regret. Moreover, the regret bounds in these existing studies have linear terms with respect to $C$. Thus, our regret bound for the corrupted stochastic regime has the same dependence on the corruption as in these studies, but not vice versa.

Misspecified Linear Contextual Bandits

The corrupted stochastic regime is a special case of the misspecified linear contextual bandits without knowledge of the misspecification [Lattimore et al., 2020, Foster et al., 2020, Pacchiano et al., 2020, Takemura et al., 2021, Krishnamurthy et al., 2021]. (Note that some studies assume an oblivious adversary [Lattimore et al., 2020, Foster et al., 2020, Krishnamurthy et al., 2021], i.e., the approximation errors do not depend on the actions chosen in the past.) This problem assumes that the expected loss functions can be approximated by a linear function. While the approximation error can be any function of the information in the past and the current rounds in general, the corrupted stochastic regime assumes that the approximation error is an affine function of the action chosen in the current round. It is an open question whether the proposed algorithm can obtain a regret upper bound similar to the known regret bounds for this problem when the approximation error can be non-linear.

Appendix B Pseudocode of the proposed algorithm

Appendix C Proof of Lemma 4

For the convex function $\psi$ and $x\in\mathrm{dom}(\psi)$ , denote the Newton decrement at point $x$ by $\lambda(x,\psi)$ , i.e., $\lambda(x,\psi)=\|\nabla\psi(x)\|_{x,\psi}^{*}$ .

Lemma 8 (Theorem 2.2.1 by Nesterov and Nemirovskii [1994]).

Let $\mathcal{S}$ be an open non-empty convex subset of a finite-dimensional real vector space. Let $\psi$ be a self-concordant function on $\mathcal{S}$ and $x\in\mathcal{S}$ . Then, for each $y\in\mathcal{S}$ such that $\|x-y\|_{x,\psi}<1$ , we have

[TABLE]

Lemma 9 ((2.21) by Nemirovski [2004]).

Let $\psi$ be a self-concordant function on $\mathcal{X}$ . If $\lambda(x,\psi)<1$ , we have

[TABLE]

where $x^{*}\in\operatorname*{arg\,min}_{y}\psi(y)$ .

Lemma 10.

Let $\psi$ be a self-concordant function on $\mathcal{X}$ and $x,y\in\mathrm{int}(\mathcal{X})$ . Suppose that $\|x-y\|_{x,\psi}\leq 1/2$ . Then, we have

[TABLE]

for all $\ell\in\mathbb{R}^{d}$ and $\beta>0$ .

Proof.

Using the Cauchy-Schwarz inequality and the AM-GM inequality, we have

[TABLE]

Thus, it is sufficient to show $D_{\psi}(y,x)\geq\frac{1}{8}\|x-y\|_{x,\psi}^{2}$ .

By Taylor’s theorem, we have $D_{\psi}(y,x)=\frac{1}{2}\|x-y\|_{\xi,\psi}^{2}$ for some $\xi=x+\alpha(y-x)$ where $\alpha\in(0,1)$ . It follows from Lemma 8 that

[TABLE]

∎

Proof of Lemma 4

Let $f(y)=D_{\psi}(y,x)-\langle\ell,x-y\rangle/\beta$ . Since $\psi$ is self-concordant, there exists $y^{*}\in\mathrm{int}(\mathcal{X})$ such that $y^{*}\in\operatorname*{arg\,min}_{y\in\mathcal{X}}f(y)$ . If we have $\lambda(x,f)\leq 1/3$ , by Lemma 9, we obtain

[TABLE]

Thus, we obtain

[TABLE]

where the first inequality holds due to $y^{*}\in\operatorname*{arg\,min}_{y\in\mathcal{X}}f(y)$ and the second inequality is derived from Lemma 10. Hence, it suffices to show $\lambda(x,f)\leq 1/3$ . By the definition of $f$ , we have $\nabla f(x)=\ell/\beta$ . Thus, we obtain

[TABLE]

where the inequality is obtained by the assumption. ∎

Appendix D Proof of Lemma 5

We first show that $\operatorname*{\mathbf{E}}\left[a_{t}|x_{t}\right]=x_{t}$ . The expectation of $a_{t}$ is

[TABLE]

where the fourth equality follows from $\operatorname*{\mathbf{E}}[v_{t}]=0$.

Let us next show that $\hat{\ell}_{t}$ defined by (10) is an unbiased estimator of $\ell_{t}$ . We have

[TABLE]

where we used $v_{t}^{2}=1$ , $\operatorname*{\mathbf{E}}[v_{t}]=0$ , and the fact that $v_{t}$ and $m_{t}$ are independent in the fifth equality.

Suppose that $\min_{x\in\mathcal{X}}\psi(x)=0$ holds without loss of generality. Let $x_{0}\in\operatorname*{arg\,min}_{x\in\mathcal{X}}\psi(x)$ . Given $a^{*}\in\mathcal{A}$ , define $x^{*}$ by

[TABLE]

From this, (23) and (24), we have

[TABLE]

Then, as we have $x_{0}+(1-1/T)^{-1}(x^{*}-x_{0})=x_{0}+(a^{*}-x_{0})=a^{*}\in\mathcal{A}$ , we have $\pi_{x_{0}}(x^{*})\leq 1-1/T$ . Hence, from Lemma 3, we have

[TABLE]

From this, (25) and Lemma 1, we have

[TABLE]

The part of $\left\langle\hat{\ell}_{t}-m_{t},x_{t}-x^{\prime}_{t+1}\right\rangle-\beta_{t}D(x^{\prime}_{t+1},x_{t})$ can be bounded by using Lemma 4. From the definition (10), we have

[TABLE]

Hence, if $\beta_{t}\geq 6d$ , we have $\|\hat{\ell}_{t}-m_{t}\|_{x_{t},\psi}^{*}\leq\beta_{t}/3$ , and, consequently, we can apply Lemma 4 to bound the stability term as follows:

[TABLE]

where $g_{t}(m)$ is defined in (12). Then, from this and (26), we have

[TABLE]

If $\beta_{t}$ is given by (13), we then have

[TABLE]

which yields

[TABLE]

We also have

[TABLE]

from the definition (13) of $\beta_{t}$ . Combining this with (28) and (29), we obtain

[TABLE]

which completes the proof.

Appendix E Proof of Theorem 2

Fix $\eta\in(0,1/4)$ arbitrarily. By substituting $u_{t}=\bar{\ell}\in\operatorname*{arg\,min}_{\ell}\sum_{t=1}^{T}\|\ell_{t}-{\ell}\|_{2}^{2}$ for all $t$ in (19), we obtain

[TABLE]

Similarly, by substituting $u_{t}=\ell_{t}$ , we obtain

[TABLE]

By combining these with Lemma 5 and applying Jensen’s inequality, we obtain (15). Further, if $f_{t}(a)\geq 0$ , by substituting $u_{t}=0$ , we obtain

[TABLE]

By combining this with Lemma 5, we obtain

[TABLE]

which implies that (16) holds. ∎

Appendix F Proof of Lemma 7

As $\mathcal{X}$ is the convex hull of $\mathcal{A}^{\prime}=\{a^{*}\}\cup\mathrm{conv}(\mathcal{A}\setminus\{a^{*}\})$, any point $x\in\mathcal{X}$ can be expressed as a convex combination of $a^{*}$ and a point in $\mathrm{conv}(\mathcal{A}\setminus\{a^{*}\})$, which means that there exist $\lambda\in[0,1]$ and $x^{\prime}\in\mathrm{conv}(\mathcal{A}\setminus\{a^{*}\})$ such that $x=\lambda x^{\prime}+(1-\lambda)a^{*}$. For such $x$, we have

[TABLE]

In fact, we have

[TABLE]

which means that (30) holds. We further have

[TABLE]

where the last inequality follows from the fact that $x^{\prime}\in\mathrm{conv}(\mathcal{A}\setminus\{a^{*}\})$ and the definition of $\Delta_{\min}$ . Combining this with (30), we obtain

[TABLE]

We next show

[TABLE]

As $W_{1}(y)$ is an ellipsoid centered at $y$ , it holds that

[TABLE]

We hence have

[TABLE]

where the inequality follows from the fact that $W_{1}(y)\subseteq\mathcal{X}$ . Combining (31) and (32), we obtain

[TABLE]

Appendix G Proof of Theorem 3

From Lemma 6 with $u_{t}=\ell^{*}$ , we have

[TABLE]

From this and Lemma 5, we have

[TABLE]

Under the assumption of (11), we have

[TABLE]

where the second inequality follows from $\mathcal{E}_{t}\subseteq W_{1}(x_{t})$ and the last inequality follows from Lemma 7. From this, (33) and $\sigma^{2}=\max_{t\in[T]}\sigma_{t}^{2}$, we have

[TABLE]

On the other hand, $R_{T}$ is bounded from below as follows:

[TABLE]

Combining this with (34), we obtain

[TABLE]

As $X=O(\sqrt{AX}+B)$ implies $X=O(A+B)$ , we have

[TABLE]

Bibliography

Abernethy et al. [2008a] J. Abernethy, E. E. Hazan, and A. Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. In 21st Annual Conference on Learning Theory, COLT 2008, pages 263–273, 2008a.
Abernethy et al. [2008b] J. D. Abernethy, E. Hazan, and A. Rakhlin. An efficient algorithm for bandit linear optimization. In 21st Annual Conference on Learning Theory, 2008b.
Abernethy et al. [2012] J. D. Abernethy, E. Hazan, and A. Rakhlin. Interior-point methods for full-information and bandit online learning. IEEE Transactions on Information Theory, 58(7):4164–4175, 2012.
Audibert et al. [2007] J.-Y. Audibert, R. Munos, and C. Szepesvári. Tuning bandit algorithms in stochastic environments. In Algorithmic Learning Theory: 18th International Conference, ALT 2007, Sendai, Japan, October 1-4, 2007. Proceedings 18, pages 150–165. Springer, 2007.
Bogunovic et al. [2021] I. Bogunovic, A. Losalka, A. Krause, and J. Scarlett. Stochastic linear bandits robust to adversarial attacks. In International Conference on Artificial Intelligence and Statistics, pages 991–999. PMLR, 2021.
Bubeck and Slivkins [2012] S. Bubeck and A. Slivkins. The best of both worlds: Stochastic and adversarial bandits. In Conference on Learning Theory, pages 42–1. JMLR Workshop and Conference Proceedings, 2012.
Bubeck et al. [2012] S. Bubeck, N. Cesa-Bianchi, and S. Kakade. Towards minimax policies for online linear optimization with bandit feedback. In Conference on Learning Theory, volume 23, pages 41.1–41.14, 2012.
Cesa-Bianchi and Lugosi [2006] N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006.
