Batch-Size Independent Regret Bounds for the Combinatorial Multi-Armed Bandit Problem

Nadav Merlis; Shie Mannor

arXiv:1905.03125·cs.LG·June 9, 2020

TL;DR

This paper introduces a new smoothness criterion for the combinatorial multi-armed bandit problem, enabling tighter regret bounds that are independent of batch size, especially in nonlinear reward settings like probabilistic maximum coverage.

Contribution

The authors propose Gini-weighted smoothness, a novel criterion that improves regret bounds for CMAB algorithms by removing batch-size dependence in nonlinear reward scenarios.

Findings

1. Achieves batch-size independent regret bounds in nonlinear CMAB problems.
2. Provides dramatic improvements in upper bounds for probabilistic maximum coverage.
3. Proves matching lower bounds, demonstrating tightness of the proposed algorithm.

Nadav Merlis, Faculty of Electrical Engineering, Technion, Israel Institute of Technology

Shie Mannor, Faculty of Electrical Engineering, Technion, Israel Institute of Technology

Abstract

We consider the combinatorial multi-armed bandit (CMAB) problem, where the reward function is nonlinear. In this setting, the agent chooses a batch of arms on each round and receives feedback from each arm of the batch. The reward that the agent aims to maximize is a function of the selected arms and their expectations. In many applications, the reward function is highly nonlinear, and the performance of existing algorithms relies on a global Lipschitz constant to encapsulate the function’s nonlinearity. This may lead to loose regret bounds, since by itself, a large gradient does not necessarily cause a large regret, but only in regions where the uncertainty in the reward’s parameters is high. To overcome this problem, we introduce a new smoothness criterion, which we term Gini-weighted smoothness, that takes into account both the nonlinearity of the reward and concentration properties of the arms. We show that a linear dependence of the regret in the batch size in existing algorithms can be replaced by this smoothness parameter. This, in turn, leads to much tighter regret bounds when the smoothness parameter is batch-size independent. For example, in the probabilistic maximum coverage (PMC) problem, that has many applications, including influence maximization, diverse recommendations and more, we achieve dramatic improvements in the upper bounds. We also prove matching lower bounds for the PMC problem and show that our algorithm is tight, up to a logarithmic factor in the problem’s parameters.

Keywords: Multi-Armed Bandits, Combinatorial Bandits, Probabilistic Maximum Coverage, Gini-Weighted Smoothness, Empirical Bernstein

1 Introduction

The multi-armed bandit (MAB) problem is one of the most elementary problems in decision making under uncertainty. Under this setting, the agent must choose an action (or arm) on each round, out of $L$ possible actions. It then observes the chosen arm’s reward, which is generated from some fixed unknown distribution, and aims to maximize the average cumulative reward (Robbins, 1952). Equivalently, the agent minimizes its expected cumulative regret, i.e., the difference between the best achievable reward and the agent’s cumulative reward. This framework enables us to understand and control the trade-off between information gathering (‘exploration’) and reward maximization (‘exploitation’), and many of the current reinforcement learning algorithms are based on exploration concepts that originate from MAB (Jaksch et al., 2010; Bellemare et al., 2016; Osband et al., 2013; Gopalan and Mannor, 2015).

One active research direction in MAB is to extend the model to support more complicated feedback from the environment and more complex reward functions. An important extension that follows this direction is the combinatorial multi-armed bandit (CMAB) problem with semi-bandit feedback (Chen et al., 2016a). Instead of choosing a single arm, the agent selects a subset of the arms $A_{t}$ (a ‘batch’), and observes feedback $X_{a}(t)$ from each of the arms $a\in A_{t}$ (‘semi-bandit feedback’). The reward can be a general function of the expectations $\mu_{a}=\mathbb{E}\left[X_{a}(t)\right]$ , with the linear function as the most common example: $r(A_{t};\mu)=\sum_{a\in A_{t}}\mu_{a}$ for any batch of size $\lvert A_{t}\rvert\leq K$ .

Another common case that falls under the CMAB framework, and will benefit from the results of this paper, is the bandit version of the probabilistic maximum coverage (PMC) problem. Under this setting, each arm is a random set that may contain some subset of $M$ possible items, according to a fixed probability distribution. On each round, the agent chooses a batch of $K$ sets and aims to maximize the number of items that appear in any of the sets (i.e., the size of the union of the sets). In the bandit setting, we assume that the probabilities that items appear in a set are unknown and aim to maximize the item coverage while concurrently learning the probabilities. Variants of the PMC bandit problem have many practical applications, such as influence maximization (Vaswani et al., 2015), ranked recommendations (Kveton et al., 2015a), wireless channel monitoring (Arora et al., 2011), online advertisement placement (Chen et al., 2016a) and more.

Although existing algorithms offer solutions to the CMAB framework under very general assumptions (Chen et al., 2016a, b), there are several issues that still pose a major challenge in the design and analysis of practical algorithms. Notably, most of the existing algorithms quantify the nonlinearity of the function using a global Lipschitz constant. However, a large gradient alone does not necessarily translate into a large regret; what matters is the combined influence of the gradient size and the local parameter uncertainty. More specifically, if there are regions in which tight parameter estimates can be derived, the regret will not be large even if the gradients are large. Conversely, in regions where the parameter uncertainty is large, smaller gradients can still cause a large regret. For example, consider the reward function $r(A;\mu)=\sum_{a\in A}\min\left\{c\mu_{a},1\right\}$ , for $c\gg 1$ and $\mu_{a}\in\left[0,1\right]$ , and its translated version $r^{\prime}(A;\mu)=\sum_{a\in A}\max\left\{\min\left\{c\left(\mu_{a}-\frac{1}{2}\right),1\right\},0\right\}$ . Despite the fact that both functions have the same Lipschitz constant, the regret in the first problem can be much smaller, since parameters are easier to estimate when they are close to the edge of their domain. Another similar example is the PMC bandit problem, in which the reward is approximately linear when the coverage probabilities are small, but declines exponentially when they are large. Thus, the naïve bound cannot capture the interaction between the gradient size and the parameter uncertainty, which results in loose regret bounds.

In this paper, we aim to utilize this principle for the CMAB framework. To this end, we introduce a new smoothness criterion called Gini-weighted smoothness. This criterion takes into account both the nonlinearity of the reward function and the concentration properties of bounded random variables around the edges of their domain. We then suggest an upper confidence bound (UCB) based strategy (Auer et al., 2002a), but replace the classical Hoeffding-type confidence intervals with ones that depend on the empirical variance of the arms, and are based on the Empirical Bernstein inequality (Audibert et al., 2009). We show that Bernstein-type bounds capture similar properties to the Gini-weighted smoothness, and thus allow us to derive tighter regret bounds. Notably, the linear dependence of the regret bound on the batch size is almost completely removed, except for a logarithmic factor, and the batch size only affects the regret through the Gini-smoothness parameter. In problems in which this parameter is batch-size independent, including the PMC bandit problem, our new bound is tighter by a factor of the batch size $K$ . This is comparable to the best possible improvement due to an independence assumption in the linear CMAB problem (Degenne and Perchet, 2016), but without any additional statistical assumptions.

Moreover, we demonstrate the tightness of our regret bounds by proving matching lower bounds for the PMC bandit problem, up to logarithmic factors in the batch size. To do so, we construct an instance of the PMC bandit problem that is equivalent to a classical MAB problem and then analyze the lower bounds of this problem. We also show that in contrast to the linear CMAB problem, the lower bounds do not change even if different sets are independent. To the best of our knowledge, our algorithm is the first to achieve tight regret bounds for the PMC bandit problem.

2 Related Work

The multi-armed bandit literature is vast. We thus cover only some aspects of the area, and refer the reader to (Bubeck et al., 2012) and (Lattimore and Szepesvári, 2018) for comprehensive surveys. We employ the Optimism in the Face of Uncertainty (OFU) principle (Lai and Robbins, 1985), which is one of the most fundamental concepts in MAB, and can be found in many known MAB algorithms (e.g., Auer et al. 2002a; Garivier and Cappé 2011). While many algorithms rely on Hoeffding-type concentration bounds to derive an upper confidence bound (UCB) of an arm, a few previous works also apply Bernstein-type bounds and demonstrate superior performance, both in theory and in practice (Audibert et al., 2009; Mukherjee et al., 2018).

The general stochastic combinatorial multi-armed bandit framework was introduced in (Chen et al., 2013). They presented CUCB, an algorithm that uses UCB per arm (or ‘base arm’), and then inputs the optimistic value of the arm into a maximization oracle. Many preceding works also fall under the CMAB framework, but mainly focus on a specific reward function (Caro and Gallien, 2007; Gai et al., 2010, 2012; Liu and Zhao, 2012), or work in the adversarial setting (Cesa-Bianchi and Lugosi, 2012). While most algorithms for the CMAB setting follow the OFU principle, a few employ Thompson Sampling (Thompson, 1933), e.g., (Wang and Chen, 2018; Hüyük and Tekin, 2018) for the semi-bandit feedback and (Gopalan et al., 2014) for the full-bandit feedback. In recent years, there have been extensive studies on deriving tighter bounds, but these works mostly address the linear CMAB problem (Kveton et al., 2014, 2015c; Combes et al., 2015b; Degenne and Perchet, 2016). More recently, tighter regret bounds that also allow probabilistically triggered arms were derived for this framework in (Wang and Chen, 2017). Nevertheless, we show that in our setting, these bounds can be improved by a factor of the batch size $K$ for many problems, e.g., the PMC bandit problem, and are comparable, up to logarithmic factors, otherwise.

The Empirical Bernstein inequality was first used for the CMAB problem in (Gisselbrecht et al., 2015) for linear reward functions, and was later used in (Perrault et al., 2018) for sequential search-and-stop problems. Both works focus on specific reward functions and utilize the Bernstein inequality to get variance-dependent regret bounds. In contrast, we analyze general reward functions and exploit the relation between the confidence interval and the reward function to derive tighter regret bounds.

The PMC bandit problem is the bandit version of the maximum coverage problem (Hochbaum, 1996), a well studied subject in computer science with many variants and extensions. The bandit variant is closely related to the influence maximization problem (Vaswani et al., 2015; Carpentier and Valko, 2016; Wen et al., 2017), in which the agent chooses a set of nodes in a graph, that influence other nodes through random edges, and aims to maximize the number of influenced nodes. Another related setting is the cascading bandit problem (e.g., Kveton et al. 2015a, b; Combes et al. 2015a; Lattimore et al. 2018), in which a list of items is sequentially shown to a user until she finds one of them satisfactory. This is equivalent to a coverage problem with a single object, but only partial feedback - the user will not give any feedback about items that appear after the one she liked. In both settings, the focus is very different than ours. In influence maximization, the focus is on the diffusion inside a graph and the graph structure, and in cascading bandits on the partial feedback and the list ordering. They are thus complementary to our framework and could benefit from our results.

3 Preliminaries

We work under the stochastic combinatorial semi-bandit framework, where the reward is the weighted sum of smooth monotonic functions. Assume that there are $L$ arms (‘base arms’), and let $\mathcal{A}\subset 2^{\left[L\right]}$ be the action set, i.e., the collection of possible batches (actions) from which the agent can choose, with $\left[L\right]=\left\{1,\dots,L\right\}$ . Also assume that the size of any batch $A\in\mathcal{A}$ is bounded by $\lvert A\rvert\leq K$ . Denote the reward function of an action $A$ with arm parameters $p$ by $r(A;p)$ and assume that the reward is the weighted sum of $M$ functions, $r(A;p)=\sum_{i=1}^{M}w_{i}r_{i}(A;p_{i})$ , for some fixed weights $\left\{{w_{i}}\right\}_{i=1}^{M},w_{i}\geq 0$ and $p_{i}\in\mathbb{R}^{L}$ . Without loss of generality, we also assume $r_{i}(A;x)\geq 0$ .

The agent interacts with the environment as follows: on each round $t$ , the agent chooses an action $A_{t}\in\mathcal{A}$ . Then, for each arm $j\in A_{t}$ , it observes feedback $X_{ij}(t)\in\left[0,1\right]$ for any $i\in\left[M\right]$ , with mean $\mathbb{E}\left[X_{ij}(t)\right]=p_{ij}$ . For ease of notation, assume that $X_{ij}(t)=0\,$ if $\,j\notin A_{t}$ . We denote the empirical estimators of the parameters $p_{ij}$ by $\hat{p}_{ij}(t)=\frac{1}{N_{j}(t)}\sum_{\tau=1}^{t}X_{ij}(\tau)$ , where $N_{j}\left(t\right)=\sum_{\tau=1}^{t}\mathds{1}\left\{j\in A_{\tau}\right\}$ is the number of times an arm $j$ was chosen up to time $t$ , and $\mathds{1}\left\{\cdot\right\}$ is the indicator function. We also denote the empirical variance by $\hat{V}_{ij}(t)=\frac{1}{N_{j}(t)}\sum_{\tau=1}^{t}X_{ij}^{2}(\tau)-\left(\hat{p}_{ij}(t)\right)^{2}$ . The estimated parameters $\hat{p}_{ij}(t)$ are concentrated around their mean according to the Empirical Bernstein inequality:

Lemma 3.1 (Empirical Bernstein).

(Audibert et al., 2009)

Let $X_{1},\dots,X_{s}$ be i.i.d. random variables taking their values in $[0,1]$ , and let $\mu=\mathbb{E}\left[X_{i}\right]$ be their common expected value. Consider the empirical mean $\bar{X}_{s}$ and variance $V_{s}$ defined respectively by

$\bar{X}_{s}=\frac{1}{s}\sum_{i=1}^{s}X_{i}\enspace$ and $\enspace V_{s}=\frac{1}{s}\sum_{i=1}^{s}\left(X_{i}-\bar{X}_{s}\right)^{2}=\frac{1}{s}\left(\sum_{i=1}^{s}X_{i}^{2}\right)-\bar{X}_{s}^{2}$ .

Then, for any $s\in\mathbb{N}$ and $x>0$ , it holds that $\Pr\left\{\lvert\bar{X}_{s}-\mu\rvert\geq\sqrt{\frac{2V_{s}x}{s}}+\frac{3x}{s}\right\}\leq 3e^{-x}$ .
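The deviation term in Lemma 3.1 can be computed directly from a list of observations. The following is a minimal sketch, not code from the paper: the function names are hypothetical, and the choice $x=\log(3/\delta)$ is an assumption made here so that the failure probability $3e^{-x}$ equals a target $\delta$.

```python
import math

def empirical_mean_var(samples):
    """Return the empirical mean and (biased) empirical variance of samples in [0, 1]."""
    s = len(samples)
    mean = sum(samples) / s
    var = sum(x * x for x in samples) / s - mean * mean
    return mean, max(var, 0.0)  # clamp tiny negative values caused by rounding

def bernstein_radius(var, s, x):
    """Deviation sqrt(2 V x / s) + 3 x / s, as stated in Lemma 3.1."""
    return math.sqrt(2.0 * var * x / s) + 3.0 * x / s

# Ten observations of a low-mean arm (illustrative data only).
samples = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
mean, var = empirical_mean_var(samples)

# Assumption: pick x so that 3 * exp(-x) = delta, with delta = 0.05.
delta = 0.05
radius = bernstein_radius(var, len(samples), x=math.log(3.0 / delta))
```

With so few samples the $3x/s$ term dominates the radius; the variance-dependent term takes over only once $s$ is large compared with $x/V_{s}$.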

We require the functions $r_{i}(A;p_{i})$ to be monotonic Gini-smooth with smoothness parameters $\gamma_{\infty},\gamma_{g}$ , which we define in the following:

Definition 3.2.

Let $f(A;x):\mathcal{A}\times\left[0,1\right]^{L}\to\mathbb{R}$ be a differentiable function in $x\in\left(0,1\right)^{L}$ and continuous in $x\in\left[0,1\right]^{L}$ , for any $A\in\mathcal{A}$ . The function $f(A;x)$ is said to be monotonic Gini-smooth, with smoothness parameters $\gamma_{\infty}$ and $\gamma_{g}$ , if:

1. For any $A\in\mathcal{A}$ , the function is monotonically increasing with bounded gradient, i.e., for any $i\in A$ and $x\in\left(0,1\right)^{L}$ , $0\leq\frac{\partial f(A;x)}{\partial x_{i}}\leq\gamma_{\infty}$ . If $i\notin A$ , then $\frac{\partial f(A;x)}{\partial x_{i}}=0$ for all $x\in\left(0,1\right)^{L}$ .

2. For any $A\in\mathcal{A}$ and $x\in\left(0,1\right)^{L}$ , it holds that

$$\sqrt{\sum_{i=1}^{L}x_{i}(1-x_{i})\left(\frac{\partial f(A;x)}{\partial x_{i}}\right)^{2}}\leq\gamma_{g}.$$

Throughout the paper, we refer to this condition as the Gini-weighted smoothness of $f(A;x)$ . (Footnote 1: The name is motivated by the similarity of the weights to the Gini impurity $\sum_{i}x_{i}(1-x_{i})$ .)

While the first condition is very intuitive, and is equivalent to the standard smoothness requirement (Wang and Chen, 2017), the second condition demands further explanation. Notably, the Gini-smoothness parameter is less sensitive to changes in the reward function when the parameters are close to the edges, i.e., close to $0$ or $1$ . It may be observed that in these regions, the variances of the parameters are small, which implies that they are more concentrated around their mean. We will later show that this will allow us to mitigate the effect of large gradients on the regret. For simplicity, we assume that all of the functions $r_{i}(A;p_{i})$ have the same smoothness parameters, but the extension is trivial. We note that if for all $i\in\left[M\right]$ , the functions $r_{i}(A;p_{i})$ are Gini-smooth, then $r(A;p)$ is also Gini-smooth. Nevertheless, explicitly decomposing the reward into a sum of monotonic Gini-smooth functions leads to a slightly tighter regret bound. We believe that this is due to a technical artefact, but nonetheless modify our analysis since many important cases fall under this scenario.

An important example that falls under our model is the Probabilistic Maximum Coverage (PMC) problem. In this setting, each arm is a random set that may contain some subset of $M$ possible items. The agent’s goal is to choose a batch of sets such that as many items as possible appear in the sets (i.e., the size of the union of the sets is maximized). Formally, the functions $r_{i}(A;p)=1-\prod_{j\in A}\left(1-p_{ij}\right)$ are the probabilities that an item $i\in\left[M\right]$ was covered and $r(A;p)$ is the (weighted) expected number of covered items. The smoothness constants for this problem are $\gamma_{\infty}=1$ and $\gamma_{g}=\frac{1}{\sqrt{e}}$ , independently of the batch size. Another example is the logistic function $r_{i}(A;p)=\frac{1}{1+Ce^{-\sum_{j\in A}p_{ij}}}$ , for which the smoothness parameters are equal to $\gamma_{\infty}=\frac{1}{4}$ and $\gamma_{g}=\frac{1}{4}\sqrt{1+\log C}$ , which are batch-size independent, provided that $C$ does not depend on $K$ . In the linear case, $r_{i}(A;p)=r(A;p)=\sum_{j\in A}p_{j}$ and the smoothness parameters are $\gamma_{\infty}=1$ and $\gamma_{g}=\sqrt{K/4}$ . In general, we can always bound $\gamma_{g}\leq\frac{1}{2}\sqrt{K}\gamma_{\infty}$ , and we will later see that in this case, our bounds are comparable to existing results that only rely on $\gamma_{\infty}$ .
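The constants quoted above can be sanity-checked numerically. The sketch below assumes that the Gini-weighted condition has the square-root form $\sqrt{\sum_{i}x_{i}(1-x_{i})\left(\partial f/\partial x_{i}\right)^{2}}\leq\gamma_{g}$ , which is consistent with the general bound $\gamma_{g}\leq\frac{1}{2}\sqrt{K}\gamma_{\infty}$ . The helper names are hypothetical, and a random search only illustrates the bounds; it does not prove them.

```python
import math
import random

def pmc_gradient(x):
    """Partial derivatives of f(x) = 1 - prod_j (1 - x_j) with respect to each x_i."""
    grads = []
    for i in range(len(x)):
        prod = 1.0
        for j, xj in enumerate(x):
            if j != i:
                prod *= 1.0 - xj
        grads.append(prod)
    return grads

def gini_weighted_norm(x, grads):
    """sqrt(sum_i x_i (1 - x_i) (df/dx_i)^2), the left-hand side of the assumed condition."""
    return math.sqrt(sum(xi * (1.0 - xi) * g * g for xi, g in zip(x, grads)))

# Random search over batches of size 1..20 for the single-item PMC reward.
rng = random.Random(0)
worst_pmc = 0.0
for _ in range(2000):
    k = rng.randint(1, 20)
    x = [rng.random() for _ in range(k)]
    worst_pmc = max(worst_pmc, gini_weighted_norm(x, pmc_gradient(x)))

# Linear reward: every partial derivative equals 1; the weights peak at x_i = 1/2.
k = 16
linear_value = gini_weighted_norm([0.5] * k, [1.0] * k)
```

In this sketch the PMC value never exceeds $1/\sqrt{e}$ regardless of the batch size, while the linear value grows as $\sqrt{K/4}$ , which is the contrast the paragraph above describes.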

Similarly to previous work, the performance of an agent is measured according to its regret, i.e., the difference between the reward of an oracle, which acts optimally according to the statistics of the arms, and the cumulative reward of the agent. However, in many problems, it is impractical to achieve the optimal solution even when the problem’s parameters are known. Thus, it is more sensible to compare the performance of an algorithm to the best approximate solution that an oracle can achieve. Denote the optimal action by $A^{*}=\arg\max_{A\in\mathcal{A}}\left\{r(A;p)\right\}$ and its value by $r_{\max}$ . An oracle is called an $(\alpha,\beta)$ approximation oracle if for every parameter set $p$ , with probability $\beta$ , it outputs a solution whose value is at least $\alpha r_{\max}$ . We define the expected approximation regret of an algorithm as:

$$R(T)=\alpha\beta r_{\max}T-\sum_{t=1}^{T}\mathbb{E}\left[r(A_{t};p)\right],$$

where the expectation is over the randomness of the environment and the agent’s actions, through the oracle. As noted by Chen et al. (2016a), in the linear case, and sometimes when arms are independent, the reward function equals the expectation of the empirical reward, i.e., $r(A_{t};\mu)=\mathbb{E}\left[r(A_{t};X)\right]$ . Unfortunately, this does not necessarily hold when arms are arbitrarily correlated.

We end this section with some notations. Let $\bar{M}=\sum_{i=1}^{M}w_{i}$ . Denote the suboptimality gap of a batch $A_{t}$ by $\Delta_{A_{t}}=\alpha r_{\max}-r\left(A_{t};p\right)$ . The minimal gap of a base arm $j$ is the smallest positive gap of a batch $A$ that contains arm $j$ , namely, $\Delta_{j,\min}=\min_{A\in\mathcal{A},j\in A,\Delta_{A}>0}\Delta_{A}$ and the maximal gap is $\Delta_{j,\max}=\max_{A\in\mathcal{A},j\in A,\Delta_{A}>0}\Delta_{A}$ . Note that for all $A\in\mathcal{A}$ , $\Delta_{A}\leq\max_{j}\Delta_{j,\max}\triangleq\Delta_{\max}$ . We denote by $D_{\mathrm{KL}}(X,Y)$ the Kullback-Leibler (KL) divergence between two random variables $X,Y$ . Also denote by $\mathrm{kl}(p,q)=p\log\frac{p}{q}+(1-p)\log\frac{1-p}{1-q}$ , the KL divergence between two Bernoulli random variables of means $p$ and $q$ .
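The Bernoulli KL divergence defined above is straightforward to evaluate. This is a minimal sketch with a hypothetical function name; it assumes the usual convention that $0\log 0=0$ , which the text does not state explicitly.

```python
import math

def kl_bernoulli(p, q):
    """kl(p, q) = p log(p/q) + (1 - p) log((1 - p)/(1 - q)), with 0 log 0 taken as 0."""
    def term(a, b):
        if a == 0.0:
            return 0.0       # assumed convention: 0 * log(0 / b) = 0
        if b == 0.0:
            return math.inf  # positive mass where q has none
        return a * math.log(a / b)
    return term(p, q) + term(1.0 - p, 1.0 - q)

same = kl_bernoulli(0.5, 0.5)
shifted = kl_bernoulli(0.25, 0.5)
```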

4 Algorithm

We suggest a combinatorial UCB-type algorithm with Bernstein-based confidence intervals, which we call BC-UCB (Bernstein Combinatorial - Upper Confidence Bound). The UCB index is defined as

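$$q_{ij}(t)=\min\left\{1,\ \hat{p}_{ij}(t-1)+\sqrt{\frac{6\hat{V}_{ij}(t-1)\log t}{N_{j}(t-1)}}+\frac{9\log t}{N_{j}(t-1)}\right\}.$$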

Pseudocode can be found in Algorithm 1. On the first rounds, the agent samples batches to make sure that each base arm is sampled at least once. Afterwards, we calculate the UCB index $q$ , and an approximation oracle chooses an action $A_{t}$ such that $r(A_{t};q)\geq\alpha r(A^{*}(q);q)$ with probability $\beta$ . The agent then plays this action and observes feedback for every arm $j\in A_{t}$ . It finally updates the empirical probabilities and variances and continues to the next round.
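The index computation in the loop described above can be sketched as follows. This assumes the index has the form $\min\{1,\hat{p}+\sqrt{6\hat{V}\log t/N}+9\log t/N\}$ , i.e., the deviation of Lemma 3.1 with $x=3\log t$ , clipped at 1. The function name and the example statistics are hypothetical and are not taken from the paper's Algorithm 1.

```python
import math

def bc_ucb_index(p_hat, v_hat, n_pulls, t):
    """Optimistic estimate of p_ij at round t from statistics gathered up to round t - 1."""
    log_t = math.log(t)
    bonus = math.sqrt(6.0 * v_hat * log_t / n_pulls) + 9.0 * log_t / n_pulls
    return min(1.0, p_hat + bonus)  # parameters live in [0, 1], so clip the index

# One (item, arm) pair observed 200 times with binary feedback and 10 successes.
sum_x, sum_x2, n = 10.0, 10.0, 200
p_hat = sum_x / n
v_hat = sum_x2 / n - p_hat ** 2
q = bc_ucb_index(p_hat, v_hat, n, t=1000)

# A rarely sampled arm gets a large bonus, so its index is clipped to 1.
q_rare = bc_ucb_index(0.9, 0.09, 5, t=100)
```

Only running sums of $X_{ij}$ and $X_{ij}^{2}$ plus the pull counter $N_{j}$ are needed per arm, so each update after a round costs constant time per observed entry.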

An example of an approximation oracle that can be used when the reward function is monotonic and submodular is the greedy oracle, which enjoys an approximation factor of $\alpha=1-\frac{1}{e}$ with probability $\beta=1$ (Nemhauser et al., 1978). (Footnote 2: Let $\Omega$ be a finite set. A set function $f:2^{\Omega}\to\mathbb{R}$ is called submodular if $\forall S,T\subseteq\Omega$ , $f(S)+f(T)\geq f(S\cup T)+f(S\cap T)$ .) First, the oracle initializes $A_{t}(0)=\emptyset$ , and then selects $K$ items sequentially in a greedy manner, i.e. $j_{n}\in\arg\max_{j\notin A_{t}(n-1)}\left\{r\left(A_{t}(n-1)\cup\left\{j\right\};q\right)\right\}$ and $A_{t}(n)=A_{t}(n-1)\cup\left\{j_{n}\right\}$ , with $A_{t}=A_{t}(K)$ .
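The greedy selection rule above can be illustrated for the PMC reward. This is a sketch under stated assumptions: `pmc_reward` and `greedy_oracle` are hypothetical names, the index matrix `q` is indexed as `q[item][arm]`, and ties are broken by taking the lowest arm index, a detail the text leaves open.

```python
def pmc_reward(batch, q, weights):
    """Weighted expected coverage: sum_i w_i * (1 - prod_{j in batch} (1 - q[i][j]))."""
    total = 0.0
    for w, row in zip(weights, q):
        miss = 1.0
        for j in batch:
            miss *= 1.0 - row[j]
        total += w * (1.0 - miss)
    return total

def greedy_oracle(q, weights, k):
    """Build the batch one arm at a time, always adding the arm that maximizes the reward."""
    num_arms = len(q[0])
    batch = []
    for _ in range(min(k, num_arms)):
        candidates = [j for j in range(num_arms) if j not in batch]
        best = max(candidates, key=lambda j: pmc_reward(batch + [j], q, weights))
        batch.append(best)
    return batch

# Two items (rows) and three sets (columns); values are illustrative only.
q = [[0.9, 0.7, 0.1],
     [0.1, 0.2, 0.8]]
weights = [1.0, 1.0]
chosen = greedy_oracle(q, weights, k=2)
chosen_reward = pmc_reward(chosen, q, weights)
```

In this example the oracle first picks arm 0 and then arm 2, because arm 2 covers the second item, which arm 0 rarely contains; arm 1 mostly duplicates arm 0.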

The first term of the confidence interval resembles the standard UCB term of $\sqrt{\frac{3\log t}{2N_{j}(t-1)}}$ , and is actually always smaller, since the empirical variance of variables in $[0,1]$ is always bounded by $\frac{1}{4}$ . (Footnote 3: $V=\frac{1}{n}\sum_{i}X_{i}^{2}-\left(\frac{1}{n}\sum_{i}X_{i}\right)^{2}\leq\frac{1}{n}\sum_{i}X_{i}-\left(\frac{1}{n}\sum_{i}X_{i}\right)^{2}=\left(\frac{1}{n}\sum_{i}X_{i}\right)\left(1-\frac{1}{n}\sum_{i}X_{i}\right)\leq\frac{1}{4}$ .) The second term does not appear in the standard UCB bound, and can slightly affect the regret, since suboptimal arms will be asymptotically sampled $O\left(\log t\right)$ times, so the two terms are comparable. Nevertheless, the first term is still dominant, and if the variance of an arm is drastically lower than $\frac{1}{4}$ , the confidence bound is significantly tighter. This can happen, for example, when the arm’s mean is close to $0$ or $1$ .
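The comparison between the two kinds of confidence terms can be made concrete with a small numerical sketch. It assumes the Bernstein-type terms are $\sqrt{6V\log t/N}$ and $9\log t/N$ , and it plugs in the Bernoulli variance $p(1-p)$ in place of the empirical variance; the function names and the chosen values of $t$ and $N$ are illustrative only.

```python
import math

def hoeffding_term(n_pulls, t):
    """Standard UCB term sqrt(3 log t / (2 N))."""
    return math.sqrt(3.0 * math.log(t) / (2.0 * n_pulls))

def bernstein_terms(variance, n_pulls, t):
    """Variance-dependent first term and the additional second term."""
    log_t = math.log(t)
    return math.sqrt(6.0 * variance * log_t / n_pulls), 9.0 * log_t / n_pulls

t, n = 10_000, 5_000
baseline = hoeffding_term(n, t)
terms = {}
for p in (0.5, 0.1, 0.01):
    # Assumption: use the true Bernoulli variance p (1 - p) as a stand-in.
    terms[p] = bernstein_terms(p * (1.0 - p), n, t)
```

At $p=\frac{1}{2}$ the first term coincides with the standard UCB term, while at $p=0.01$ the sum of both Bernstein-type terms is already well below it for these values of $t$ and $N$ .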

We take advantage of this property through the smoothness parameter $\gamma_{g}$ , which only takes into account the sensitivity of the function to parameter changes when the parameters are far away from $0$ or $1$ . This will allow us to derive a drastically tighter regret bound, in comparison to existing algorithms, when $\gamma_{g}=o(\sqrt{K}\gamma_{\infty})$ , as we establish in the following theorem:

Theorem 4.1

Let $r_{i}(A;p_{i})$ be monotonic Gini-smooth reward functions with smoothness parameters $\gamma_{\infty}$ and $\gamma_{g}$ , and let $r(A;p)=\sum_{i=1}^{M}w_{i}r_{i}(A;p_{i})$ be the reward function. For any $T\geq 1$ , the expected approximation regret of BC-UCB with ( $\alpha$ , $\beta$ )-approximation oracle is bounded by

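$$R(T)\leq\left[8640\gamma_{g}^{2}\bar{M}^{2}\sum_{j=1}^{L}\frac{1}{\Delta_{j,\min}}+340\gamma_{\infty}\bar{M}\sum_{j=1}^{L}\left(1+\log\frac{\Delta_{j,\max}}{\Delta_{j,\min}}\right)\right]\left\lceil\frac{\log K}{1.61}\right\rceil^{2}\log T+L\Delta_{\max}\left(1+M\frac{2\pi^{2}}{3}\right).$$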

We can also exploit the problem dependent regret bound to derive a problem independent bound, that is, a bound that holds for any gaps $\Delta_{j,\min}$ :

Corollary 4.2

Let $r_{i}(A;p_{i})$ be monotonic Gini-smooth reward functions with smoothness parameters $\gamma_{\infty}$ and $\gamma_{g}$ , and let $r(A;p)=\sum_{i=1}^{M}w_{i}r_{i}(A;p_{i})$ be the reward function. For any $T\geq 1$ , the expected approximation regret of BC-UCB with ( $\alpha$ , $\beta$ )-approximation oracle can be bounded by

[TABLE]

The proof of Theorem 4.1 is presented in the following section, along with a proof sketch for Corollary 4.2. The full proof of the corollary can be found in Appendix E.

We start by noting that we could avoid decomposing the reward into a sum of $M$ functions, but the regret bound in this case is slightly looser: the $\bar{M}^{2}$ factor in (4.1) is replaced by the larger factor $M\sum_{i}w_{i}^{2}$, and the $\log K$ factor is replaced by $\log KM$. We believe that the logarithmic factor $\log K$ is a technical artefact, but leave its removal for future work. We also remark that the second term of the problem dependent regret bounds is negligible for small gaps and can always be bounded using the inequality $1+\log x\leq x$, which yields a regret of $O\left(\left(\bar{M}^{2}\gamma_{g}^{2}+\bar{M}\gamma_{\infty}\Delta_{\max}\right)\left\lceil\frac{\log K}{1.61}\right\rceil^{2}\sum_{j=1}^{L}\frac{\log T}{\Delta_{j,\min}}\right)$.

To the best of our knowledge, the closest bound to ours appears in (Wang and Chen, 2017). From their perspective, our bandit problem has $ML$ base arms and a batch size of $MK$ with gaps $\Delta_{ij,\min}=\Delta_{j,\min}$ . Their $L_{1}$ smoothness parameter equals $B=\gamma_{\infty}\max_{i}w_{i}$ . Substituting into their bound yields a regret of $O\left(\sum_{i,j}\frac{\left(\gamma_{\infty}\max_{i}w_{i}\right)^{2}MK\log T}{\Delta_{ij,\min}}\right)=O\left(\gamma_{\infty}^{2}K\sum_{j}\frac{\left(\max_{i}w_{i}\right)^{2}M^{2}\log T}{\Delta_{j,\min}}\right)$ . Since $\gamma_{g}\leq\frac{1}{2}\sqrt{K}\gamma_{\infty}$ , $\Delta_{\max}\leq\bar{M}K\gamma_{\infty}$ and $\bar{M}\leq M\max_{i}w_{i}$ , our regret is tighter when $\gamma_{g}$ is not trivial (that is, $\gamma_{g}=o(\sqrt{K}\gamma_{\infty})$ ), up to a factor of $\log^{2}K$ .

As an alternative to our approach, it is possible to analyze BC-UCB based only on the $\gamma_{\infty}$ smoothness. In this case, the analysis will be very similar to that of Kveton et al. (2015c). This will yield a dominant term that does not depend on the logarithmic factor $\log K$, and declines with the variance of the arms. However, it can lead to dramatically worse bounds when $\gamma_{g}$ is small, so we decided not to pursue this path. Nonetheless, this approach is still worth mentioning when comparing to other algorithms, since when its bounds are combined with ours, we can conclude that the regret of BC-UCB is always tighter than the regret obtained in (Wang and Chen, 2017).

On a final note, we return to the PMC bandit problem. In this case, $\gamma_{g}$ and $\gamma_{\infty}$ are $O(1)$ and $\Delta_{\max}$ is $O(\bar{M})$, so the regret is $O\left(\bar{M}^{2}\log^{2}K\sum_{j=1}^{L}\frac{\log T}{\Delta_{j,\min}}\right)$, which is tighter by a factor of $K$ than existing results (Wang and Chen, 2017). We will later show that this bound is tight, up to logarithmic factors in $K$.
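To illustrate why $\gamma_{g}$ can stay $O(1)$ for coverage-type rewards, the following Python sketch evaluates the Gini-weighted quantity $\sqrt{\sum_{i}x_{i}(1-x_{i})\left(\partial f/\partial x_{i}\right)^{2}}$ numerically. It assumes a single-item coverage reward of the form $f(A;x)=1-\prod_{j\in A}(1-x_{j})$, whose partial derivatives are at most $1$; this reward form, the random search and all names are assumptions made for illustration, not the paper's definitions.

```python
import math
import random


def pmc_gradient(x):
    """Partial derivatives of f(x) = 1 - prod_j (1 - x_j): df/dx_i = prod_{j != i} (1 - x_j)."""
    grads = []
    for i in range(len(x)):
        prod = 1.0
        for j, xj in enumerate(x):
            if j != i:
                prod *= 1.0 - xj
        grads.append(prod)
    return grads


def gini_weighted_norm(x):
    """sqrt( sum_i x_i (1 - x_i) (df/dx_i)^2 ) for the assumed coverage reward."""
    grads = pmc_gradient(x)
    return math.sqrt(sum(xi * (1.0 - xi) * gi * gi for xi, gi in zip(x, grads)))


rng = random.Random(0)
for K in (2, 8, 32, 128):
    best = 0.0
    for _ in range(200):
        # Random search over points with small coordinates, where the value tends to be largest.
        scale = 4.0 * rng.random() / K
        x = [min(1.0, scale * rng.random()) for _ in range(K)]
        best = max(best, gini_weighted_norm(x))
    print(f"K={K:4d}  largest value found={best:.3f}  0.5*sqrt(K)={0.5 * math.sqrt(K):.3f}")
```

For this reward the sum under the square root is at most the probability that exactly one of $K$ independent events with probabilities $x_{j}$ occurs, so the value never exceeds $1$, whereas the generic bound $\frac{1}{2}\sqrt{K}\gamma_{\infty}$ grows with the batch size.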

5 Proving the Regret Upper Bounds

We start the proof by simplifying the first term of the UCB index. To this end, recall Bernstein’s inequality:

Lemma 5.1 (Bernstein’s Inequality).

Let $\left\{X_{i}\right\}_{i=1}^{n}$ be independent random variables in $\left[0,1\right]$ with mean $\mathbb{E}\left[X_{i}\right]=p$ and variance $\mathrm{Var}\left\{X\right\}$ . Then, with probability $1-\delta$ :

[TABLE]

Next, let $V=\frac{1}{n}\sum_{i=1}^{n}\left(X_{i}-\left(\frac{1}{n}\sum_{i=1}^{n}X_{i}\right)\right)^{2}$ be the empirical variance of independent random variables $X_{i}\in\left[0,1\right]$ with mean $p$ , and note that $V=\frac{1}{n}\sum_{i=1}^{n}\left(X_{i}-p\right)^{2}-\left(\frac{1}{n}\sum_{i=1}^{n}X_{i}-p\right)^{2}\leq\frac{1}{n}\sum_{i=1}^{n}\left(X_{i}-p\right)^{2}$ . We can thus define the independent random variables $Y_{i}=\left(X_{i}-p\right)^{2}$ and bound $\frac{1}{n}\sum_{i=1}^{n}Y_{i}$ instead. It is clear that $0\leq Y_{i}\leq 1$ , and their expectations can also be bounded by

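$\mathbb{E}\left[Y_{i}\right]=\mathbb{E}\left[\left(X_{i}-p\right)^{2}\right]=\mathbb{E}\left[X_{i}^{2}\right]-p^{2}\leq\mathbb{E}\left[X_{i}\right]-p^{2}=p(1-p).$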

The variance of $Y_{i}$ can be similarly bounded by $\mathrm{Var}\left\{Y_{i}\right\}\leq\mathbb{E}\left[Y_{i}^{2}\right]\leq 1\cdot\mathbb{E}\left[Y_{i}\right]\leq p(1-p)$ . Applying Bernstein’s Inequality (6) on $Y_{i}$ can now give a high probability bound on the empirical variance: with probability $1-\delta$ ,

[TABLE]

where $(*)$ utilizes the relation $ab\leq\frac{1}{2}\left(a^{2}+b^{2}\right)$ with $a=\sqrt{2p(1-p)}$ and $b=\sqrt{\frac{\log 1/\delta}{n}}$. An important (informal) conclusion of inequality (5) is that if the event under which the inequality holds for $\hat{V}_{ij}$ does occur, then the confidence interval around $\hat{p}_{ij}(t-1)$ can be bounded by:

[TABLE]
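The $(*)$ step in the bound on the empirical variance can be spelled out as follows. This is only a sketch: it assumes that Lemma 5.1 is used in the common one-sided form $\frac{1}{n}\sum_{i}X_{i}-p\leq\sqrt{\frac{2\mathrm{Var}\left\{X\right\}\log(1/\delta)}{n}}+\frac{\log(1/\delta)}{3n}$, so the constant of the last term may differ from the one in the paper.

```latex
\begin{align*}
V \;\leq\; \frac{1}{n}\sum_{i=1}^{n}Y_{i}
  &\leq \mathbb{E}\left[Y_{i}\right]
     + \sqrt{\frac{2\mathrm{Var}\left\{Y_{i}\right\}\log(1/\delta)}{n}}
     + \frac{\log(1/\delta)}{3n} \\
  &\leq p(1-p) + \sqrt{2p(1-p)}\,\sqrt{\frac{\log(1/\delta)}{n}}
     + \frac{\log(1/\delta)}{3n} \\
  &\overset{(*)}{\leq} p(1-p) + \left(p(1-p) + \frac{\log(1/\delta)}{2n}\right)
     + \frac{\log(1/\delta)}{3n} \\
  &= 2p(1-p) + \frac{5\log(1/\delta)}{6n}.
\end{align*}
```

The second line uses $\mathbb{E}\left[Y_{i}\right]\leq p(1-p)$ and $\mathrm{Var}\left\{Y_{i}\right\}\leq p(1-p)$, and $(*)$ applies $ab\leq\frac{1}{2}\left(a^{2}+b^{2}\right)$ with $a=\sqrt{2p(1-p)}$ and $b=\sqrt{\frac{\log 1/\delta}{n}}$. The resulting bound has a variance-like term $p(1-p)$ plus a term of order $\log(1/\delta)/n$.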

For $\delta\approx t^{-3}$, the confidence interval is of the form $u_{i}\sqrt{p_{i}(1-p_{i})}+v_{i}$, for $u_{i}=O\left(\sqrt{\frac{\log t}{N_{j}(t-1)}}\right)$ and $v_{i}=O\left(\frac{\log t}{N_{j}(t-1)}\right)$. We should therefore analyze how parameter perturbations of this kind affect the reward function. To do so, we take advantage of the Gini-weighted smoothness, as stated in the following lemma (see Appendix A for proof):

Lemma 5.2

Let $f(A;x)$ be a monotonic Gini-smooth function, with smoothness parameters $\gamma_{\infty}$ and $\gamma_{g}$ . Also let $u,v\in\mathbb{R}^{L}$ be some constant vectors such that $u_{i}\geq 0$ and $v_{i}\geq 0$ .

For any $x,\epsilon\in\left[0,1\right]^{L}$ such that $\epsilon_{i}\leq\min\left\{u_{i}\sqrt{x_{i}(1-x_{i})}+v_{i},1-x_{i}\right\}$ , the sensitivity of $f(A;x)$ to parameter change can be bounded by

[TABLE]

Next, define the low probability events under which some of the variables are not concentrated in their confidence intervals:

[TABLE]

Also denote $\mathbb{H}_{t}=\mathbb{H}_{t}^{V}\cup\mathbb{H}_{t}^{p}$ . Intuitively, even though the regret may be large under $\mathbb{H}_{t}$ , the event cannot occur many times, and we can therefore analyze the regret under the assumption that $\mathbb{H}_{t}$ does not occur. We can then bound the regret similarly to inequality (5), combined with Lemma 5.2. Formally, we decompose the regret as follows:

Lemma 5.3

Let $r_{i}(A;p)$ be Gini-smooth functions with parameters $\gamma_{g}$ , $\gamma_{\infty}$ , and define

[TABLE]

for $c_{1}=12\sqrt{6}\gamma_{g}$ and $c_{2}=34\gamma_{\infty}$ . The regret of Algorithm 1, when used with $(\alpha,\beta)$ approximation oracle, can be bounded by

[TABLE]

The proof is in Appendix B. It is interesting to observe that $c_{t}(A_{t})$ is very similar in its form to the confidence interval for the linear combinatorial problem when parameters are independent (Combes et al., 2015b). We have achieved this form of confidence without any independence assumptions, only on the basis of the properties of the reward function. We can therefore adapt the proofs of (Kveton et al., 2015c; Degenne and Perchet, 2016) to derive a problem dependent regret bound. Since we are interested in bounding the regret of the first term, and due to the initial sampling stage, we assume from this point onward that all of the arms were sampled at least once.

Define $\left\{a_{k}\right\}_{k=0}^{\infty}$ and $\left\{b_{k}\right\}_{k=0}^{\infty}$, two positive decreasing sequences that converge to $0$ and will be determined later, with $b_{0}=1$. Also define the set $S_{t}^{k}=\left\{j\in A_{t}:N_{j}(t-1)\leq a_{k}\frac{g(K,\Delta_{A_{t}})\log t}{\Delta_{A_{t}}^{2}}\right\}$ for some function $g(K,\Delta_{A_{t}})$ that will also be determined later, with $S_{t}^{0}=A_{t}$. Intuitively, $S_{t}^{k}$ is the set of arms that were chosen on round $t$ and were not sampled enough times. Denote the events in which $S_{t}^{k}$ contains at least $Kb_{k}$ elements, but $S_{t}^{n}$ contains fewer than $Kb_{n}$ elements for any $n<k$, by $\mathbb{G}_{t}^{k}=\left\{\left\{\left\lvert S_{t}^{k}\right\rvert\geq Kb_{k}\right\}\cap\left\{\forall n<k,\left\lvert S_{t}^{n}\right\rvert<Kb_{n}\right\}\right\}$, and let $\mathbb{G}_{t}=\cup_{k=1}^{\infty}\mathbb{G}_{t}^{k}$. We show that when $\Delta_{A_{t}}\leq\bar{M}c_{t}(A_{t})$, then $\mathbb{G}_{t}$ must occur, for the appropriate $g(K,\Delta_{A_{t}})$. To do so, we first cite a variant of Lemmas 7 and 8 of (Degenne and Perchet, 2016), with $e=1$, $\Gamma^{(ii)}=1$ and $f(t)=\log t/8$:

Lemma 5.4.

Define $\ell=\left(\sum_{k=1}^{k_{0}}\frac{b_{k-1}-b_{k}}{a_{k}}+\frac{b_{k_{0}}}{a_{k_{0}}}\right)$ , and let $k_{0}$ be the smallest index such that $b_{k_{0}}\leq 1/K$ . Then $\mathbb{G}_{t}=\cup_{k=1}^{k_{0}}\mathbb{G}_{t}^{k}$ . Also, under $\bar{\mathbb{G}}_{t}$ , the following inequality holds:

[TABLE]

Using this lemma, we can now prove that $\mathbb{G}_{t}$ must occur (see Appendix C for proof):

Lemma 5.5

If $g(K,\Delta_{A_{t}})=\left(864\gamma_{g}^{2}+68\frac{\gamma_{\infty}\Delta_{A_{t}}}{\bar{M}}\right)\bar{M}^{2}K\ell$ , and if $\Delta_{A_{t}}\leq\bar{M}c_{t}(A_{t})$ , then $\mathbb{G}_{t}$ occurs.

A direct result of this lemma is that if $\Delta_{A_{t}}\leq\bar{M}c_{t}(A_{t})$ , at least one event $\left\{\mathbb{G}_{t}^{k}\right\}_{k=1}^{k_{0}}$ occurs. This allows us to further decompose the first term of (13), and achieve the final result of Theorem 4.1:

Lemma 5.6

The regret from the event $\Delta_{A_{t}}\leq\bar{M}c_{t}(A_{t})$ can be bounded by

[TABLE]

and if $a_{k}=b_{k}=0.2^{k}$ ,

[TABLE]

The proof is in Appendix D. Substituting (5.6) into (13) concludes the proof of Theorem 4.1. $\hfill\blacksquare$
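To make the sets $S_{t}^{k}$ and the events $\mathbb{G}_{t}^{k}$ used in this proof more concrete, the following Python sketch computes them for a single round with $a_{k}=b_{k}=0.2^{k}$. The function names, the dictionary of sample counts and the numeric value passed for $g(K,\Delta_{A_{t}})$ are hypothetical, and the condition on earlier indices is checked only for $1\leq n<k$, which is an assumption about how the definition is meant to be read.

```python
import math


def smallest_k0(K, base=0.2):
    """Smallest index k with b_k = base**k <= 1/K (small tolerance for round-off)."""
    k = 0
    while base ** k > 1.0 / K + 1e-12:
        k += 1
    return k


def undersampled_arms(counts, k, gap, g_value, t, base=0.2):
    """S_t^k: chosen arms with N_j(t-1) <= a_k * g * log(t) / gap^2, where a_k = base**k."""
    threshold = (base ** k) * g_value * math.log(t) / gap ** 2
    return {arm for arm, n in counts.items() if n <= threshold}


def first_triggered_event(counts, K, gap, g_value, t, base=0.2):
    """Return the index k in 1..k0 for which G_t^k occurs, or None if no such event occurs."""
    for k in range(1, smallest_k0(K, base) + 1):
        s_k = undersampled_arms(counts, k, gap, g_value, t, base)
        if len(s_k) >= K * base ** k:
            # All earlier indices failed the size test, so this is the first (and only) event.
            return k
    return None


# Hypothetical round: arm 0 was sampled once, the other chosen arms many times.
counts = {0: 1, 1: 1000, 2: 1000, 3: 1000}
print(first_triggered_event(counts, K=4, gap=1.0, g_value=10.0, t=100))
```

With $b_{k}=0.2^{k}$, the index $k_{0}$ equals $\left\lceil\log K/\log 5\right\rceil$, and $\log 5\approx 1.61$, which is consistent with the $\left\lceil\frac{\log K}{1.61}\right\rceil$ factor that appears in the regret bounds.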

The proof for the problem independent upper bound of Corollary 4.2 is a direct result of Lemma 5.6. Specifically, the bound can be achieved by decomposing the regret according to Lemma 5.3, and then dividing the regret into large gaps ( $\Delta_{A_{t}}\geq\Delta$ ) and small gaps ( $\Delta_{A_{t}}\leq\Delta$ ), according to some fixed threshold $\Delta$ . Large gaps are bounded according to Lemma 5.6 with $\Delta_{j,\min}\geq\Delta$ , and small gaps are bounded trivially by $\Delta T$ . The final bound is achieved by optimizing the threshold $\Delta$ . The full proof can be found in Appendix E.
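The threshold optimization can be sketched numerically in a simplified setting. The Python snippet below keeps only a single leading term, so the large-gap regret is modeled as $c\log T/\Delta$ for a hypothetical constant $c$ and the small-gap regret as $\Delta T$; the actual bound in Appendix E has an additional lower-order term, so this is an illustration of the balancing argument rather than the paper's computation.

```python
import math


def split_bound(delta, c, T):
    """Simplified bound: c*log(T)/delta for large gaps plus delta*T for small gaps."""
    return c * math.log(T) / delta + delta * T


def optimal_threshold(c, T):
    """Threshold that balances the two terms: delta = sqrt(c * log(T) / T)."""
    return math.sqrt(c * math.log(T) / T)


c, T = 50.0, 10 ** 6  # hypothetical constant and horizon
delta_star = optimal_threshold(c, T)
closed_form = 2.0 * math.sqrt(c * T * math.log(T))

# Compare against a coarse grid search around the balanced threshold.
grid_min = min(split_bound(delta_star * (0.5 + 0.01 * i), c, T) for i in range(101))
print(f"balanced threshold: {delta_star:.5f}")
print(f"bound at threshold: {split_bound(delta_star, c, T):.1f}")
print(f"closed form 2*sqrt(c*T*log T): {closed_form:.1f}")
print(f"grid minimum: {grid_min:.1f}")
```

In this simplified model the balanced threshold gives a bound of order $\sqrt{cT\log T}$, which is the typical shape of a problem independent bound.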

6 Lower Bounds

Although our algorithm enjoys improved upper bounds in comparison to CUCB, it is still interesting to see whether our results are tight in problems where previous bounds are loose. To demonstrate the tightness of our algorithm, we present an instance of the PMC bandit problem, on which our results are tight up to logarithmic factors. We assume throughout the rest of this section that the maximization oracle can output the optimal batch, i.e. has an approximation factor of $\alpha=1$ , with probability $\beta=1$ . This assumption allows us to focus on the difficulty of the problem due to parameter uncertainty and the semi-bandit feedback. We formally state the results in the following proposition:

Proposition 6.1.

There exists an instance of the PMC bandit problem with minimal gap $\Delta$ such that the expected regret of any consistent algorithm (an algorithm is called consistent if, for any problem and any $\alpha>0$, its regret is $R(t)=o(t^{\alpha})$ as $t\to\infty$) is lower bounded by

[TABLE]

Moreover, for any $T>0$ and $L>K$, there exists an instance of the PMC bandit problem such that the expected regret of any algorithm is lower bounded by

[TABLE]

Proof 6.2.

Consider the following PMC bandit problem: fix the first $K-1$ arms to be empty sets, that is, $X_{ij}=0$ for any $(i,j)\in\left[M\right]\times\left[K-1\right]$ , which also implies $p_{ij}=0$ . For the rest of the arms $j\in\left\{K,\dots,L\right\}$ , we force all of the items to be identically distributed, i.e., $X_{ij}=X_{j}$ and $p_{ij}=p_{j}$ . We also fix the action set to be $\mathcal{A}=\left\{A_{j}\right\}_{j=1}^{L-K+1}$ , where $A_{j}=\left\{1,\dots,K-1,j+K-1\right\}$ contains all of the $K-1$ arms with zero reward plus an additional arm $j+K-1$ . The expected reward when choosing an action $A_{j}$ is thus $\bar{M}p_{j}\triangleq\mu_{j}$ , and the problem is equivalent to a $\left(L-K+1\right)$ -armed bandit problem with arm distribution $Y_{j}=\sum_{i}w_{i}X_{j+K-1}=\bar{M}X_{j+K-1}$ . In order to prove the problem dependent regret bound, let $p_{j}=\frac{1}{2}-\epsilon,\forall j\in\left\{K,\dots,L-1\right\}$ and $p_{L}=\frac{1}{2}$ , or equivalently, $\mu_{j}=\frac{\bar{M}}{2}-\bar{M}\epsilon,\forall j\in\left[L-K\right]$ and $\mu_{L-K+1}=\frac{\bar{M}}{2}$ . The gaps of the problem are thus $\Delta_{j}=\bar{M}\epsilon\triangleq\Delta$ for any arm $j\in\left[L-K\right]$ . For any consistent MAB algorithm, the expected regret of the algorithm can be lower bounded by (Lai and Robbins, 1985):

[TABLE]

The KL divergence can be directly bounded by

[TABLE]

where $(*)$ is due to the relation $\mathrm{kl}(p,q)\leq\frac{\left(p-q\right)^{2}}{q(1-q)}$ (Csiszár and Talata, 2006). Substituting back into (19) yields the first part of the proposition:

[TABLE]

Due to the fact that all of the items have the same distribution, our problem is equivalent to an $(L-K+1)$ MAB problem with arms in $[0,1]$ scaled by a factor of $\bar{M}$ . Thus, the problem independent lower bound from (Auer et al., 2002b) can also be applied to this problem, and is scaled by the same factor $\bar{M}$ :

[TABLE]

We remark that throughout the proof, we assumed nothing about the correlation between different arms, and thus the bound cannot be improved by assuming this kind of independence. Nevertheless, if items are assumed to be independent, the lower bounds can be drastically improved. We will not tackle the problem of independent items in this paper, but leave it to future work.
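The problem dependent part of this argument can be checked numerically. The Python sketch below assumes that the $X_{j}$ are Bernoulli variables, so that the KL divergence between the scaled arms equals the Bernoulli divergence $\mathrm{kl}(p,q)$, and evaluates the coefficient of $\log T$ in the Lai and Robbins bound for the instance with $p_{j}=\frac{1}{2}-\epsilon$ and $p_{L}=\frac{1}{2}$. The function names and the numeric values are illustrative assumptions.

```python
import math


def bernoulli_kl(p, q):
    """KL divergence kl(p, q) between Bernoulli(p) and Bernoulli(q), for 0 < p, q < 1."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def kl_upper_bound(p, q):
    """The relation kl(p, q) <= (p - q)^2 / (q (1 - q)) used in the proof."""
    return (p - q) ** 2 / (q * (1.0 - q))


def lower_bound_coefficient(L, K, m_bar, eps):
    """Sum over the L - K suboptimal actions of gap / kl, with gap = m_bar * eps."""
    gap = m_bar * eps
    return (L - K) * gap / bernoulli_kl(0.5 - eps, 0.5)


L, K, m_bar, eps = 10, 3, 2.0, 0.1  # hypothetical instance parameters
gap = m_bar * eps
print("kl:", bernoulli_kl(0.5 - eps, 0.5), "upper bound:", kl_upper_bound(0.5 - eps, 0.5))
print("coefficient of log T:", lower_bound_coefficient(L, K, m_bar, eps))
print("(L-K) * m_bar^2 / (4 * gap):", (L - K) * m_bar ** 2 / (4.0 * gap))
```

Since $\mathrm{kl}\left(\frac{1}{2}-\epsilon,\frac{1}{2}\right)\leq 4\epsilon^{2}$ and $\Delta=\bar{M}\epsilon$, the coefficient is at least $\frac{(L-K)\bar{M}^{2}}{4\Delta}$ under these assumptions, with no dependence on the batch size beyond the number of available actions.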

7 Summary

In this work, we introduced BC-UCB, a CMAB algorithm that utilizes Bernstein-type confidence intervals. We defined a new smoothness criterion called Gini-weighted smoothness, and showed that it allows us to derive tighter regret bounds in many interesting problems. We also presented matching lower bounds for such a problem, and thus demonstrated the tightness of our algorithm.

We believe that our concepts can be applied to derive tighter bounds in many interesting settings. Specifically, our analysis includes the PMC bandit problem, that has a central place in the areas of ranked recommendations and influence maximization. We also believe that our results could be extended to the frameworks of cascading bandits and probabilistically triggered arms, but leave this for future work.

Another possible direction involves analyzing specific arm distributions - in our framework, we assumed nothing about the arms’ distribution except for its domain, and thus could only take into account very weak concentration properties, and specifically the concentration properties around the edges of the domain. If additional information about the distribution of the arms is present, it should be possible to leverage such information to design more sophisticated smoothness criteria. Such criteria could take into account tighter concentration properties of the arms’ distribution, and thus lead to tighter regret bounds.

Finally, we remark that the lower bounds could be derived only because we required the algorithm to support an arbitrary choice of action set $\mathcal{A}$. For the PMC problem, previous work shows that when $\mathcal{A}$ contains all subsets of a fixed size $K$, the regret bounds can be significantly improved (Kveton et al., 2015a). It would be interesting to see whether our technique can be used in this setting to extend these results and derive tighter bounds for any Gini-smooth function.

Acknowledgments

The authors thank Asaf Cassel and Esther Derman for their helpful comments on the manuscript.

Appendix A Proof of Lemma 5.2

See Lemma 5.2.

Proof A.1.

First, we define the functions

[TABLE]

We note that $g(z)$ is well defined for $z\in[0,1]$, since the function $1/\sqrt{y}$ is integrable near $y=0$ and $1/\sqrt{1-y}$ is continuous there, so the product $\frac{1}{\sqrt{y(1-y)}}$ is integrable near $y=0$. Symmetrically, the function is also integrable near $y=1$. $h(z)$ can be explicitly written as

[TABLE]

The two functions are closely related: observe that $h^{\prime}(z)\leq g^{\prime}(z)\leq\sqrt{2}h^{\prime}(z)$ with $g(0)=h(0)=0$ , and thus $h(z)\leq g(z)\leq\sqrt{2}h(z)$ . In addition, $g^{\prime}(z)>0$ , and therefore the function is strictly monotonically increasing, so its inverse $g^{-1}$ is well defined. Finally, the relation between the derivatives also yields the property $g(z_{2})-g(z_{1})\leq\sqrt{2}\left(h(z_{2})-h(z_{1})\right)$ for any $z_{1}\leq z_{2}$ in $[0,1]$ .
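The relation between the two functions can be checked numerically. The Python sketch below assumes that $g(z)=\int_{0}^{z}\frac{dy}{\sqrt{y(1-y)}}=2\arcsin\sqrt{z}$, which matches the integrability remark above, and that $h$ is the piecewise function with $h^{\prime}(z)=1/\sqrt{\min\{z,1-z\}}$; the latter is our reconstruction from the stated derivative relation and the case analysis that follows, not a formula quoted from the paper.

```python
import math


def g(z):
    """Closed form of the integral of 1/sqrt(y(1-y)) from 0 to z."""
    return 2.0 * math.asin(math.sqrt(z))


def h(z):
    """Assumed piecewise form with h'(z) = 1/sqrt(min(z, 1 - z)) and h(0) = 0."""
    if z <= 0.5:
        return 2.0 * math.sqrt(z)
    return 2.0 * math.sqrt(2.0) - 2.0 * math.sqrt(1.0 - z)


def sandwich_holds(z, tol=1e-12):
    """Check h(z) <= g(z) <= sqrt(2) * h(z) up to a numerical tolerance."""
    return h(z) - tol <= g(z) <= math.sqrt(2.0) * h(z) + tol


grid = [i / 1000.0 for i in range(1001)]
print("sandwich holds on the grid:", all(sandwich_holds(z) for z in grid))
print("g(1) =", g(1.0), " h(1) =", h(1.0), " sqrt(2)*h(1) =", math.sqrt(2.0) * h(1.0))
```

Under these assumptions $g(1)=\pi$ lies between $h(1)=2\sqrt{2}$ and $\sqrt{2}h(1)=4$, in line with the stated relation.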

Next, we bound $f(A;x+\delta)-f(A;x)$ , for any $\delta$ such that $\delta_{i}\leq\min\left\{u_{i}\sqrt{x_{i}(1-x_{i})},1-x_{i}\right\}$ . The bound can be achieved using the gradient theorem:

[TABLE]

for a parameterization $r(t)$ such that $r_{i}(0)=x_{i}$ and $r_{i}(1)=x_{i}+\delta_{i}$ . Specifically, we choose the parameterization to be

[TABLE]

and thus its gradient is

[TABLE]

Substituting back into (22) yields

[TABLE]

The first inequality is due to the Cauchy-Schwarz inequality and the second uses the definition of the Gini-weighted smoothness (1). All that is left is to bound the differences between different values of $g$. Assume w.l.o.g. that $x_{i}\neq 0,1$, since otherwise $\delta_{i}=0$ and the difference is $0$. To calculate the bound, we exploit the relation between $g$ and $h$, and calculate differences over $h$ in three different cases:

• If $x_{i},x_{i}+\delta_{i}\leq\frac{1}{2}$, then

[TABLE]

where the first inequality is since $\sqrt{1+a}\leq 1+\frac{a}{2}$ and the second is due to $\delta_{i}\leq u_{i}\sqrt{x_{i}}$ .

• If $x_{i},x_{i}+\delta_{i}\geq\frac{1}{2}$, then

[TABLE]

where the first inequality is due to $1-\sqrt{1-a}\leq a$ for $a\in[0,1]$ and the second is since $\delta_{i}\leq u_{i}\sqrt{1-x_{i}}$ .

• If $x_{i}\leq\frac{1}{2}$ and $x_{i}+\delta_{i}\geq\frac{1}{2}$, then

[TABLE]

where $(1)$ uses the bounds calculated for the first two cases and $(2)$ takes advantage of the relation $x_{i}\leq\frac{1}{2}$ for the first term and $x_{i}+\delta_{i}\geq\frac{1}{2}$ for the second one.

To summarize, for any $x_{i},x_{i}+\delta_{i}\in[0,1]$ , we have

[TABLE]

Substituting back into (23) leads to the bound

[TABLE]

Next, we aim to bound $f(A;x+\epsilon)-f(A;x+\delta)$ . Fortunately, this term can be easily bounded using the gradient theorem as follows:

[TABLE]

and combining (24) and (A.1) concludes the proof.

Appendix B Proof of Lemma 5.3

See Lemma 5.3.

Proof B.1.

By the definition of the expected regret,

[TABLE]

where $\mathbb{H}_{t}=\mathbb{H}_{t}^{p}\cup\mathbb{H}_{t}^{V}$ is the event in which the concentration bounds fail to hold at time $t$ and $\mathbb{H}_{t}^{p},\mathbb{H}_{t}^{V}$ are defined in Equations (10) and (11). $\bar{\mathbb{H}}_{t}$ is the complementary event to $\mathbb{H}_{t}$. The first term in (B.1) is the regret due to the initial sampling, which forces each base arm to be sampled at least once, and on which we apply the bound $\Delta_{A}\leq\Delta_{\max}$. For the rest of the terms, the initial sampling stage allows us to assume that all of the arms were sampled at least once. The second term can be bounded using Empirical Bernstein (Lemma 3.1) and Bernstein's inequality (5) as follows:

[TABLE]

Inequalities $(1)-(3)$ are simple union bounds: $(1)$ is a union bound on $\mathbb{H}_{t}^{p}$ and $\mathbb{H}_{t}^{V}$ and on different times. $(2)$ is on different functions $i\in\left[M\right]$ and arms $j\in\left[L\right]$ . $(3)$ is a union bound on different values of $N_{j}(t-1)$ , ranging in $s\in\left[t\right]$ . $(4)$ uses Empirical Bernstein (Lemma 3.1) with $x=3\log t$ for $\hat{p}_{ij}(t-1)$ and Bernstein’s inequality (5) for $\hat{V}_{ij}(t-1)$ with $\delta=t^{-3}$ .

Next, we bound the difference $\Delta_{A_{t}}$ under the event $\bar{\mathbb{H}}_{t}$ . Note that under $\bar{\mathbb{H}}_{t}^{p}$ , and for any $i,j$ , it holds that $p_{ij}\leq q_{ij}(t)$ and

[TABLE]

In addition, since $\bar{\mathbb{H}}_{t}^{V}$ also occurs, we can further bound $q_{ij}$ by

[TABLE]

where we used the inequality $\sqrt{x+y}\leq\sqrt{x}+\sqrt{y}$ .

Notice that the bound on $q_{ij}(t)$ is of the form $q_{ij}(t)-p_{ij}\leq u_{j}\sqrt{p_{ij}(1-p_{ij})}+v_{j}$, for $u_{j}=4\sqrt{3}\sqrt{\frac{\log t}{N_{j}(t-1)}}$ and $v_{j}=34\frac{\log t}{N_{j}(t-1)}$. We can thus apply Lemma 5.2 to bound the difference between the optimistic reward with the UCB index $q$ and the real reward parameters $p$. Combined with the properties of the approximation oracle, it is possible to estimate the error between the optimal action and the selected action, under the assumption that the oracle succeeded in its approximation (an event which we denote by $\mathcal{F}_{t}$):

[TABLE]

where $(1)$ is due to the monotonicity of $r$ and $p\leq q$, $(2)$ is from the properties of the approximation oracle and $(3)$ applies Lemma 5.2. Consequently, the sub-optimality gap of action $A_{t}$ under the events $\bar{\mathbb{H}}_{t}$ and $\mathcal{F}_{t}$ can be bounded by

[TABLE]

which in turn, allows us to bound the third term of (B.1) by

[TABLE]

Substituting equations (B.1) and (B.1) into (B.1) leads to the desired result:

[TABLE]

Appendix C Proof of Lemma 5.5

See Lemma 5.5.

Proof C.1.

Assume by contradiction that both $\Delta_{A_{t}}\leq\bar{M}c_{t}(A_{t})$ and $\bar{\mathbb{G}}_{t}$ occur. From the definition of $c_{t}(A_{t})$ and Lemma 5.4, we get

[TABLE]

where $(*)$ is due to $\sqrt{a(a+b)}\leq a+\frac{b}{2}$. We obtain $\Delta_{A_{t}}<\Delta_{A_{t}}$, which is a contradiction, and therefore, if $\Delta_{A_{t}}\leq\bar{M}c_{t}(A_{t})$, then $\bar{\mathbb{G}}_{t}$ cannot occur.

Appendix D Proof of Lemma 5.6

See Lemma 5.6.

Proof D.1.

Recall that $\mathbb{G}_{t}^{k}=\left\{\left\{\left\lvert S_{t}^{k}\right\rvert\geq Kb_{k}\right\}\cap\left\{\forall n<k,\left\lvert S_{t}^{n}\right\rvert<Kb_{n}\right\}\right\}$. Specifically, if $\mathbb{G}_{t}^{k}$ occurs, then at least $Kb_{k}$ arms were sampled at most $a_{k}\frac{g(K,\Delta_{A_{t}})\log t}{\Delta_{A_{t}}^{2}}$ times. Denote the sub-event of $\mathbb{G}_{t}^{k}$ in which $j$ is one of these arms by $\mathbb{G}_{t}^{k,j}=\mathbb{G}_{t}^{k}\cap\left\{j\in A_{t},N_{j}(t-1)\leq a_{k}\frac{g(K,\Delta_{A_{t}})\log t}{\Delta_{A_{t}}^{2}}\right\}$. Therefore, if $\mathbb{G}_{t}^{k}$ occurs, there are at least $Kb_{k}$ arms for which $\mathbb{G}_{t}^{k,j}$ occurs, and

[TABLE]

Next, we apply Lemma 5.5, and bound the regret from the event $\Delta_{A_{t}}\leq\bar{M}c_{t}(A_{t})$ by

[TABLE]

Denote the number of possible positive gaps of batches that contain arm $j$ by $D_{j}$, and assume $\Delta_{j,1}>\dots>\Delta_{j,D_{j}}=\Delta_{j,\min}>0$ with $\Delta_{j,0}=\infty$. We decompose the regret as

[TABLE]

Denote $\theta_{k}=864\gamma_{g}^{2}\bar{M}^{2}K\ell a_{k}\log T$ and $\mu_{k}=68\gamma_{\infty}\bar{M}K\ell a_{k}\log T$ . Under these notations

[TABLE]

We added numbering through the first lines, as changes between consecutive lines can be hard to discern. In $(1)$, we bounded $a_{k}g(K,\Delta_{A_{t}})\log t\leq\theta_{k}+\mu_{k}\Delta_{A_{t}}$. In $(2)$, we divided the interval $N_{j}(t-1)\leq\frac{\theta_{k}}{\Delta_{j,n}^{2}}$ into non-overlapping sub-intervals and used a union bound. Next, we replaced $\Delta_{j,n}$ by the larger $\Delta_{j,p}$ in $(3)$, and extended the internal sum to $D_{j}$ in $(4)$. Finally, in $(5)$ we replaced the sum over specific positive gaps by the event $\Delta_{A_{t}}>0$ and in $(6)$ we bounded the maximal number of times that each of the indicators can be nonzero by the length of the interval. The rest of the lines involve reordering and bounding a summation by an integral. Substituting back into (D.1) gives the first desired result:

[TABLE]

Similarly to Appendix C of (Degenne and Perchet, 2016), choosing $a_{k}=b_{k}=b^{k}$ for $b=0.2$ yields

[TABLE]

which concludes the second part of the results.

Appendix E Proof of Corollary 4.2

See Corollary 4.2.

Proof E.1.

We start from Lemma 5.3, but divide the first term of the regret into large gaps $\Delta_{A_{t}}\geq\Delta$ and small gaps $\Delta_{A_{t}}\leq\Delta$ . We then use Lemma 5.6 only for the large gaps, and bound the regret of the smaller gaps by the trivial bound $\Delta T$ :

[TABLE]

Next, set $\Delta=\sqrt{\frac{u_{1}\log T}{T}}+\frac{u_{2}\log T}{T}$ , for $u_{1}=8640\gamma_{g}^{2}\bar{M}^{2}L\left\lceil\frac{\log K}{1.61}\right\rceil^{2}$ and $u_{2}=340\gamma_{\infty}\bar{M}L\left\lceil\frac{\log K}{1.61}\right\rceil^{2}$ . Substituting this value gives the desired results:

[TABLE]

Bibliography


Arora et al. (2011) Pallavi Arora, Csaba Szepesvári, and Rong Zheng. Sequential learning for optimal monitoring of multi-channel wireless networks. IEEE, 2011.
Audibert et al. (2009) Jean-Yves Audibert, Rémi Munos, and Csaba Szepesvári. Exploration–exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science, 410(19):1876–1902, 2009.
Auer et al. (2002a) Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002a.
Auer et al. (2002b) Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002b.
Bellemare et al. (2016) Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pages 1471–1479, 2016.
Bubeck et al. (2012) Sébastien Bubeck, Nicolo Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
Caro and Gallien (2007) Felipe Caro and Jérémie Gallien. Dynamic assortment with demand learning for seasonal consumer goods. Management Science, 53(2):276–292, 2007.
Carpentier and Valko (2016) Alexandra Carpentier and Michal Valko. Revealing graph bandits for maximizing local influence. In International Conference on Artificial Intelligence and Statistics, 2016.
