Thompson Sampling for Combinatorial Network Optimization in Unknown Environments
Alihan Hüyük, Cem Tekin

TL;DR
This paper introduces a Bayesian algorithm called Combinatorial Thompson Sampling (CTS) for solving complex network optimization problems in unknown environments, providing theoretical regret bounds and demonstrating superior empirical performance.
Contribution
It extends combinatorial bandit algorithms to unknown settings using CTS, with new regret bounds and practical advantages over existing UCB-based methods.
Findings
CTS achieves near-optimal regret bounds under Lipschitz conditions.
CTS outperforms UCB-based algorithms by at least an order of magnitude in the majority of simulated cases.
Theoretical analysis covers various reward and triggering probability conditions.
Abstract
Influence maximization, adaptive routing, and dynamic spectrum allocation all require choosing the right action from a large set of alternatives. Thanks to the advances in combinatorial optimization, these and many similar problems can be efficiently solved given an environment with known stochasticity. In this paper, we take this one step further and focus on combinatorial optimization in unknown environments. We consider a very general learning framework called combinatorial multi-armed bandit with probabilistically triggered arms and a very powerful Bayesian algorithm called Combinatorial Thompson Sampling (CTS). Under the semi-bandit feedback model and assuming access to an oracle without knowing the expected base arm outcomes beforehand, we show that when the expected reward is Lipschitz continuous in the expected base arm outcomes CTS achieves $O(\sum_{i =1}^m\log…
| | | | CTS | | CUCB | | CascadeUCB1 | | CascadeKL-UCB | | TS-Cascade | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16 | 2 | 0.15 | 155.4 | 14.1 | 1284.1 | 52.4 | 1300.6 | 46.8 | 360.6 | 23.4 | 381.1 | 16.8 |
| 16 | 4 | 0.15 | 103.2 | 9.0 | 998.9 | 33.2 | 993.6 | 32.8 | 267.3 | 20.6 | 281.0 | 11.8 |
| 16 | 8 | 0.15 | 52.1 | 9.8 | 549.5 | 16.8 | 546.4 | 11.7 | 150.3 | 15.6 | 137.9 | 8.8 |
| 32 | 2 | 0.15 | 321.4 | 18.9 | 2718.8 | 61.2 | 2676.4 | 59.4 | 749.2 | 34.2 | 752.9 | 49.9 |
| 32 | 4 | 0.15 | 252.2 | 17.0 | 2227.0 | 55.4 | 2232.1 | 46.6 | 617.4 | 39.9 | 612.3 | 15.2 |
| 32 | 8 | 0.15 | 155.4 | 25.7 | 1531.0 | 21.9 | 1525.4 | 30.0 | 420.6 | 27.5 | 385.0 | 16.3 |
| 16 | 2 | 0.075 | 276.9 | 50.7 | 2057.6 | 79.6 | 2065.4 | 87.4 | 709.0 | 60.4 | 688.3 | 78.5 |
| 16 | 4 | 0.075 | 205.4 | 25.7 | 1496.5 | 65.2 | 1512.4 | 87.0 | 546.3 | 53.5 | 557.9 | 45.0 |
| 16 | 8 | 0.075 | 113.1 | 40.4 | 719.4 | 53.7 | 717.5 | 44.2 | 266.1 | 32.4 | 273.8 | 30.7 |
Thompson Sampling for Combinatorial Network Optimization in Unknown Environments
Alihan Hüyük, Cem Tekin
©2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
A. Hüyük was with the Department of Electrical and Electronics Engineering, Bilkent University, Ankara 06800, Turkey. He is now with the Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Cambridge CB3 0WA, UK (e-mail: [email protected]).
C. Tekin is with the Department of Electrical and Electronics Engineering, Bilkent University, Ankara 06830, Turkey (e-mail: [email protected]).
This work was supported in part by the Scientific and Technological Research Council of Turkey under Grant 215E342.
A preliminary version of this work was presented at AISTATS 2019 [1].
Abstract
Influence maximization, adaptive routing, and dynamic spectrum allocation all require choosing the right action from a large set of alternatives. Thanks to the advances in combinatorial optimization, these and many similar problems can be efficiently solved given an environment with known stochasticity. In this paper, we take this one step further and focus on combinatorial optimization in unknown environments. We consider a very general learning framework called combinatorial multi-armed bandit with probabilistically triggered arms and a very powerful Bayesian algorithm called Combinatorial Thompson Sampling (CTS). Under the semi-bandit feedback model and assuming access to an oracle without knowing the expected base arm outcomes beforehand, we show that when the expected reward is Lipschitz continuous in the expected base arm outcomes CTS achieves $O(\sum_{i=1}^m \log T / (p_i \Delta_i))$ regret and $O(\max\{\mathbb{E}[m\sqrt{T \log T / p^*}], \mathbb{E}[m^2/p^*]\})$ Bayesian regret, where $m$ denotes the number of base arms, $p_i$ and $\Delta_i$ denote the minimum non-zero triggering probability and the minimum suboptimality gap of base arm $i$ respectively, $T$ denotes the time horizon, and $p^*$ denotes the overall minimum non-zero triggering probability. We also show that when the expected reward satisfies the triggering probability modulated Lipschitz continuity, CTS achieves $O(\max\{m\sqrt{T \log T}, m^2\})$ Bayesian regret, and when triggering probabilities are non-zero for all base arms, CTS achieves $O(1/p^* \log(1/p^*))$ regret independent of the time horizon. Finally, we numerically compare CTS with algorithms based on upper confidence bounds in several networking problems and show that CTS outperforms these algorithms by at least an order of magnitude in the majority of cases.
Index Terms:
Combinatorial network optimization, multi-armed bandits, Thompson sampling, regret bounds, online learning.
I Introduction
How should an advertiser promote its products in a social network to reach a large set of users with a limited budget [2, 3]? How should a search engine suggest a ranked list of items to its users to maximize the click-through rate [4]? How should a base station allocate its users to channels to maximize the system throughput [5]? How should a mobile crowdsourcing platform dynamically assign available tasks to its workers to maximize the performance [6]? How can we identify the most reliable paths from source to destination under probabilistic link failures [7]? All of these problems require optimizing decisions among a vast set of alternatives. When the probabilistic description of the environment is fully specified, these problems—and many others—are solved using computationally efficient exact or approximation algorithms. In this paper, we focus on a much more difficult and realistic problem: How should we learn the optimal decisions in these complex problems via repeated interaction with the environment when the probabilistic description of the environment is unknown or only partially known?
It is natural to assume that the environment is unknown in many real-world applications. For instance, the advertiser may not know beforehand with what probability a user will influence its neighbor in a social network, or the search engine may not know with what probability a user will click the item shown at a given position. Moreover, decisions need to be made sequentially over time. For instance, the recommender system should show a new list of items to each arriving user and the base station should reallocate network resources when the channel conditions change or the users leave/enter the system. Obviously, future decisions of the learner must be guided by what it has observed thus far, i.e., the trajectory of actions, observations and rewards generated by the learner’s past decisions. Importantly, both the cumulative reward of the learner and what it has learned so far also depend on this trajectory. Therefore, the learner needs to balance how much it earns (by exploiting the actions it believes to be the best) and how much it learns (by exploring actions it does not know much about) in order to maximize its long-term performance. In this paper, we solve the formidable task of combinatorial optimization in unknown environments by modeling it as a combinatorial multi-armed bandit (MAB).
MAB problems have a long history as they exhibit the prime example of the tradeoff between exploration and exploitation [5, 8]. In the classical MAB, at each round the learner selects an arm (action) which yields a random reward that comes from an unknown distribution. The goal of the learner is to maximize its expected cumulative reward over all rounds by learning to select arms that yield high rewards. The learner’s performance is measured by its regret with respect to an oracle that always selects the arm with the highest expected reward. It is shown that when the arms’ rewards are independent, any uniformly good policy will incur at least logarithmic in time regret [9].
Several classes of policies are proposed for the learner to minimize its regret. One example is Thompson sampling [10, 11, 12], which is a Bayesian method. In this method, the learner keeps a posterior distribution over the expected arm rewards, and at each round takes a sample from each arm’s posterior, and then plays the arm with the largest sample. The reward observed from the played arm is then used to update its posterior. This sampling strategy allows the learner to frequently select the arms whose probabilities of being optimal are the highest based on their posteriors and to occasionally explore inferior arms to refine their posteriors. Policies at the other end of the spectrum use the principle of optimism in the face of uncertainty. Notable examples include policies based on upper confidence bound (UCB) indices [9, 13, 14], which are usually composed of the sample mean reward of an arm plus an exploration bonus that accounts for the uncertainty in the arm’s reward estimates. The strategy is to play the arm with the highest UCB index to trade off exploration and exploitation. Unlike Thompson sampling, the performance of this type of policy relies heavily on the confidence sets used to compute the exploration bonus [12]. This, together with the superior performance of Thompson sampling documented in numerous applications [15, 16], motivates us to consider a Thompson sampling based approach for our problem.
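The sampling loop described above can be sketched for the classical MAB with Bernoulli rewards. This is a minimal illustration and not the algorithm analyzed in this paper: the function name `thompson_sampling`, the uniform Beta(1, 1) prior, and the example means are assumptions made for the sketch.

```python
import random

def thompson_sampling(true_means, horizon, seed=0):
    """Beta-Bernoulli Thompson sampling on a classical MAB (illustrative sketch)."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    a = [1] * n_arms  # Beta(1, 1), i.e., a uniform prior on each arm's mean
    b = [1] * n_arms
    pulls = [0] * n_arms
    for _ in range(horizon):
        # Draw one sample per arm from its posterior and play the largest.
        samples = [rng.betavariate(a[i], b[i]) for i in range(n_arms)]
        arm = max(range(n_arms), key=lambda i: samples[i])
        reward = 1 if rng.random() < true_means[arm] else 0
        # Conjugate update: only the played arm's posterior changes.
        a[arm] += reward
        b[arm] += 1 - reward
        pulls[arm] += 1
    return pulls

pulls = thompson_sampling([0.2, 0.5, 0.8], horizon=2000)
```

With well-separated means such as these, most pulls should concentrate on the best arm, while the inferior arms are still sampled occasionally early on.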
Our main focus in this paper, i.e., combinatorial MAB (CMAB) [5, 17, 18, 19], is an extension of MAB where the learner selects a super arm at each round, which is defined to be a subset of the base arms. Then, the learner observes and collects the reward associated with the selected super arm, and also observes the outcomes of the base arms that are in the selected super arm. This type of feedback is also called semi-bandit feedback. For instance, when allocating users to orthogonal channels, each user-channel pair represents a base arm, the super arm is the set of user-channel pairs in the selected allocation, outcomes of base arms are indicators of successful packet transmissions and the reward is the number of packets successfully transmitted, i.e., sum of the indicators. While CMAB is general enough to model the aforementioned resource allocation problem, it does not fully capture the probabilistic structure of influence maximization, item list recommendation and reliable packet routing applications discussed in the preceding paragraphs. Therefore, we consider a generalized version of CMAB, called CMAB with probabilistically triggered arms (CMAB-PTA) [20], where the selected super arm probabilistically triggers a set of base arms, and the expected reward obtained in a round is a function of the set of triggered base arms and their expected outcomes. For instance, in influence maximization, each edge of the graph represents a base arm, the super arm is the selected seed set of nodes, outcomes of base arms are indicators of influence propagation on the corresponding edge (see, e.g., the independent cascade model [21]) and the reward is the number of influenced nodes, i.e., the set of nodes reachable from the seed set of nodes after the outcomes of base arms are realized. Triggered base arms in this case correspond to the set of edges that originate from all influenced nodes (including the seed set).
The regret for CMAB-PTA is defined as the difference between the expected cumulative reward of an oracle that always selects the super arm with the highest expected reward and that of the learner given a particular environment. Then, the Bayesian regret is the expected regret over all possible environments. Our goal is to design an algorithm that achieves the smallest rate of growth of the (Bayesian) regret over time, as this will ensure that the average reward of the learner will converge to the highest possible expected reward. To this end, we propose a Bayesian algorithm called combinatorial Thompson sampling (CTS) and analyze its regret assuming that the learner does not know the expected base arm outcomes beforehand but has access to an exact optimization oracle. Essentially, this oracle outputs an estimated optimal super arm given estimates of expected base arm outcomes as inputs. When the expected reward is Lipschitz continuous in the expected base arm outcomes, we show that CTS achieves $O(\sum_{i=1}^m \log T / (p_i \Delta_i))$ regret and $O(\max\{\mathbb{E}[m\sqrt{T \log T / p^*}], \mathbb{E}[m^2/p^*]\})$ Bayesian regret, where $m$ denotes the number of base arms, $p_i$ denotes the minimum non-zero triggering probability of base arm $i$, $\Delta_i$ denotes the minimum suboptimality gap of base arm $i$, $T$ denotes the time horizon, and $p^*$ denotes the overall minimum non-zero triggering probability. We also show that when the expected reward satisfies the triggering probability modulated (TPM) Lipschitz continuity in [22], which is a stronger assumption than the regular Lipschitz continuity yet still satisfied by the network optimization problems that we consider, CTS achieves $O(\max\{m\sqrt{T \log T}, m^2\})$ Bayesian regret independent of the triggering probabilities.
In addition to these more general cases, we also prove that when triggering probabilities are non-zero for all base arms, CTS achieves regret independent of the time horizon. This setting is of particular interest since it can model random behavior of users in a recommender system. For instance, a user may rate an item even when it is not in the list of recommended items as a result of an exogenous event (by rating the item on a partner website or by explicitly navigating to the item to rate it). Moreover, it is also closely linked to related work on online learning with probabilistic graph feedback [23, 24] and MAB with side observations [25]. Specifically, the models in [24] and [25] become special cases of our work when the graph is fully-connected for the one-step case and connected for the cascade case in [24] and when the probability of having an observation from any arm is non-zero in [25].
We complement our theoretical findings via extensive simulations in the following combinatorial network optimization problems: cascading bandits [4], probabilistic maximum coverage bandits [20] and influence maximization bandits [20]. For cascading bandits, we show that CTS, which uses a Beta posterior on base arms, significantly outperforms all competitor algorithms that use either UCB indices [4] or Thompson sampling with a Gaussian posterior [26]. The latter finding emphasizes the importance of working with the correct type of posterior. For probabilistic maximum coverage bandits, we show that CTS achieves an order of magnitude improvement over combinatorial UCB (CUCB) in [20] when both algorithms use an exact oracle. For influence maximization bandits, we show a similar result even when both algorithms use an approximation oracle instead of an exact oracle.
In summary, the main contribution of this paper is to analyze Thompson sampling for a very general combinatorial online learning framework that is comprehensive enough to model many different sequential decision-making applications defined over networks and show its optimality both theoretically and experimentally. The rest of the paper is organized as follows. Related work is given in Section II followed by problem formulation in Section III. Applications of CMAB-PTA are detailed in Section IV. Description of CTS and regret bounds are given in Section V. Proofs of the main results are explained in Sections VI and VII (some proofs are left to the supplemental document). Numerical results are presented in Section VIII and concluding remarks are given in Section IX.
II Related Work
CMAB has been studied under various assumptions on the relation between super arms, base arms and rewards [17]. Here, we mainly discuss the related works that assume semi-bandit feedback as we do in our work. A version of CMAB in which the expected reward of a super arm is a linear combination of the expected outcomes of the base arms in that super arm is studied in [5]. For this problem, it is shown in [18] that a combinatorial version of UCB1 in [14] achieves gap-dependent and gap-free (worst-case) regrets, where is the number of base arms, is the maximum number of base arms in a super arm, and is the gap between the expected reward of the optimal super arm and the second best super arm.
Later on, this setting is generalized to allow the expected reward of each super arm to be a more general function of the expected outcomes of the base arms that obeys certain monotonicity and bounded smoothness conditions [19]. The main challenge in the general case is that the optimization problem itself is NP-hard, but an approximately optimal solution can usually be computed efficiently for many special cases [27]. Therefore, it is assumed that the learner has access to an $(\alpha, \beta)$-approximation oracle, which can output a super arm that has expected reward that is at least $\alpha$ fraction of the optimal reward with probability at least $\beta$ when given the expected outcomes of the base arms. Thus, the regret is measured with respect to the $\alpha\beta$ fraction of the optimal reward, and it is proven that a combinatorial variant of UCB1, called CUCB, achieves $O(\sum_{i \in [m]} \log T / \Delta_i)$ regret when the bounded smoothness function is $f(x) = \gamma x$ for some $\gamma > 0$, where $\Delta_i$ is the minimum gap between the expected reward of the optimal super arm and the expected reward of any suboptimal super arm that contains base arm $i$.
Recently, it is shown in [28] that Thompson sampling can achieve regret for the general CMAB under a Lipschitz continuity assumption on the expected reward, given that the learner has access to an exact computation oracle, which outputs an optimal super arm when given the set of expected base arm outcomes. Moreover, it is also shown that in general the learner cannot guarantee sublinear regret when it only has access to an approximation oracle. Since the setting studied in this paper is a special case of ours, for our theoretical analysis we also assume that the learner uses an exact computation oracle. Nevertheless, we show in Section VIII that in practice CTS works well even when used with an approximation oracle. Another related work on CMAB [29] considers a new smoothness condition termed the Gini-weighted smoothness on the expected reward. For some problem types, this leads to regret bounds with better dependency on the sizes of super arms when compared with the common linear dependency of the existing algorithms.
Different from CMAB, papers on CMAB-PTA assume that the expected reward is a function of the expected outcomes of the triggered base arms, which is a random superset of base arms in the selected super arm. For this problem, it is shown in [20] that logarithmic regret is achievable when the expected reward function has the bounded smoothness property. However, this bound depends on $1/p^*$, where $p^*$ is the minimum non-zero triggering probability. Later, it is shown in [22] that under a stricter smoothness assumption on the expected reward function, called triggering probability modulated (TPM) bounded smoothness, it is possible to achieve regret that does not depend on $1/p^*$. It is also shown in this work that the dependence on $1/p^*$ is unavoidable for the general case. In another work [30], CMAB-PTA is considered for the case when the arm triggering probabilities are all positive, and it is shown that both CUCB and CTS achieve bounded regret. However, their bound has a much worse dependence on $p^*$ than our bound.
Apart from the works mentioned above, numerous other works also tackle related online learning problems. For instance, [31] considers matroid bandits, which is a special case of CMAB where the super arms are given as independent sets of a matroid with base arms being the elements of the ground set, and the expected reward of a super arm is the sum of the expected outcomes of the base arms in the super arm. Another example is cascading bandits [4], which is a special case of CMAB-PTA, where each super arm corresponds to a ranked list of items and base arms are triggered according to a user click model. A plethora of papers exist on UCB based policies for variants of these two models (see e.g., [32] for a variant of matroid bandits and [33] and [34] for variants of cascading bandits.) Apart from these, [26] considers Thompson sampling with Gaussian posterior for cascading bandits and proves that the worst-case regret is . We show in Section VIII that CTS significantly outperforms their algorithm for cascading bandits. We think that this is the case in practice because Beta posterior is more suitable in modeling click probabilities compared to Gaussian posterior.
Several other works focus on contextual CMAB [35, 34, 36], CMAB with adversarial rewards [37, 38] and CMAB with knapsacks [39]. Most recently there has been a surge of interest in analyzing CMAB under the full-bandit feedback setting, where the learner only observes the reward of the selected super arm but not the outcomes of the base arms [40, 41]. For instance, [41] uses a sampling method based on Hadamard matrices to estimate base arm rewards from full-bandit feedback. On the other hand, [42] considers a more general feedback model where the learner observes a linear combination of base arm’s rewards. Table I compares our work with the most closely related publications in terms of their assumptions and the regret bounds they show.
III Problem Formulation
CMAB-PTA is a decision-making problem where the learner interacts with its environment through $m$ base arms, indexed by the set $[m] := \{1, \ldots, m\}$, sequentially over rounds indexed by $t \in [T]$. In this paper, we consider the model introduced in [20] and borrow the notation from [28]. In this model, the following events take place in order in each round $t$:
- The learner selects a subset of base arms, denoted by $S(t)$, which is called a super arm.
- $S(t)$ causes some other base arms to probabilistically trigger based on a stochastic triggering process, which results in a set of triggered base arms $S'(t)$ that contains $S(t)$.
- The learner obtains a reward that depends on $S'(t)$ and observes the outcomes of the base arms in $S'(t)$.
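The three steps of a round can be sketched as a small simulator. Everything named here is an assumption made for illustration: Bernoulli outcomes, the `chain_trigger` rule (a triggered arm with outcome 1 triggers the next arm), and the additive `sum_reward` are hypothetical and are not taken from the paper.

```python
import random

def play_round(super_arm, means, trigger, reward, rng):
    """Simulate one CMAB-PTA round with Bernoulli base arm outcomes."""
    # The environment draws an outcome for every base arm.
    outcomes = [1 if rng.random() < mu else 0 for mu in means]
    # The selected super arm triggers a superset of itself.
    triggered = trigger(super_arm, outcomes)
    assert set(super_arm) <= triggered
    # Semi-bandit feedback: only outcomes of triggered arms are revealed.
    feedback = {i: outcomes[i] for i in triggered}
    return reward(triggered, outcomes), feedback

def chain_trigger(super_arm, outcomes):
    """Hypothetical rule: a triggered arm i with outcome 1 also triggers arm i + 1."""
    triggered = set(super_arm)
    for i in range(len(outcomes) - 1):
        if i in triggered and outcomes[i] == 1:
            triggered.add(i + 1)
    return triggered

def sum_reward(triggered, outcomes):
    """Hypothetical reward: sum of the outcomes of the triggered arms."""
    return sum(outcomes[i] for i in triggered)

rng = random.Random(1)
# Degenerate means make this example deterministic: outcomes are [1, 1, 0, 1].
r, feedback = play_round({0}, [1.0, 1.0, 0.0, 1.0], chain_trigger, sum_reward, rng)
```

In this example arm 3 is never observed even though its outcome is 1, because the chain stops at arm 2. This is the sense in which feedback is restricted to triggered arms.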
Next, we describe in detail the base arm outcomes, the super arms, the triggering process, the reward, the observation (feedback) model and the regret.
III-A Base Arm Outcomes
In each round $t$, the environment draws a random outcome vector $X(t) := (X_1(t), \ldots, X_m(t))$ from a probability distribution $D$ on $[0,1]^m$ independent of the previous rounds, where $X_i(t)$ represents the outcome of base arm $i$. $D$ is unknown by the learner, but it belongs to a class of distributions $\mathcal{D}$ which is known by the learner. We define the mean outcome (parameter) vector as $\mu := (\mu_1, \ldots, \mu_m)$, where $\mu_i := \mathbb{E}_{X \sim D}[X_i]$, and use $\mu_S$ to denote the projection of $\mu$ on $S$ for $S \subseteq [m]$.
Since CTS computes a posterior over , the following assumption is made to have an efficient and simple update of the posterior distribution.
**Assumption 1.**
The outcomes of all base arms are mutually independent, i.e., $D = D_1 \times D_2 \times \cdots \times D_m$.
Note that this independence assumption holds in many applications, including the influence maximization problem with independent cascade influence propagation model [21].
III-B Super Arms and the Triggering Process
The learner is allowed to select $S(t)$ from a subset of $2^{[m]}$ denoted by $\mathcal{I}$, which corresponds to the set of feasible super arms. Once $S(t)$ is selected, all base arms $i \in S(t)$ are immediately triggered. These arms can trigger other base arms that are not in $S(t)$, and those arms can further trigger other base arms, and so on. At the end, a random superset $S'(t)$ of $S(t)$ is formed that consists of all triggered base arms as a result of selecting $S(t)$. We have $S'(t) \sim D^{\text{trig}}(S(t), X(t))$, where $D^{\text{trig}}$ is the probabilistic triggering function that describes the triggering process. For instance, in the influence maximization problem, $D^{\text{trig}}$ may correspond to the independent cascade influence propagation model defined over a given influence graph [21]. The triggering process can also be described by a set of triggering probabilities. For each $i \in [m]$ and $S \in \mathcal{I}$, $p_i^{D', S}$ denotes the probability that base arm $i$ is triggered when super arm $S$ is selected given that the arm outcome distribution is $D' \in \mathcal{D}$. For simplicity, we let $p_i^S := p_i^{D, S}$, where $D$ is the true arm outcome distribution. Let $\tilde{S} := \{i \in [m] : p_i^{D', S} > 0 \text{ for some } D' \in \mathcal{D}\}$ be the set of all base arms that could potentially be triggered by super arm $S$, which is called the triggering set of $S$. We have that $S(t) \subseteq S'(t) \subseteq \tilde{S}(t) \subseteq [m]$. We define $p_i := \min_{S \in \mathcal{I} : p_i^S > 0} p_i^S$ as the minimum nonzero triggering probability of base arm $i$, and $p^* := \min_{i \in [m]} p_i$ as the minimum nonzero triggering probability.
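As a small worked example of these quantities, the sketch below computes exact triggering probabilities by enumerating all outcome vectors of independent Bernoulli base arms. The `chain_trigger` rule and the three-arm instance are hypothetical. The triggering set is also computed under one fixed outcome distribution, which is a simplification of the definition in the text.

```python
from itertools import product

def trigger_probs(super_arm, means, trigger):
    """Exact triggering probabilities for independent Bernoulli outcomes."""
    m = len(means)
    probs = [0.0] * m
    for outcomes in product([0, 1], repeat=m):
        # Probability of this outcome vector under independence.
        weight = 1.0
        for x, mu in zip(outcomes, means):
            weight *= mu if x == 1 else 1.0 - mu
        for i in trigger(super_arm, outcomes):
            probs[i] += weight
    return probs

def chain_trigger(super_arm, outcomes):
    """Hypothetical rule: a triggered arm i with outcome 1 also triggers arm i + 1."""
    triggered = set(super_arm)
    for i in range(len(outcomes) - 1):
        if i in triggered and outcomes[i] == 1:
            triggered.add(i + 1)
    return triggered

means = [0.5, 0.5, 0.5]
feasible = [(0,), (1,), (2,)]  # each super arm is a single seed arm
p = {S: trigger_probs(S, means, chain_trigger) for S in feasible}
# Triggering set of S: arms with positive triggering probability.
trig_set = {S: {i for i, q in enumerate(p[S]) if q > 0} for S in feasible}
# Minimum nonzero triggering probability per arm, and overall.
p_min = [min(p[S][i] for S in feasible if p[S][i] > 0) for i in range(len(means))]
p_star = min(p_min)
```

Here the arm at the end of the chain is reached from the first seed only with probability 0.25, which is the kind of small triggering probability that enters regret bounds through the minimum nonzero triggering probability.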
Before moving on, we would like to point out that the entire triggering process could have been represented by writing , where any possible dependence of the process on the outcome distribution would have been hidden inside . Instead, we chose to break down the triggering process into two stages: and , where and together are equivalent to . This is motivated by the prior knowledge of the learner. Note that, while the learner fully knows , it does not know anything about except the class of distributions that it belongs to, resulting in only a partial knowledge about .
III-C Reward
At the end of round $t$, the learner receives a reward that depends on the set of triggered arms $S'(t)$ and the outcome vector $X(t)$, which is denoted by $R(S'(t), X(t))$. For simplicity of notation, we also use $R(t)$ to denote the reward in round $t$. Note that whether a base arm is in the selected super arm or is triggered afterwards is not relevant in terms of the reward. We assume that the expected reward depends on the mean outcome vector in a specific way by making the following mild assumptions about the expected reward function. We note that these assumptions are standard in the CMAB literature [20, 28] and hold for the networking applications given in Section IV. The first assumption states that the expected reward is only a function of $S(t)$ and $\mu$.
**Assumption 2.**
The expected reward of super arm $S$ only depends on $S$ and the mean outcome vector $\mu$, i.e., there exists a function $r$ such that
$$\mathbb{E}[R(t)] = r(S(t), \mu).$$
In order to learn the best action, we require the estimate of the expected reward vector to converge to the true expected reward vector as the number of observations increases. This can be done when the expected reward varies smoothly with the mean outcome vector. Below, we state a form of continuity for the expected reward.
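**Assumption 3.**
(Lipschitz continuity) There exists a constant $B > 0$, such that for every super arm $S$ and every pair of mean outcome vectors $\mu$ and $\mu'$, we have
$$|r(S, \mu) - r(S, \mu')| \le B \, \|\mu_{\tilde{S}} - \mu'_{\tilde{S}}\|_1,$$
where $\|\cdot\|_1$ denotes the $\ell_1$ norm.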
In addition to Lipschitz continuity, we also consider the triggering probability modulated (TPM) Lipschitz continuity introduced in [22]. This is a stricter assumption than the regular Lipschitz continuity (it implies the regular Lipschitz continuity) but leads to tighter regret bounds in terms of the triggering probabilities. All of the networking applications considered in Section IV still satisfy the TPM Lipschitz continuity.
**Assumption 4.**
(Triggering probability modulated Lipschitz continuity) There exists a constant $B' > 0$, such that for every super arm $S$ and every pair of outcome distributions $D$ and $D'$ with mean outcome vectors $\mu$ and $\mu'$ respectively, we have
$$|r(S, \mu) - r(S, \mu')| \le B' \sum_{i \in [m]} p_i^{D, S} \, |\mu_i - \mu'_i|.$$
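The claim that the TPM condition is the stricter one can be checked in a few lines. The derivation below is a sketch under assumed forms of the two conditions, following the TPM literature [22]: the regular condition bounds the reward difference by $B \, \|\mu_{\tilde{S}} - \mu'_{\tilde{S}}\|_1$ over the triggering set $\tilde{S}$, and the TPM condition bounds it by $B' \sum_{i \in [m]} p_i^{D,S} |\mu_i - \mu'_i|$. The constant names $B$ and $B'$ are notation chosen here.

```latex
\begin{align*}
|r(S,\mu) - r(S,\mu')|
  &\le B' \sum_{i \in [m]} p_i^{D,S}\, |\mu_i - \mu'_i| \\
  &=  B' \sum_{i \in \tilde{S}} p_i^{D,S}\, |\mu_i - \mu'_i|
      && \text{(arms outside the triggering set have } p_i^{D,S} = 0\text{)} \\
  &\le B' \sum_{i \in \tilde{S}} |\mu_i - \mu'_i|
      && \text{(since } p_i^{D,S} \le 1\text{)} \\
  &=  B' \,\|\mu_{\tilde{S}} - \mu'_{\tilde{S}}\|_1 .
\end{align*}
```

Under these assumed forms, the regular Lipschitz condition therefore holds with $B = B'$ whenever the TPM condition holds. The TPM bound is tighter because arms that are rarely triggered contribute proportionally less.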
Finally, we require a monotonicity assumption in order to facilitate the UCB-based analysis that some of our results rely on, namely Theorems 2 and 3. Again, all of the networking applications considered in Section IV satisfy the following monotonicity assumption.
**Assumption 5.**
(Monotonicity) For every super arm $S$ and every pair of mean outcome vectors $\mu$ and $\mu'$, we have $r(S, \mu) \le r(S, \mu')$ if $\mu_i \le \mu'_i$ for all $i \in [m]$.
III-D Observation Model
We consider the semi-bandit feedback model, where at the end of round $t$, the learner observes the individual outcomes of the triggered arms, denoted by $Q(S'(t), X(t)) := X_{S'(t)}(t)$. Again, for simplicity of notation, we also use $Q(t)$ to denote the observation at the end of round $t$. Based on this, the only information available to the learner when choosing the super arm to select in round $t$ is its observation history, given as $\mathcal{F}_t := \{(S(\tau), Q(\tau)) : \tau \in [t-1]\}$.
In short, the tuple $([m], \mathcal{I}, D, D^{\text{trig}}, R)$ constitutes a CMAB-PTA problem instance. Among the elements of this tuple only $D$ is unknown to the learner.
III-E Regret
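In order to evaluate the performance of the learner, we define the set of optimal super arms given an $m$-dimensional parameter vector $\theta$ as $\text{OPT}(\theta) := \arg\max_{S \in \mathcal{I}} r(S, \theta)$. We use $\text{OPT} := \text{OPT}(\mu)$ to denote the set of optimal super arms given the true mean outcome vector $\mu$. Based on this, we let $S^*$ represent a specific super arm in $\arg\min_{S \in \text{OPT}} |\tilde{S}|$, which is the set of super arms that have triggering sets with minimum cardinality among all optimal super arms. We also let $\tilde{k}^* := |\tilde{S}^*|$ and $\tilde{k} := \max_{S \in \mathcal{I}} |\tilde{S}|$.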
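Next, we define the suboptimality gap due to selecting super arm $S$ as $\Delta_S := r(S^*, \mu) - r(S, \mu)$, the maximum suboptimality gap as $\Delta_{\max} := \max_{S \in \mathcal{I}} \Delta_S$, and the minimum suboptimality gap of base arm $i$ as $\Delta_i := \min_{S \in \mathcal{I} \setminus \text{OPT} : i \in \tilde{S}} \Delta_S$. (If there is no such super arm $S$, let $\Delta_i := \infty$.) The goal of the learner is to minimize the (expected) regret over the time horizon $T$, given by
$$\text{Reg}(T) := \mathbb{E}\left[\sum_{t=1}^{T} \Delta_{S(t)}\right].$$
In addition to the expected regret, we also consider the Bayesian regret, given by
$$\text{BayesReg}(T) := \mathbb{E}_{\mu}\left[\text{Reg}(T)\right],$$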
where the true mean outcome vector is viewed as a random variable. For simplicity, we will assume that has a uniform prior. However, this can easily be extended to any other Dirichlet prior simply by modifying the initial values of ’s and ’s in Algorithm 1, which determine the initial prior over the base arm outcomes. It is important to note here that asymptotic bounds on the Bayesian regret are essentially asymptotic (gap-free) bounds on the regret [12]. Formally, if for some non-negative function , then , that is there exists such that for all there exists such that for all .
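The gap and regret quantities of this subsection can be made concrete on a toy instance. The sketch below assumes a hypothetical setting with a linear reward, no extra triggering (so each triggering set equals the super arm itself), and a known mean vector. The helper names `gaps` and `regret` are illustrative, and `regret` sums the gaps of a given sequence of plays rather than taking an expectation.

```python
def gaps(feasible, trig_sets, reward, mu):
    """Suboptimality gaps for a small instance with a known mean vector."""
    values = {S: reward(S, mu) for S in feasible}
    best = max(values.values())
    gap = {S: best - v for S, v in values.items()}
    gap_max = max(gap.values())
    # Per-arm gap: smallest gap among suboptimal super arms that can trigger arm i.
    gap_arm = []
    for i in range(len(mu)):
        cands = [g for S, g in gap.items() if g > 0 and i in trig_sets[S]]
        gap_arm.append(min(cands) if cands else float("inf"))
    return gap, gap_max, gap_arm

def regret(plays, gap):
    """Sum of the suboptimality gaps of a sequence of selected super arms."""
    return sum(gap[S] for S in plays)

def linear(S, mu):
    """Hypothetical reward: sum of the mean outcomes of the selected arms."""
    return sum(mu[i] for i in S)

# Hypothetical instance: pick 2 of 3 arms.
mu = [0.9, 0.6, 0.5]
feasible = [(0, 1), (0, 2), (1, 2)]
trig_sets = {S: set(S) for S in feasible}
gap, gap_max, gap_arm = gaps(feasible, trig_sets, linear, mu)
total = regret([(1, 2), (0, 2), (0, 1), (0, 1)], gap)
```

In this instance the optimal super arm is `(0, 1)`, the per-arm gaps are roughly 0.1, 0.4 and 0.1, and the four plays shown accumulate a regret of about 0.5.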
IV Networking Applications
Here, we introduce three networking applications of CMAB-PTA: cascading bandits, probabilistic maximum coverage bandits, and influence maximization bandits. Numerical experiments given in Section VIII explore specific cases of all these problems that are generated either synthetically or from real-world data.
IV-A Cascading Bandits
IV-A1 Disjunctive Form for Search Engine Optimization
In the disjunctive form of the cascading bandit problem [4], a search engine outputs a list of web pages for each of its users among a set of web pages. Then, the users examine their respective lists, and click on the first page that they find attractive. If all pages fail to attract them, they do not click on any page. The goal of the search engine is to maximize the number of clicks.
This problem can be modeled as an instance of CMAB-PTA as follows. The base arms are page-user pairs , where and . User finds page attractive independent of other users and other pages with probability . The super arms are -many lists of -tuples, where each -tuple represents the list of pages shown to a user. Given a super arm , let denote the th page that is selected for user . Then, the triggering probabilities can be written as
[TABLE]
that is we observe feedback for a top selection immediately, and observe feedback for the other selections only if all previous selections fail to attract the user. The expected reward of playing super arm can be written as
[TABLE]
for which Assumptions 3 and 4 hold when and respectively.
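The verbal description of the disjunctive form translates directly into a short computation. The sketch below is an illustration under assumed encodings, not the paper's notation: `lists[j]` is the ordered list of pages shown to user `j`, and `attract[i][j]` is the probability that user `j` finds page `i` attractive. A page-user pair is observed with the probability that every earlier page in that user's list failed to attract, and a user clicks unless all of their pages fail.

```python
def disjunctive_cascade(lists, attract):
    """Triggering probabilities and expected clicks in the disjunctive cascade model."""
    trig = {}       # (page, user) -> probability that this pair's outcome is observed
    expected = 0.0  # expected number of users who click
    for j, pages in enumerate(lists):
        none_so_far = 1.0  # probability that no earlier page attracted user j
        for i in pages:
            trig[(i, j)] = none_so_far
            none_so_far *= 1.0 - attract[i][j]
        expected += 1.0 - none_so_far
    return trig, expected

attract = [[0.5, 0.2], [0.5, 0.4], [0.1, 0.9]]  # 3 pages, 2 users (made-up values)
trig, expected = disjunctive_cascade([[0, 1], [2, 0]], attract)
```

For the made-up values above, the second page of user 0 is observed with probability 0.5, and the expected number of clicks is about 1.67.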
IV-A2 Conjunctive Form for Network Routing Reliability
One can also consider the conjunctive analogue of the problem, where the goal of the search engine is to—somewhat peculiarly—maximize the number of users with lists that do not contain any unattractive page, and when examining their lists, users provide feedback by reporting the first unattractive page. Formally,
[TABLE]
and
[TABLE]
This conjunctive form fits particularly well to the network reliability problem [7], where we are interested in finding the most reliable routing path in a communication network. We consider routing paths as super arms, being the set of all possible routing paths. Each routing path consists of a variable number of ordered links that correspond to the base arms. We denote the index of th link in routing path as and the length of the path as . Each link in a routing path can fail independently from all other links with probability . Then, the probabilistic reliability of a routing path is defined as the probability of successful operation with no link in the path failing.
Since we can only observe whether a link has failed or not up to the first link that has failed, the triggering probability of link when routing path is selected can be written as
[TABLE]
and the probabilistic reliability of routing path —in other words, the expected reward—becomes
[TABLE]
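As a rough illustration of the routing interpretation, the sketch below computes the reliability of a path from per-link failure probabilities and picks the most reliable path from a small explicit set. The brute-force search merely stands in for an exact oracle; the path names, the failure probabilities, and all function names are assumptions made for this example.

```python
from math import prod

def path_reliability(link_failure_probs):
    """Probability that no link on the path fails (links fail independently)."""
    return prod(1.0 - q for q in link_failure_probs)

def link_observation_probs(link_failure_probs):
    """Link k is observed only if every earlier link on the path worked."""
    probs, all_previous_ok = [], 1.0
    for q in link_failure_probs:
        probs.append(all_previous_ok)
        all_previous_ok *= 1.0 - q
    return probs

def most_reliable_path(candidate_paths):
    """Brute-force stand-in for an exact oracle over a small, explicit path set."""
    return max(candidate_paths, key=lambda name: path_reliability(candidate_paths[name]))

# Hypothetical example: a short path with poor links and a long path with good links.
paths = {"short": [0.10, 0.30], "long": [0.05, 0.05, 0.05]}
best = most_reliable_path(paths)
print(best, path_reliability(paths[best]))
```

In this toy instance the longer path wins because its links are individually more reliable, which shows why path length alone is not a sufficient criterion.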
IV-B Probabilistic Maximum Coverage Bandits
In the probabilistic maximum coverage problem, an online shopping site advertises items that are selected from a catalog of items to its users. Each user inspects all of the items that are advertised and likes one of the attractive items. The users do not like any item if none of the items attract them. The goal of the shopping site is to maximize the number of likes. Analogous to cascading bandits, in this problem, base arms are item-user pairs , where and . User finds item attractive independent of other users and other items with probability . The super arms are the set of all pairs such that item is the element of a size- subset of .
This can also model the problem of allocating orthogonal channels to secondary users in a cognitive radio network [5]. Consider as the number of orthogonal channels, as the number of secondary users (), and as the expected throughput that user can obtain using channel . We would like to maximize the expected sum throughput by allocating each user a unique channel so that if and only if for all . Given one such allocation, the corresponding super arm would be the set and the expected reward of it can be written as . Allocating orthogonal channels to secondary users can also be conceptualized as allocating tasks to workers in a mobile crowdsourcing platform [6, 43]. Then, would be the probability of worker completing task successfully and would be the expected number of completed tasks.
In its classical form, this problem does not have any PTAs. In order to provide an example case with strictly positive triggering probabilities, we introduce the word-of-mouth effect as follows. Regardless of the shopping site’s decisions, we assume that users inspect, i.e., they explicitly search or navigate to, unadvertised items independently with probability . [Footnote 2: For simplicity, we assume that is the same for all items, although it can differ in practice.] This can happen if users hear about the items outside of the shopping site (e.g., from their friends or from another venue). Then, the triggering probabilities can be written as
[TABLE]
and the expected reward of super arm can be written as
[TABLE]
for which Assumptions 3 and 4 hold when and respectively.
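The word-of-mouth variant can be illustrated with the following sketch. It assumes, based on the verbal description only, that a user likes something exactly when at least one inspected item is attractive, that advertised items are always inspected, and that every other item is inspected independently with a common probability. The names `attraction`, `word_of_mouth_prob`, and the brute-force oracle are hypothetical choices for this example.

```python
from itertools import combinations
from math import prod

def like_probability(attraction_row, advertised, word_of_mouth_prob):
    """Probability that one user likes at least one item.

    attraction_row[i] is this user's attraction probability for item i.
    Advertised items are inspected for sure, the others with
    probability word_of_mouth_prob.
    """
    no_like = prod(
        1.0 - (1.0 if i in advertised else word_of_mouth_prob) * p
        for i, p in enumerate(attraction_row)
    )
    return 1.0 - no_like

def expected_likes(attraction, advertised, word_of_mouth_prob):
    """Sum of like probabilities over users; attraction[j] is the row of user j."""
    return sum(like_probability(row, advertised, word_of_mouth_prob) for row in attraction)

def exact_oracle(attraction, num_advertised, word_of_mouth_prob):
    """Enumerate all item subsets of the given size (feasible only at small scale)."""
    num_items = len(attraction[0])
    return max(
        (frozenset(c) for c in combinations(range(num_items), num_advertised)),
        key=lambda s: expected_likes(attraction, s, word_of_mouth_prob),
    )

attraction = [[0.9, 0.1, 0.1], [0.1, 0.8, 0.1]]  # two users, three items
print(exact_oracle(attraction, 2, 0.05))
```

When the word-of-mouth probability is set to zero, the sketch reduces to the classical problem without probabilistically triggered arms.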
IV-C Influence Maximization Bandits
In the influence maximization problem with the independent cascade model [21], the learner is given a directed graph denoted by , where is the set of nodes and is the set of edges. The learner selects and triggers a set of nodes such that , where is one of the problem parameters. This is the first iteration of a diffusion process. In each subsequent iteration, a node that was triggered in the previous iteration might trigger another node that is not triggered yet if is adjacent to one of its outgoing edges. This happens with probability independently from the states of all other nodes. The diffusion process ends when no new node triggers in an iteration. The goal of the learner is to maximize—through the initial decision of nodes—the number of triggered nodes at the end of the diffusion process.
The problem can be modeled as a CMAB problem with PTAs, where base arms are edges and super arms are the set of all edges such that . [Footnote 3: This is equivalent to defining the super arm as itself.] Assumption 3 holds as proven in Lemma 6 in [20] and Assumption 4 holds as proven in Lemma 2 in [22].
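One run of the diffusion process described above can be simulated with the short sketch below. It assumes edge-level feedback, so the outcome of every outgoing edge of a triggered node is recorded, and it processes nodes in breadth-first order, which yields the same distribution over final triggered sets as the iteration-by-iteration description. The graph representation and all names are hypothetical.

```python
import random
from collections import deque

def simulate_independent_cascade(out_edges, edge_prob, seeds, rng):
    """Run one diffusion; return (triggered nodes, observed edge outcomes).

    out_edges maps a node to a list of its out-neighbors, and edge_prob maps
    an edge (u, v) to its influence probability.
    """
    triggered = set(seeds)
    frontier = deque(seeds)
    observed = {}
    while frontier:
        u = frontier.popleft()
        for v in out_edges.get(u, []):
            # Each edge is sampled at most once, when its tail node is processed.
            success = rng.random() < edge_prob[(u, v)]
            observed[(u, v)] = success
            if success and v not in triggered:
                triggered.add(v)
                frontier.append(v)
    return triggered, observed

graph = {"a": ["b", "c"], "b": ["d"]}
probs = {("a", "b"): 0.5, ("a", "c"): 0.5, ("b", "d"): 0.5}
nodes, outcomes = simulate_independent_cascade(graph, probs, ["a"], random.Random(0))
print(len(nodes), outcomes)
```

The edges that appear in `outcomes` play the role of the triggered base arms in that round; edges leaving nodes that were never reached produce no feedback.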
V Combinatorial Thompson Sampling
CTS is a Bayesian algorithm that selects super arms by sampling from posterior distributions of base arms. Its pseudocode is given in Algorithm 1. We assume that the learner has access to an exact computation oracle, which takes as input an -dimensional parameter vector and the problem structure , and outputs a super arm, denoted by such that . CTS keeps a Beta posterior over the mean outcome of each base arm. At the beginning of round , for each base arm it draws a sample from its posterior distribution. Then, it forms the parameter vector in round as , gives it to the exact computational oracle, and selects the super arm . At the end of the round, CTS updates the posterior distributions of the triggered base arms using the observation .
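The loop described above can be sketched as follows. This is not a transcription of Algorithm 1: it assumes binary base arm outcomes, a uniform Beta(1, 1) prior, and two hypothetical callables, `oracle` (parameter vector to super arm) and `play` (super arm to observed outcomes of the triggered arms). The toy top-k problem is included only so that the sketch runs end to end.

```python
import random

def combinatorial_thompson_sampling(num_arms, oracle, play, num_rounds, rng):
    """Sketch of CTS with Beta posteriors and binary base arm outcomes.

    oracle(theta) -> super arm; play(super_arm) -> {triggered arm: 0 or 1}.
    Both callables are hypothetical interfaces chosen for this sketch.
    """
    successes = [1] * num_arms  # assumed Beta(1, 1) prior, i.e. uniform
    failures = [1] * num_arms
    for _ in range(num_rounds):
        # One posterior sample per base arm forms the parameter vector.
        theta = [rng.betavariate(successes[i], failures[i]) for i in range(num_arms)]
        super_arm = oracle(theta)
        # Semi-bandit feedback: only triggered base arms report an outcome.
        for arm, outcome in play(super_arm).items():
            successes[arm] += outcome
            failures[arm] += 1 - outcome
    return successes, failures

def make_top_k_oracle(k):
    """Exact oracle for a toy problem whose reward is the sum of the chosen means."""
    def oracle(theta):
        return sorted(range(len(theta)), key=lambda i: theta[i], reverse=True)[:k]
    return oracle

def make_bernoulli_environment(means, rng):
    """Toy environment in which every selected arm is triggered with probability 1."""
    def play(super_arm):
        return {i: int(rng.random() < means[i]) for i in super_arm}
    return play

rng = random.Random(0)
means = [0.9, 0.8, 0.2, 0.1]
a, b = combinatorial_thompson_sampling(
    num_arms=4,
    oracle=make_top_k_oracle(2),
    play=make_bernoulli_environment(means, rng),
    num_rounds=500,
    rng=rng,
)
print(a, b)
```

Because only the arms returned by `play` are updated, arms that are never triggered keep their prior, which is the mechanism through which triggering probabilities enter the regret analysis.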
V-A Regret of CTS under Lipschitz Continuity
Theorem 1.
Under Assumptions 1, 2, and 3, for all , the regret of CTS by round is bounded as
[TABLE]
for all , and for all such that , where is the Lipschitz constant in Assumption 3, is a problem independent constant that is also independent of , and is the maximum triggering set size among all super arms.
We compare the result in Theorem 1 with [20], which shows that the regret of CUCB is given an bounded smoothness condition on the expected reward function and a bounded smoothness function of . When is sufficiently small, the regret bound in Theorem 1 is asymptotically equivalent to the regret bound for CUCB (in terms of the dependence on , , and for ). For the case with (no probabilistic triggering), the regret bound in Theorem 1 matches with the regret bound in Theorem 1 in [28] (in terms of the dependence on and for ).
As final remarks, it is shown in Theorem 3 in [22] that the factor that multiplies the term is unavoidable in general. Moreover, regarding the exponential term , it is shown in Theorem 3 in [28] that there is at least one instance of CMAB (hence, also an instance of CMAB-PTA) where the regret of CTS is . Intuitively, such an exponential term is unavoidable since for CTS to select an optimal super arm that can trigger base arms, all of the samples from those base arms should independently be close to their true means. The proof of Theorem 1 is given in the supplemental document. It can also be found in the conference version of the paper [1].
V-B Bayesian Regret of CTS under Lipschitz Continuity
Theorem 2.
Under Assumptions 1, 2, 3, and 5, when averaged over , the Bayesian regret of CTS by round is bounded as
[TABLE]
for all , where is the Lipschitz constant in Assumption 3.
As mentioned in Section III-E, the Bayesian regret bound in Theorem 2 can be interpreted as a gap-free regret bound for CTS that holds asymptotically.
V-C Bayesian Regret of CTS under the TPM Lipschitz Continuity
Theorem 3.
Under Assumptions 1, 2, 4, and 5, when averaged over , the Bayesian regret of CTS by round is bounded as
[TABLE]
where is the Lipschitz constant in Assumption 4.
We improve the Bayesian regret bound in Theorem 2 under the stricter TPM Lipschitz continuity assumption and obtain a regret bound that is completely free of triggering probabilities. Similar to Theorem 2, the Bayesian regret bound in Theorem 3 can be interpreted as an asymptotic regret bound for CTS.
V-D Regret of CTS for Strictly Positive Triggering Probabilities
We improve the regret bound in Theorem 1 when all triggering probabilities are strictly positive.
Theorem 4.
Under Assumptions 1, 2, and 3, for all such that , the regret of CTS by round is bounded as
[TABLE]
for all , and for all such that , where is the Lipschitz constant in Assumption 3, is a problem independent constant that is also independent of , and is the maximum triggering set size among all super arms.
Note that having all triggering probabilities be strictly positive makes the exploration aspect of the MAB problem trivial. No matter which actions the learner takes, all base arms provide occasional feedback. As a result, the upper bound for the expected regret becomes independent of the time horizon . We compare the result of Theorem 4 with [30], which shows a similar bound for CTS in the exact same setting. While the bound in [30] is of order with respect to , the bound in Theorem 4 is of order .
As a final remark, we observe that the regret bound in Theorem 4 does not match the lower bound of order given in Theorem 1 in [25], which is proven for a special case of our setting where rewards only depend on the selected arm. Assumptions 3 and 4, on the other hand, allow rewards to depend on all arms in the triggering set of the selected super arm, either independent of or proportionally to their triggering probabilities. Considering that the reward model in [25] satisfies both Assumption 3 and Assumption 4, and that Assumption 4 is necessary to get rid of the terms in the previously discussed upper bounds, showing an upper bound of order instead of order for the case with strictly positive triggering probabilities might only be possible under Assumption 4. The proof of Theorem 4 is given in the supplemental document.
VI Proof of Theorem 2
We extend the proof technique used in [12] to CMAB-PTA. The technique relies on Fact 1, which establishes a relationship between Thompson sampling and upper confidence sequences commonly encountered in UCB-based analyses. According to Fact 1, the Bayesian regret is bounded by the difference between the true rewards and an upper confidence bound for the estimated rewards of the selected super arm and the optimal super arm. We show that these differences either shrink quickly as sample size increases (for the selected super arm) or are less than zero (for the optimal super arm) with overwhelming probability.
VI-A Preliminaries
All equalities and inequalities concerning random variables hold with probability . The complement of set is denoted by . The indicator function is given as . denotes the number of triggering attempts on base arm (i.e., the number of rounds in which it was in the triggering set of the selected super arm) until round , denotes the number of times base arm is triggered until round , and denotes the empirical mean outcome of base arm at the start of round , where is the Bernoulli random variable with mean that is used for updating the posterior distribution that corresponds to base arm in CTS.
Given a particular base arm , let be the round for which base arm is in the triggering set of the selected super arm for the th time and let . Note that we have and for all . In order to decompose the regret, we make use of an upper confidence bound sequence for the reward of super arm at round , where and
[TABLE]
We also make use of the following events:
[TABLE]
VI-B Facts and Lemmas
Fact 1.
(Proposition 1 in [12]) For any upper confidence bound sequence ,
[TABLE]
Proof.
Since is sampled from the posterior distribution of given observation history , and follow the same distribution when conditioned on . Together with the fact that is a deterministic function when conditioned on , we have
[TABLE]
for all . ∎
Fact 2.
(Lemma 1 in [12])
[TABLE]
Fact 3.
(Multiplicative Chernoff bound [20, 44]) Let be Bernoulli random variables taking values in such that for all , and . Then, for all ,
[TABLE]
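For orientation, one common statement of the multiplicative Chernoff bound in its conditional-expectation form is given below. The symbols $X_t$, $Y$, $n$, $\mu$, and $\delta$ are notation assumed for this sketch, and the exact constants used in Fact 3 may differ.

```latex
% Assumed notation: X_1, ..., X_n are {0,1}-valued,
% E[X_t | X_1, ..., X_{t-1}] >= \mu for all t <= n, and Y = X_1 + ... + X_n.
\[
  \Pr\bigl[\, Y \le (1 - \delta)\, n \mu \,\bigr]
  \;\le\; \exp\!\left( - \frac{\delta^{2} n \mu}{2} \right),
  \qquad 0 < \delta < 1 .
\]
```

In the analysis, a bound of this type is what links the number of times a base arm is in the triggering set to the number of times it is actually triggered.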
Lemma 1.
When CTS is run, we have
[TABLE]
for all , , and .
Proof.
[TABLE]
VI-C Main Part of the Proof
We decompose the Bayesian regret as
[TABLE]
where (3) is due to Fact 1, and (6) is obtained by observing
[TABLE]
for all .
VI-C1 Bounding (4)
When and hold, we have
[TABLE]
where (7) is due to and (8) is due to . Then,
[TABLE]
where (9) holds since implies and (10) holds since .
VI-C2 Bounding (5)
When holds, we have
[TABLE]
for all . Then,
[TABLE]
where (11) is due to Assumption 5. Hence, .
VI-C3 Bounding (6)
We have
[TABLE]
where (12) is due to Fact 2 and Lemma 1 respectively for the two terms.
VII Proof of Theorem 3
In order to take advantage of Assumption 4, we use the concept of triggering probability groups from [22]. However, the rest of our analysis is quite different from [22] and mainly follows the same technique we have followed in Section VI when proving Theorem 2.
VII-A Preliminaries
In addition to the preliminaries in Section VI-A for the proof of Theorem 2, we make the following definitions. For , let denote the th triggering probability group of base arm and let denote the index of the triggering probability group of base arm i that super arm belongs to, i.e., is such that . We use these definitions to introduce the following counters: and . By definition, and .
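To make the grouping concrete, the sketch below maps a triggering probability to a group index. It assumes the dyadic definition commonly used with this technique, in which group j collects the super arms whose triggering probability for the given base arm lies in the interval (2^-j, 2^-j+1]; the function name is hypothetical.

```python
import math

def triggering_group_index(trigger_prob):
    """Return j >= 1 such that 2**(-j) < trigger_prob <= 2**(-j + 1)."""
    if not 0.0 < trigger_prob <= 1.0:
        raise ValueError("triggering probability must lie in (0, 1]")
    return math.floor(-math.log2(trigger_prob)) + 1

for p in (1.0, 0.75, 0.5, 0.3, 0.25):
    print(p, triggering_group_index(p))
```

Under this assumed definition, all triggering probabilities within a group differ by at most a factor of two, which is what allows a single counter per group to stand in for the individual probabilities.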
Given a particular base arm , let be the round for which base arm is in the triggering set of the selected super arm and for the th time and let . Note that we have , , and for all . We also make the following change to event :
[TABLE]
VII-B Facts and Lemmas
Lemma 2.
Fix , and . When CTS is run, we have
[TABLE]
Proof.
[TABLE]
where (13) holds due to Fact 3 and (14) holds since implies that . ∎
VII-C Main Part of the Proof
We decompose the Bayesian regret the same way as we did in Section VI-C. Note that (6) still holds since
[TABLE]
for all .
VII-C1 Bounding (4)
When holds, one of the following must be the case:
[TABLE]
Combining the two results, we obtain
[TABLE]
When also holds, we have
[TABLE]
where (16) is due to , (17) is due to (15), and (18) holds since . Then,
[TABLE]
where (19) holds since .
VII-C2 Bounding (5)
We bound (5) the same way we did in Section VI-C2.
VII-C3 Bounding (6)
We have
[TABLE]
where (20) is due to Fact 2 and (21) is due to Lemma 2.
VIII Numerical Results
In this section, we compare CTS with other state-of-the-art CMAB algorithms in three different applications: cascading bandits, probabilistic maximum coverage bandits, and influence maximization bandits introduced in Section IV. We compare the performance of CTS with CUCB in [20] in all settings. For the first two problems, we assume that all algorithms have access to an exact computation oracle that computes the estimated optimal super arm in each round. On the other hand, for the third problem, we assume that all algorithms use an approximation oracle. For cascading bandits only, we also compare CTS with algorithms specifically designed for this setting: CascadeKL-UCB in [4] and TS-Cascade in [26]. The former uses the principle of optimism in the face of uncertainty to compute Kullback-Leibler divergence based UCBs, while the latter uses Thompson sampling with a Gaussian posterior over the base arms.
VIII-A Cascading Bandits
We consider the disjunctive case with , and , and generate s by sampling uniformly at random from . We run both CTS and CUCB for rounds, and report their regrets averaged over runs in Fig. 1, where error bars represent the standard deviation of the regret (multiplied by 10 for visibility). In this setting, CTS significantly outperforms CUCB by achieving a final regret that is no more than of the final regret of CUCB. The relatively poor performance of CUCB can be explained by an excessive number of explorations due to UCBs that stay high for a large number of rounds.
We also consider the same class of problems as in [4], where and the probability that the user finds page attractive is given as
[TABLE]
Similar to [4], we set and vary other parameters, namely , , and . We run both CTS and CUCB for rounds in all problem instances, and report their regrets averaged over runs in Table II.
In addition to CUCB, we compare CTS against CascadeUCB1 and CascadeKL-UCB given in [4], and TS-Cascade given in [26]. Note that the regrets of CUCB and CascadeUCB1 match very closely, as the two algorithms are essentially the same when CUCB is applied to cascading bandits, except for some minor differences in the initialization stage and in how UCBs larger than 1 are handled. We observe that CTS outperforms all other algorithms in all problem instances by achieving a regret that is at most of the regret of all other algorithms. For CTS, we also see that the regret increases as the number of pages () increases, decreases as the number of recommended items () increases, and increases as decreases, which is very similar to the major observations made in [4].
VIII-B Probabilistic Maximum Coverage Bandits
Our experimental setup for this case is based on the MovieLens dataset [45] as in [30]. [Footnote 4: While the probabilistic maximum coverage problem is NP-hard, here we focus on a small-scale problem and use an exact computation oracle.] The dataset contains 20 million movie ratings that were assigned between January 1995 and March 2015. Of these, we only use the ones that were assigned between March 2014 and March 2015. In the experiments, the recommender chooses movies out of movies, which include of the most rated movies, of the least rated movies and randomly selected movies from the dataset. These movies are rated by users.
In total, there are genres in the dataset. Each movie belongs to at least one genre. We take genre information into account to define attraction probabilities. For this, we create a -dimensional vector for each movie , where if the movie belongs to genre and [math] otherwise. Using these vectors, we calculate a genre preference vector for each user as
[TABLE]
where is the set of movies that user rated and is a random vector such that for . The noise is introduced to model exploratory behavior of the user. Finally, defining and as the normalized versions of the vectors we have defined, the attraction probabilities are calculated as
[TABLE]
where is the average rating of movie .
We run both CTS and CUCB for rounds, and report their regrets averaged over runs in Fig. 2, where error bars represent standard deviation of the regret (multiplied by 100 for visibility). We consider two cases with and . For both cases, CTS significantly outperforms CUCB by achieving a final regret that is no more than of the final regret of CUCB.
VIII-C Influence Maximization Bandits
We consider a directed version of the Facebook network dataset [46] that consists of k edges and nodes. Since the dataset does not contain influence probabilities, we artificially generate them by setting where represents the set of outgoing neighbors of node . We assume that in each round the learner selects a seed set of nodes and this set forms the selected super arm. Moreover, we assume that the influence propagates, starting from the seed set, according to the independent cascade model [21], which is one of the most widely used influence propagation models. We adopt the edge-level feedback model in which the learner observes both the set of influenced nodes and the influence outcomes of the outgoing edges of these nodes.
Since the problem itself is NP-hard, an exact computation oracle is computationally infeasible for the given graph size. Nevertheless, many computationally efficient approximation algorithms exist for the influence maximization problem (see e.g., CELF in [47], and TIM and TIM+ in [48]). Due to its computational efficiency and good performance in practice, we set the learner to use TIM+ as the approximation oracle. When given as input an influence graph with nodes and edges, the influence probabilities on these edges and parameters and , TIM+ is guaranteed to return an -approximate solution with probability at least and with time complexity . For all experiments, we set and . Since the learner uses an approximation oracle, instead of the regret given in (1) we consider the -approximation regret as given in [20] in the remainder of this section.
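The approximation regret mentioned above can be sketched as follows, assuming the usual definition in which the benchmark is the optimal expected reward scaled by both the approximation ratio and the success probability of the oracle. The argument names are hypothetical, and in practice the optimal reward would itself have to be estimated because computing it exactly is infeasible at this scale.

```python
def approximation_regret(optimal_reward, alpha, beta, rewards):
    """(alpha, beta)-approximation regret after len(rewards) rounds.

    optimal_reward: expected reward of the best super arm (assumed known here).
    alpha: approximation ratio of the oracle; beta: its success probability.
    rewards: per-round rewards obtained by the learner.
    """
    benchmark = alpha * beta * optimal_reward * len(rewards)
    return benchmark - sum(rewards)

print(approximation_regret(10.0, 0.5, 1.0, [4.0, 6.0]))
```

Setting both `alpha` and `beta` to 1 recovers the ordinary regret against the optimal super arm.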
We run both CTS and CUCB for rounds and report their regrets averaged over runs in Fig. 3. Here, error bars represent the standard deviation of the regret multiplied by for visibility. Note that in these simulations, we consider the realized regret of the learner’s actions instead of the expected regret that we use in the other experiments. This is once again due to the complexity of the problem and the difficulty in calculating expected regret. Again, it is observed that CTS significantly outperforms CUCB by achieving a final regret that is no more than of the final regret of CUCB. The relatively poor performance of CUCB is due to the fact that the considered time horizon is not long enough for CUCB to efficiently explore all base arms. It is observed that the UCBs of many base arms remain above even at the end of rounds. As an algorithm that is based on the principle of optimism in the face of uncertainty, CUCB’s performance completely depends on the confidence sets it uses to calculate the UCB indices, and this example shows that these confidence sets are not tight enough to guarantee fast convergence.
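The effect of loose confidence sets can be illustrated numerically. The sketch below assumes that CUCB adds a confidence radius of the form sqrt(3 ln t / (2 T_i)) to the empirical mean, where T_i is the number of observations of base arm i; this form and the function names are assumptions of the example. It computes how many observations an arm needs before the radius alone drops below 1, the point at which an upper bound on a probability starts to carry information.

```python
import math

def cucb_radius(round_index, num_observations):
    """Confidence radius sqrt(3 ln t / (2 T_i)), assumed here for CUCB."""
    return math.sqrt(3.0 * math.log(round_index) / (2.0 * num_observations))

def observations_for_informative_ucb(round_index):
    """Smallest T_i for which the radius alone is strictly below 1."""
    return math.floor(1.5 * math.log(round_index)) + 1

for t in (10**3, 10**4, 10**5):
    needed = observations_for_informative_ucb(t)
    print(t, needed, cucb_radius(t, needed))
```

In a graph with many edges, a large share of base arms may be observed fewer times than this threshold within the horizon, which is consistent with the explanation given above.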
IX Conclusion
We analyzed the regret of CTS for CMAB-PTA and proved (i) an order optimal gap-dependent regret bound when the expected reward function is Lipschitz continuous without assuming monotonicity, (ii) a Bayesian regret bound equivalent to an asymptotic gap-free regret bound assuming monotonicity, (iii) a Bayesian regret bound that is independent of triggering probabilities under the triggering modulated Lipschitz continuity assumption, and (iv) an improved regret bound that is independent of the time horizon for the special case when the triggering probabilities are strictly positive.
References
- [1] A. Hüyük and C. Tekin, “Analysis of Thompson sampling for combinatorial multi-armed bandit with probabilistically triggered arms,” in Proc. 22nd Int. Conf. Artif. Intell. and Statist., 2019, pp. 1322–1330.
- [2] T. N. Dinh, H. Zhang, D. T. Nguyen, and M. T. Thai, “Cost-effective viral marketing for time-critical campaigns in large-scale social networks,” IEEE/ACM Trans. Netw., vol. 22, no. 6, pp. 2001–2011, 2014.
- [3] G. Tong, W. Wu, S. Tang, and D.-Z. Du, “Adaptive influence maximization in dynamic social networks,” IEEE/ACM Trans. Netw., vol. 25, no. 1, pp. 112–125, 2017.
- [4] B. Kveton, C. Szepesvari, Z. Wen, and A. Ashkan, “Cascading bandits: learning to rank in the cascade model,” in Proc. 32nd Int. Conf. Mach. Learn., 2015, pp. 767–776.
- [5] Y. Gai, B. Krishnamachari, and R. Jain, “Combinatorial network optimization with unknown variables: multi-armed bandits with linear rewards and individual observations,” IEEE/ACM Trans. Netw., vol. 20, no. 5, pp. 1466–1478, 2012.
- [6] S. K. née Müller, C. Tekin, M. van der Schaar, and A. Klein, “Context-aware hierarchical online learning for performance maximization in mobile crowdsourcing,” IEEE/ACM Trans. Netw., vol. 26, no. 3, pp. 1334–1347, 2018.
- [7] H.-W. Lee, E. Modiano, and K. Lee, “Diverse routing in networks with probabilistic failures,” IEEE/ACM Trans. Netw., vol. 18, no. 6, pp. 1895–1907, 2010.
- [8] H. Robbins, “Some aspects of the sequential design of experiments,” Bull. Amer. Math. Soc., vol. 55, pp. 527–535, 1952.
