Thompson Sampling For Stochastic Bandits with Graph Feedback
Aristide C. Y. Tossou, Christos Dimitrakakis, Devdatt Dubhashi

TL;DR
This paper extends Thompson Sampling to stochastic bandit problems with unknown or changing graph feedback structures, providing theoretical regret guarantees and demonstrating superior empirical performance over UCB-based methods.
Contribution
It introduces a novel Thompson Sampling algorithm for graph feedback in stochastic bandits, applicable even with unknown or dynamic graph structures, with proven regret bounds.
Findings
Algorithm outperforms UCB-based methods on various real and simulated networks.
Theoretical regret bounds linked to graph properties.
Effective on diverse graph types including power law and social networks.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Thompson Sampling For Stochastic Bandits with Graph Feedback
Aristide C. Y. Tossou
Computer Science and Engineering
Chalmers University of Technology
Gothenburg, Sweden
Christos Dimitrakakis
University of Lille, France
Chalmers University of Technology
Harvard University, USA
Devdatt Dubhashi
Computer Science and Engineering
Chalmers University of Technology
Gothenburg, Sweden
Abstract
We present a novel extension of Thompson Sampling for stochastic sequential decision problems with graph feedback, even when the graph structure itself is unknown and/or changing. We provide theoretical guarantees on the Bayesian regret of the algorithm, linking its performance to the underlying properties of the graph. Thompson Sampling has the advantage of being applicable without the need to construct complicated upper confidence bounds for different problems. We illustrate its performance through extensive experimental results on real and simulated networks with graph feedback. More specifically, we tested our algorithms on power law, planted partitions and Erdős–Rényi graphs, as well as on graphs derived from Facebook and Flixster data. These all show that our algorithms clearly outperform related methods that employ upper confidence bounds, even if the latter use more information about the graph.
1 Introduction
Sequential decision making problems under uncertainty appear in most modern applications, such as automated experimental design, recommendation systems and optimisation. The common structure of these applications is that, at each time step $t$, the decision-making agent is faced with a choice. After each decision, it obtains some problem-dependent feedback (?). For the so-called bandit problem, the choices are between different arms, and the feedback consists of a single scalar reward obtained by the arm at time $t$. For the prediction (or full-information) problem, it obtains the reward of the chosen arm, but also observes the rewards of all other choices at time $t$. In both cases, the problem is to maximise the total reward obtained over time. However, dealing with specific types of feedback may require specialised algorithms. In this paper, we show that the Thompson sampling algorithm can be applied successfully to a range of sequential decision problems, whose feedback structure is characterised by a graph.
Our algorithm is an extension of Thompson sampling, introduced in (?). Although easy to implement and effective in practice, it remained unpopular until relatively recently. Interest grew after empirical studies (?; ?) demonstrated performance exceeding the state of the art. This has prompted a surge of interest in Thompson sampling, with the first theoretical results (?) and industrial adoption (?) appearing only recently. However, there are still only a few theoretical results, and many of these are in the simplest settings. Since Thompson sampling is easy to implement and effective in many different settings with complex feedback structures, there is a great need to extend the theoretical results to these wider settings.
? argue that Thompson sampling is a very effective and versatile strategy for different information structures. Their paper focuses on specific examples: the two extreme cases of no and full information mentioned above and the case of linear bandits and combinatorial feedback.
Here we consider the case where the feedback is defined through a graph (?; ?). More specifically, the arms (choices) are vertices of a (potentially changing) graph and when an arm is chosen, we see the reward of that arm as well as those of its neighbours. On the one hand, it is a clean model for theoretical and experimental analysis and, on the other hand, it also corresponds to realistic settings in social networks, for example in advertisement settings (cf. (?)).
We provide a problem-independent regret bound (in the sense that it does not depend on the reward structure) that is parametrized by the clique cover number of the graph and naturally generalizes the two extreme cases of zero and full information. We present two variants of Thompson sampling that are both very easy to implement and computationally efficient. The first is straightforward Thompson sampling, and so draws an arm according to its probability of being the best, but also uses the graph feedback to update the posterior distribution. The second one can be seen as sampling cliques in the graph according to their probability of containing the best arm, and then choosing the empirically best arm in that clique. Neither algorithm requires knowledge of the complete graph.
Almost all previous algorithms require the full structure of the feedback graph in order to operate. Some require the entire graph for performing their updates only at the end of the round (e.g. (?)). Others actually need the description of the graph at the beginning of the round to make their decision, and almost none of the algorithms previously proposed in the literature is able to provide non-trivial regret guarantees without the feedback graphs being disclosed. However, ? (?) argue that the assumption that the entire observation system is revealed to the learner on each round, even if only after making her prediction, is rather unnatural. In principle, the learner need not even be aware of the fact that there is a graph underlying the feedback model; the feedback graph is merely a technical notion for us to specify a set of observations for each of the possible arms. Ideally, the only signal we would like the agent to receive following each round is the set of observations that corresponds to the arm she has taken on that round (in addition to the obtained reward). Our algorithms work in this setup: they do not need the whole graph to be disclosed, either when selecting the arm or when updating beliefs; only the local neighborhood is needed. Furthermore, the underlying graph is allowed to change arbitrarily at each step. The detailed proofs of all our main results are available in the full version of this paper.
2 Setting
2.1 The stochastic bandit model
The stochastic $K$-armed bandit problem is a well-known sequential decision problem in which an agent sequentially chooses among a set of $K$ arms $\mathcal{V} = \{1, \dots, K\}$. At each round $t$, the agent plays an arm $a_t \in \mathcal{V}$ and receives a reward $r_t = R(Y_{t,a_t})$, where $Y_{t,i}$ is a random variable defined on some probability space with law $P$ and $R$ is a reward function.
Each arm $i$ has mean reward $\mu_{i}(P) = \mathbb{E}_{P} R(Y_{t,i})$. The agent's goal is to maximize its expected cumulative reward after $T$ rounds. An equivalent notion is to minimize the expected regret against an oracle which knows $P$. More formally, the expected regret $\mathbb{E}^{\pi}_{P}\mathcal{L}$ of an agent policy $\pi$ for a bandit problem $P$ is defined as:
\[ \mathbb{E}^{\pi}_{P}\mathcal{L} \triangleq T\mu_{*}(P) - \mathbb{E}^{\pi}_{P}\sum_{t=1}^{T} r_t , \]
where $\mu_{*}(P) = \max_{i} \mu_{i}(P)$ is the mean of the optimal arm and $\pi$ is the policy of the agent, defining a probability distribution on the next arm $a_t$ given the history $h_t$ of previous arms and rewards.
The main challenge in this model is that the agent does not know $P$, and it only observes the reward of the arm it played. As a consequence, the agent must trade off exploitation (taking the apparently best arm) with exploration (trying out other arms to assess their quality).
The Bayesian setting offers a natural way to model this uncertainty, by assuming that the underlying probability law $P$ is in some set $\mathcal{P} = \{P_{\theta} : \theta \in \Theta\}$ parametrised by $\theta$, over which we define a prior probability distribution $\mathbb{P}$. In that case, we can define the Bayesian regret:
\[ \mathbb{E}^{\pi}\mathcal{L} \triangleq \int_{\Theta} \mathbb{E}^{\pi}_{P_{\theta}}\mathcal{L} \, \mathrm{d}\mathbb{P}(\theta) . \]
A policy with small Bayesian regret may not be uniformly good for all $\theta \in \Theta$. Since in the Bayesian setting we frequently need to discuss posterior probabilities and expectations, we also introduce the notation $\mathbb{E}_{t} f \triangleq \mathbb{E}(f \mid h_{t})$ and $\mathbb{P}_{t}(\cdot) \triangleq \mathbb{P}(\cdot \mid h_{t})$ for expectations and probabilities conditioned on the current history $h_t$.
2.2 The graph feedback model
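In this model, we assume the existence of an undirected graph $G = (\mathcal{V}, \mathcal{E})$ with vertices corresponding to arms. By taking an arm $a$, we not only receive the reward of the arm we played, but we also observe the rewards of all neighbouring arms $\mathcal{N}_{a} = \{i : (a, i) \in \mathcal{E}\}$. More precisely, at each time-step $t$, having played arm $a_t$, we observe $R(Y_{t,i})$ for all $i \in \mathcal{N}_{a_t} \cup \{a_t\}$, while our reward is still $R(Y_{t,a_t})$.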
If the graph is empty, then the setting is equivalent to the bandit problem. If the graph is fully connected, then it is equivalent to the prediction (i.e. full information) problem. However, many practical graphs, such as those derived from social networks, have an intermediate connectivity. In such cases, the amount of information that we can obtain by picking an arm can be characterised by graph properties, such as the clique cover number:
Definition 2.1 (Clique cover number).
A clique covering $\mathcal{C}$ of a graph $G$ is a partition of all its vertices into sets $C \in \mathcal{C}$ such that the sub-graph formed by each $C$ is a clique, i.e. all vertices in $C$ are connected to each other in $G$. The smallest number of cliques into which the nodes of $G$ can be partitioned is called the clique cover number. We denote by $\mathcal{C}(G)$ the minimum clique cover and by $\chi(\overline{G})$ its size, omitting $G$ when clear from the context.
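To make the definition concrete, the following is a minimal Python sketch of a greedy clique cover. It is an illustration only, not part of the paper's algorithms: computing the exact clique cover number is NP-hard, so the greedy partition below only gives an upper bound on it. The function name `greedy_clique_cover` and the adjacency-dictionary representation are assumptions made for this example.

```python
def greedy_clique_cover(adj):
    """Greedily partition the vertices of an undirected graph into cliques.

    adj maps each vertex to the set of its neighbours (hypothetical
    representation chosen for this sketch). The number of cliques returned
    is an upper bound on the clique cover number, not necessarily the minimum.
    """
    cliques = []
    for v in sorted(adj):
        for clique in cliques:
            # v may join a clique only if it is adjacent to every member.
            if clique <= adj[v]:
                clique.add(v)
                break
        else:
            # No existing clique can take v, so it starts a new one.
            cliques.append({v})
    return cliques


# Two triangles {0, 1, 2} and {3, 4, 5} joined by the single edge (2, 3).
adj = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
cover = greedy_clique_cover(adj)
print(len(cover), cover)
```

On the empty graph every vertex forms its own clique, and on the complete graph a single clique covers everything, which matches the two extreme feedback settings discussed above.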
The domination number is another useful, similar notion for the amount of information that we can obtain.
Definition 2.2 (Domination number).
A dominating set in a graph $G = (\mathcal{V}, \mathcal{E})$ is a subset $U \subseteq \mathcal{V}$ such that for every vertex $u \in \mathcal{V}$, either $u \in U$ or $(u, v) \in \mathcal{E}$ for some $v \in U$. The smallest size of a dominating set in $G$ is called the domination number of $G$ and denoted $\gamma(G)$.
3 Related work and our contribution
Optimal policies for the stochastic multi-armed bandit problem were first characterised by (?), while index-based optimal policies for general non-parametric problems were given by (?). Later (?) proved finite-time regret bounds for a number of UCB (Upper Confidence Bound) index policies, while (?) proved finite-time bounds for index policies similar to those of (?), with problem-dependent bounds of order $O(K \ln T)$. Recently, a number of policies based on sampling from the posterior distribution (i.e. Thompson sampling (?)) were analysed in both the frequentist (?) and Bayesian setting (?) and shown to obtain the same order of regret bound for the stochastic case. For the adversarial bandit problem the bounds are of order $\sqrt{KT}$. The analysis for the full information case generally results in $O(\sqrt{T \ln K})$ bounds on the regret (?), i.e. with a much lower dependence on the number of arms.
Intermediate cases between full information and bandit feedback can be obtained through graph feedback, introduced in (?), which is the focus of this paper. In particular, (?) and (?) analysed graph feedback problems with stochastic and adversarial reward sequences respectively. Specifically, ? analysed variants of Upper Confidence Bound policies, for which they obtained problem-dependent bounds. In more recent work, (?) also introduced algorithms for graphs where the structure is never fully revealed showing that (unlike the bandit setting) there is a large gap in the regret between the adversarial and stochastic cases. In particular, they show that in the adversarial setting one cannot do any better than ignore all additional feedback, while they provide an action-elimination algorithm for the stochastic setting. Finally, (?) obtain a problem-dependent bound of the form where is the linear programming relaxation to and is the minimum degree of .
Contributions.
In this paper, we provide much simpler strategies based on Thompson sampling, with a matching regret bound. Unlike previous work, these are also applicable to graphs whose structure is unknown or changing over time. More specifically:
1. We extend (?) to graph-structured feedback, and obtain a problem-independent bound of $\sqrt{\frac{1}{2}\chi(\overline{G})\mathbb{H}(\alpha_{1})T}$ (Theorem 4.3).
2. Using planted partition models, we verify the bound's dependence on the clique cover.
3. We provide experiments on data drawn from two types of random graphs: Erdős–Rényi graphs and power law graphs, showing that our algorithms clearly outperform UCB and its variations (?).
4. Finally, we measured the performance on graphs estimated from the data used in (?). Once again, Thompson sampling clearly outperforms UCB and its variants.
4 Algorithms and analysis
We consider two algorithms based on Thompson sampling. The first uses standard Thompson sampling to select arms. As this also reveals the rewards of neighbouring arms, the posterior is conditioned on those as well. The second algorithm uses Thompson sampling to select an arm, and then chooses the empirically best arm within that arm’s clique.
4.1 The TS-N policy
The TS-N policy is an adaptation of Thompson Sampling for graph-structured feedback. Thompson Sampling maintains a distribution over the problem parameters. At each step, it selects an arm according to the probability of its mean being the largest. It then observes a set of rewards which it uses to update its probability distribution over the parameters.
For the case where each arm has an independent parameter defining its reward distribution, we can update the distribution of each observed arm separately. A particularly simple case is when all the rewards are generated from Bernoulli distributions. Then we can simply use a Beta prior for each arm, as illustrated by the TS-N policy in Algorithm 1. We note that the algorithm trivially extends to other priors and families.
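The policy described above can be sketched in Python for the Bernoulli case. This is a minimal simulation written from the prose description, not a transcription of the paper's Algorithm 1; the function name `ts_n`, its arguments, and the use of the hidden `means` to simulate rewards and measure regret are all assumptions of this example.

```python
import random


def ts_n(means, neighbours, horizon, seed=0):
    """Simulate a TS-N style policy on Bernoulli arms with graph feedback.

    means[i] is the hidden mean of arm i (used only to simulate rewards and
    to measure regret). neighbours[i] is the set of arms whose rewards are
    revealed when arm i is played, excluding i itself.
    Returns the cumulative expected regret.
    """
    rng = random.Random(seed)
    k = len(means)
    successes = [1] * k  # Beta(1, 1) prior: first shape parameter per arm
    failures = [1] * k   # Beta(1, 1) prior: second shape parameter per arm
    best = max(means)
    regret = 0.0
    for _ in range(horizon):
        # Draw one posterior sample per arm and play the arm with the largest.
        samples = [rng.betavariate(successes[i], failures[i]) for i in range(k)]
        arm = max(range(k), key=samples.__getitem__)
        regret += best - means[arm]
        # Update the posterior of the played arm and of each neighbour.
        # Only the local neighbourhood of the played arm is consulted.
        for i in {arm} | set(neighbours[arm]):
            if rng.random() < means[i]:
                successes[i] += 1
            else:
                failures[i] += 1
    return regret


means = [0.3, 0.4, 0.5, 0.7]
empty = {i: set() for i in range(4)}                      # bandit feedback
complete = {i: set(range(4)) - {i} for i in range(4)}     # full information
regret_bandit = ts_n(means, empty, horizon=2000)
regret_full = ts_n(means, complete, horizon=2000)
print(regret_bandit, regret_full)
```

Because the update loop only looks up `neighbours[arm]`, the sketch never needs the whole graph, and the neighbour sets could be replaced by a callback that returns a different neighbourhood at every round.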
4.2 The TS-MaxN policy
The TS-N policy does not fully exploit the graphical structure. For example, as noted by (?), instead of doing exploration on arm $i$ we could explore an apparently better neighbour, which would give us the same information. More precisely, instead of picking arm $i$, we pick the arm $j \in \mathcal{N}_{i} \cup \{i\}$ with the best empirical mean, where $\mathcal{N}_{i}$ denotes the set of neighbours of $i$. The intuition behind it is that, if we take any arm in $\mathcal{N}_{i} \cup \{i\}$, we are going to observe the reward of $i$ anyway. So it is always better to exploit the best arm in $\mathcal{N}_{i} \cup \{i\}$. The resulting policy, TS-MaxN, is summarized in Algorithm 2. Although our theoretical results do not apply to this policy, it can have better performance as it uses more information.
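A minimal Python sketch of this idea for Bernoulli arms follows. It is written from the prose description rather than from the paper's Algorithm 2, and it assumes an undirected graph (symmetric neighbour sets), so that playing a neighbour of the sampled arm still reveals the sampled arm's reward. The name `ts_maxn`, the tie-breaking rule, and the convention that an unobserved arm has empirical mean 0 are assumptions of this example.

```python
import random


def ts_maxn(means, neighbours, horizon, seed=0):
    """Simulate a TS-MaxN style policy on Bernoulli arms with graph feedback.

    means[i] is the hidden mean of arm i; neighbours[i] is the set of arms
    adjacent to i (symmetric, excluding i itself).
    Returns the cumulative expected regret.
    """
    rng = random.Random(seed)
    k = len(means)
    successes = [1] * k  # Beta(1, 1) prior
    failures = [1] * k
    best = max(means)
    regret = 0.0

    def empirical_mean(i):
        # Subtract the two prior pseudo-counts to get real observations.
        observations = successes[i] + failures[i] - 2
        return (successes[i] - 1) / observations if observations else 0.0

    for _ in range(horizon):
        # Thompson step: sample from each posterior, take the argmax.
        samples = [rng.betavariate(successes[i], failures[i]) for i in range(k)]
        candidate = max(range(k), key=samples.__getitem__)
        # MaxN step: within the candidate's closed neighbourhood, play the
        # arm with the best empirical mean (ties go to the lowest index).
        closed = sorted({candidate} | set(neighbours[candidate]))
        arm = max(closed, key=empirical_mean)
        regret += best - means[arm]
        for i in {arm} | set(neighbours[arm]):
            if rng.random() < means[i]:
                successes[i] += 1
            else:
                failures[i] += 1
    return regret


means = [0.3, 0.4, 0.5, 0.7]
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}  # path graph 0-1-2-3
regret_path = ts_maxn(means, path, horizon=2000)
print(regret_path)
```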
4.3 Analysis of TS-N policy
Russo and Van Roy introduced an elegant approach to the analysis of Thompson sampling. They define the information ratio as a key quantity for analysing information structures:
\[ \Gamma_{t} \triangleq \frac{\left(\mathbb{E}_{t}\left[R(Y_{t,a^{*}}) - R(Y_{t,a_{t}})\right]\right)^{2}}{\mathbb{I}_{t}\left(a^{*}; (a_{t}, O_{t})\right)} \qquad (4.1) \]
where $a^{*}$ is the optimal arm, $a_{t}$ is the arm played at time $t$, $O_{t}$ is the set of observations it yields, and $\mathbb{E}_{t}$ and $\mathbb{I}_{t}$ denote expectation and mutual information respectively, conditioned on the history of arms and observations until time $t$. They show that it follows very generally that
Proposition 4.1.
If $\Gamma_{t} \leq \Gamma$ almost surely for all $1 \leq t \leq T$, then $\mathbb{E}\mathcal{L}(T,\pi^{TS}) \leq \sqrt{\Gamma\,\mathbb{H}(\alpha_{1})\,T}$.
Here $\mathbb{H}$ denotes entropy and $\alpha_{1}$ the prior distribution of the optimal arm. Thus to analyse the performance of Thompson sampling on a specific problem, one may focus on bounding the information ratio (4.1). For the (independent) $K$-armed bandit case, they show that $\Gamma_{t} \leq K/2$, while for the full-information ($K$ experts) case, they show that $\Gamma_{t} \leq 1/2$. We now give a simple but useful extension of their results which is intermediate between these cases.
Proposition 4.2.
Let $\sim$ be an equivalence relation defined on the arms, with $[i]$ denoting the equivalence class of arm $i$, and suppose that playing arm $i$ reveals the outcomes of all arms in $[i]$. Then $\Gamma_{t} \leq \frac{1}{2}\,|\mathcal{V}/\!\sim|$, half the number of equivalence classes.
This is a direct generalisation of propositions 3 and 4 in (?), to which it reduces when the equivalence relation is trivial (bandit case) or full (expert case).
We can now use Proposition 4.2 to analyse graph structured arms:
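Lemma 4.1.
Let $G = (\mathcal{V}, \mathcal{E})$ be a graph with $\mathcal{V}$ corresponding to the arms and suppose that when an arm $i$ is played, we observe the rewards $R(Y_{t,j})$ for all $j \in \mathcal{N}_{i} \cup \{i\}$, i.e. we observe the rewards corresponding to both $i$ and all its neighbours $\mathcal{N}_{i}$. Let $\mathcal{C}$ be a clique cover of $G$, i.e. a partition of $\mathcal{V}$ into cliques. Then $\Gamma_{t} \leq \frac{1}{2}|\mathcal{C}|$.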
Applying Proposition 4.1 and Lemma 4.1, we get a performance guarantee for Thompson sampling with graph-structured feedback:
Theorem 4.3.
For Thompson sampling with feedback from the graph $G$, we have $\mathbb{E}^{\pi^{TS}}\mathcal{L} \leq \sqrt{\frac{1}{2}\chi(\overline{G})\mathbb{H}(\alpha_{1})T}$, where $\chi(\overline{G})$ is the clique cover number of $G$.
Remark 4.1.
The bandit and expert cases are special cases corresponding to the empty graph and the complete graph respectively, since $\chi(\overline{G}) = K$ for the empty graph and $\chi(\overline{G}) = 1$ for the complete graph.
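As a quick numerical illustration of how the bound of Theorem 4.3 interpolates between these two extremes, the sketch below evaluates $\sqrt{\frac{1}{2}\chi(\overline{G})\mathbb{H}(\alpha_{1})T}$ after replacing the entropy by its largest possible value, $\ln K$ (the entropy, in nats, of any distribution over $K$ arms is at most $\ln K$). The function name `ts_regret_bound` and the example values of $K$ and $T$ are assumptions chosen for illustration.

```python
import math


def ts_regret_bound(clique_cover_number, num_arms, horizon):
    """Evaluate sqrt(0.5 * chi * H * T) with H replaced by its maximum ln(K).

    This is a worst-case instantiation of the bound in Theorem 4.3; the
    true entropy of the prior over the optimal arm may be smaller.
    """
    entropy_upper_bound = math.log(num_arms)
    return math.sqrt(0.5 * clique_cover_number * entropy_upper_bound * horizon)


K, T = 500, 10_000
bandit_bound = ts_regret_bound(K, K, T)   # empty graph: clique cover number K
expert_bound = ts_regret_bound(1, K, T)   # complete graph: clique cover number 1
print(bandit_bound, expert_bound, bandit_bound / expert_bound)
```

The ratio between the two values is $\sqrt{K}$, which is the largest gain the graph feedback can provide under this bound.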
Remark 4.2 (Planted Partition Models).
The planted partition models, or stochastic block model graphs, are defined as follows (?; ?): first a fixed partition of the vertices into $k$ parts is chosen, then an edge between two vertices within the same class exists with probability $p$ and an edge between vertices in different classes exists with probability $q$, independently, with $p > q$. If $p = 1$, then with high probability, the clique cover number of the resulting graph is $k$ (corresponding to the planted cliques). Thus for this class of graphs, the regret grows as $\sqrt{\frac{1}{2}k\,\mathbb{H}(\alpha_{1})\,T}$ as per Theorem 4.3. This is explored in Section 5. When $p < 1$ but large, the planted partition graph is considered a good model of the structure of network communities.
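The construction in this remark can be sketched in a few lines of Python. This is an illustrative generator only; the function name `planted_partition`, the round-robin assignment of vertices to parts, and the adjacency-dictionary output are assumptions of this example and are not taken from the paper's experimental code.

```python
import random


def planted_partition(num_vertices, num_parts, p_in, p_out, seed=0):
    """Sample a planted partition (stochastic block model) graph.

    Vertices are assigned to parts round-robin. An edge inside a part is
    present with probability p_in, an edge across parts with probability
    p_out, all independently. Returns (adjacency dict, part label list).
    """
    rng = random.Random(seed)
    part = [v % num_parts for v in range(num_vertices)]
    adj = {v: set() for v in range(num_vertices)}
    for u in range(num_vertices):
        for v in range(u + 1, num_vertices):
            prob = p_in if part[u] == part[v] else p_out
            if rng.random() < prob:
                adj[u].add(v)
                adj[v].add(u)
    return adj, part


# With p_in = 1 and p_out = 0 the graph is exactly a union of planted cliques.
adj, part = planted_partition(num_vertices=12, num_parts=3, p_in=1.0, p_out=0.0)
print(sorted(adj[0]))
```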
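If the underlying graph changes at each time step, then we also have the following bound for the same algorithm:
Corollary 4.1.
Suppose the underlying graph at time $t$ is $G_{t}$, then:
\[ \mathbb{E}^{\pi^{TS}}\mathcal{L} \leq \sqrt{\frac{1}{2}\,\mathbb{H}(\alpha_{1})\sum_{t=1}^{T}\chi(\overline{G_{t}})} . \]
Proof.
The information ratio at time $t$ is bounded by $\frac{1}{2}\chi(\overline{G_{t}})$. ∎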
5 Experiments
We compared our proposed algorithms in terms of the actual expected regret against a number of other algorithms that can take advantage of the graph structure. Our comparison was performed over both synthetic graphs and networks derived from real-world data. (Our source code and data sets will be made available on an open hosting website.)
5.1 Algorithms and hyperparameters.
In all our experiments, we tested against the UCB-MaxN and UCB-N algorithms, introduced in (?). These are the analogues of our algorithms, using upper confidence bounds instead of Thompson sampling.
$\epsilon_t$-greedy-D and $\epsilon_t$-greedy-LP.
For the real-world networks, we also evaluated our algorithms against a variant of $\epsilon_t$-greedy-LP from (?). This is based on a linear program formulation for finding a lower bound on the size of the minimum dominating set. We observe first that their analysis holds for any fixed dominating set $D$ and the bound so obtained is $O(|D| \log T)$. In particular, we may use a simple greedy algorithm to compute a near-optimal dominating set $D$ such that $|D| = O(\gamma(G) \log \Delta)$, where $\gamma(G)$ is the domination number and $\Delta$ is the maximum degree of the graph (no polynomial time algorithm can guarantee a better approximation unless P=NP (?)). Using such a near-optimal dominating set in place of the LP relaxation and choosing arms from it uniformly at random, we obtain a variant of the original algorithm, which we call $\epsilon_t$-greedy-D, which is much more computationally efficient, and which enjoys a similar regret bound:
Theorem 5.1.
The regret of $\epsilon_t$-greedy-D is at most $O(\gamma(G) \log \Delta \log T)$, where $\Delta$ is the maximum degree of the graph.
$\epsilon_t$-greedy-D and $\epsilon_t$-greedy-LP have the hyper-parameters $c$ and $d$, which control the amount of exploration. We found that their performance is highly sensitive to this choice. In our experiments, we find the optimal values for these parameters by performing a separate grid search for each problem, and only report the best results. Since there is no obvious way to tune these parameters online, this leads to a favourable bias in our results for this algorithm. (A similar observation was made in (?), which noted that an optimally tuned $\epsilon$-greedy almost always performs best, but its performance can degrade significantly when the parameters are changed.) Although (?) suggests a method for selecting these parameters, we find that using it leads to a near-linear regret.
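The greedy computation of a near-optimal dominating set mentioned above can be sketched as follows. This is the standard greedy heuristic (repeatedly pick the vertex that dominates the most not-yet-dominated vertices), written as an assumption about what "a simple greedy algorithm" refers to; the function name `greedy_dominating_set` and the adjacency-dictionary input are hypothetical.

```python
def greedy_dominating_set(adj):
    """Compute a dominating set with the standard greedy heuristic.

    adj maps each vertex to the set of its neighbours (undirected graph).
    At each step, pick the vertex whose closed neighbourhood covers the
    largest number of still-undominated vertices (ties: smallest label).
    """
    undominated = set(adj)
    dominating = []
    while undominated:
        best = max(
            sorted(adj),
            key=lambda v: len(({v} | adj[v]) & undominated),
        )
        dominating.append(best)
        undominated -= {best} | adj[best]
    return dominating


# Star graph: the centre alone dominates every leaf.
star = {0: {1, 2, 3, 4, 5}, 1: {0}, 2: {0}, 3: {0}, 4: {0}, 5: {0}}
# Path graph 0-1-2-3-4: two vertices are needed.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
star_set = greedy_dominating_set(star)
path_set = greedy_dominating_set(path)
print(star_set, path_set)
```

An exploration step in the spirit of the variant described above would then draw an arm uniformly at random from the returned set, for example with `random.choice(path_set)`.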
As Thompson sampling is a Bayesian algorithm, we can view the prior distribution as a hyper-parameter. In our experiments, we always set that to a Beta(1,1) prior for all rewards.
5.2 General experimental setup.
For all of our experiments, we performed independent trials and reported the median-of-means estimator (used heavily in the streaming literature (?)) of the cumulative regret. It partitions the trials into equal groups and returns the median of the sample means of each group. We set the number of groups to , so that the confidence interval holds with probability at least .
We also reported the deviation of each algorithm using Gini's Mean Difference (GMD hereafter) (?). GMD computes the deviation as $\frac{2}{N(N-1)}\sum_{j=1}^{N}(2j - N - 1)\,x_{(j)}$, with $x_{(j)}$ the $j$-th order statistic of the sample of size $N$ (that is, $x_{(1)} \leq x_{(2)} \leq \dots \leq x_{(N)}$). As shown in (?; ?), the GMD provides a better approximation of the true deviation than the standard one. To account for the fact that the cumulative regret of our algorithms might not follow a symmetric distribution, we computed the GMD separately for the values above and below the median-of-means.
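The two summary statistics described in this subsection can be sketched in Python as follows. The GMD is computed with the standard order-statistics formula $\frac{2}{N(N-1)}\sum_{j=1}^{N}(2j - N - 1)\,x_{(j)}$, which equals the mean absolute difference over all pairs of sample values; whether the paper uses exactly this normalisation is an assumption. The function names and the choice to drop leftover trials when the sample does not divide evenly into groups are also assumptions of this example.

```python
import statistics


def median_of_means(values, num_groups):
    """Split values into num_groups equal groups and return the median of
    the group means. Leftover values that do not fill a group are dropped."""
    size = len(values) // num_groups
    group_means = [
        statistics.mean(values[g * size:(g + 1) * size])
        for g in range(num_groups)
    ]
    return statistics.median(group_means)


def gini_mean_difference(values):
    """Gini's Mean Difference via order statistics."""
    ordered = sorted(values)
    n = len(ordered)
    weighted = sum((2 * j - n - 1) * x for j, x in enumerate(ordered, start=1))
    return 2.0 * weighted / (n * (n - 1))


regrets = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # hypothetical per-trial regrets
centre = median_of_means(regrets, num_groups=3)
spread = gini_mean_difference([1.0, 2.0, 3.0])
print(centre, spread)
```

To reproduce the asymmetric deviations described above, `gini_mean_difference` would be applied separately to the trials above and below `centre`.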
5.3 Simulated graphs
In our synthetic problems, unless otherwise stated, the rewards are drawn from a Bernoulli distribution whose mean is generated uniformly randomly in except for the optimal arm whose mean is generated randomly in . The number of nodes in the graph is 500. We tested with a sparse graph of 2500 edges and also with a dense graph of 62625 edges.
Erdős–Rényi graphs
In our first experiment, we generate the graph randomly using the Erdős–Rényi model. Figures 2e and 2f show the results for the sparse and dense graphs, respectively.
Our first observation here is that all policies take advantage of a large number of edges, as their cumulative regret is lower on the dense graph (Figure 2f) than on the sparse one (Figure 2e). This confirms the theoretical result, as a dense graph will have a smaller clique cover number than a sparse one.
The policy TS-MaxN outperforms all others in both the sparse and dense graph models. However, the performance of TS-N is very close to that of TS-MaxN in the near-complete graph. This is explained by the fact that in a near-complete graph we have many cliques. It is revealing to see that TS-N outperforms both the UCB-N and UCB-MaxN policies.
Power Law graph
Such graphs are commonly used to generate static scale-free networks (?). In this experiment, we generated a non-growing random graph with expected power-law degree distribution.
Figures 2d and 2c show the results for the dense and sparse graphs, respectively. Again, the policy TS-MaxN clearly outperforms all others. In the sparse graph model, TS-N is beaten by UCB-MaxN at the beginning of the rounds, but catches up and ends up beating UCB-MaxN.
Planted Partition Model
The aim of the experiment on this model is to check the dependency on the number of cliques for each policy. Figure 1 shows the results, where on the x-axis we have the parameter $k$ of the planted partition graph (which is almost equal to the number of cliques) on a graph with 1024 nodes. On the y-axis we have the relative regret of each policy, i.e. the ratio of the regret of each policy to the regret of the best policy when there are two groups, for ease of comparison. As we can see, all methods' regret scales similarly. Thus, the theoretical bounds appear to hold in practice, and to be somewhat pessimistic. For a larger number of nodes, we would expect the plots to flatten later.
5.4 Social networks datasets
Our experiments on real world datasets follow the methodology described in (?). We first infer a graph from data, and then define a reward function for movie recommendation from user ratings. Missing ratings are predicted using matrix factorization. This enables us to generate rewards from the graph. We explain the datasets, reward function and graph inference in the full version.
Results
Figure 2a shows the results for the Facebook graph and Figure 2b for the Flixster graph. Once again, the Thompson sampling strategies dominate all other strategies for the Facebook graph, and they are matched by the optimised $\epsilon_t$-greedy-D policy in the Flixster graph. We notice that in this setting the gap between the UCB policies and the rest is much larger, as is the overall regret of all policies. This can be attributed to the larger size of these graphs.
6 Conclusion
We have presented the first Thompson sampling algorithms for sequential decision problems with graph feedback, where we not only observe the reward of the arm we select, but also those of the neighbouring arms in the graph. Thus, the graph feedback allows us the flexibility to model different types of feedback information, from bandit feedback to expert feedback. Since the structure of the graph need not be known in advance, our algorithms are directly applicable to problems with changing and/or unknown topology. Our analysis leverages the information-theoretic construction of (?), by bounding the expected information gain in terms of fundamental graph properties. Although our problem-independent bound of is not directly comparable to (?), we believe that a problem-independent version of the latter should be , in which case our results would represent an improvement of .
In practice, our two variants always outperform UCB-N and UCB-MaxN, which also use graph feedback but rely on upper confidence bounds. They also compare favourably with $\epsilon_t$-greedy-D, even when we tune the parameters of the latter post hoc.
It would be interesting to extend our techniques to other types of feedback. For example, the Bayesian foundations of Thompson sampling render our algorithms applicable to arbitrary dependencies between arms. In future work, we will analytically and experimentally consider such problems and related applications. Finally, an open question is the existence of information-theoretic lower bounds in settings with partial feedback.
Acknowledgments.
This research was partially supported by the SNSF grants “Adaptive control with approximate Bayesian computation and differential privacy” (IZK0Z2_167522), “Swiss Sense Synergy” (CRSII2_154458), by the People Programme (Marie Curie Actions) of the European Union’s Seventh Framework Programme (FP7/2007-2013) under REA grant agreement number 608743, and the Future of Life Institute.
References
- [Agrawal and Goyal 2012] Agrawal, S., and Goyal, N. 2012. Analysis of Thompson sampling for the multi-armed bandit problem. In COLT 2012.
- [Alon et al. 2015] Alon, N.; Cesa-Bianchi, N.; Dekel, O.; and Koren, T. 2015. Online learning with feedback graphs: beyond bandits. In Proceedings of the 28th Annual Conference on Learning Theory, 23–35.
- [Alon, Matias, and Szegedy 1996] Alon, N.; Matias, Y.; and Szegedy, M. 1996. The space complexity of approximating the frequency moments. In 28th STOC, 20–29. ACM.
- [Auer, Cesa-Bianchi, and Fischer 2002a] Auer, P.; Cesa-Bianchi, N.; and Fischer, P. 2002a. Finite time analysis of the multiarmed bandit problem. Machine Learning 47(2/3):235–256.
- [Auer, Cesa-Bianchi, and Fischer 2002b] Auer, P.; Cesa-Bianchi, N.; and Fischer, P. 2002b. Finite-time analysis of the multiarmed bandit problem. Machine Learning 47(2-3):235–256.
- [Buccapatnam, Eryilmaz, and Shroff 2014] Buccapatnam, S.; Eryilmaz, A.; and Shroff, N. B. 2014. Stochastic bandits with side observations on networks. ACM SIGMETRICS Performance Evaluation Review 42(1):289–300.
- [Burnetas and Katehakis 1997] Burnetas, A. N., and Katehakis, M. N. 1997. Optimal adaptive policies for Markov decision processes. Mathematics of Operations Research 22(1):222–255.
- [Caron et al. 2012] Caron, S.; Kveton, B.; Lelarge, M.; and Bhagat, S. 2012. Leveraging side observations in stochastic bandits. UAI.
