Fiduciary Bandits
Gal Bahar, Omer Ben-Porat, Kevin Leyton-Brown, Moshe Tennenholtz

TL;DR
This paper introduces a model for recommendation systems that balances exploration and exploitation while respecting users' self-interest, proposing an optimal, incentive-compatible algorithm that ensures users are never worse off by following recommendations.
Contribution
It presents the first asymptotically optimal, incentive-compatible, and individually rational recommendation algorithm under a fiduciary constraint.
Findings
The algorithm guarantees users are never worse off by following recommendations.
It achieves asymptotic optimality in exploration-exploitation tradeoff.
The model ensures recommendations align with users' self-interest.
Abstract
Recommendation systems often face exploration-exploitation tradeoffs: the system can only learn about the desirability of new options by recommending them to some user. Such systems can thus be modeled as multi-armed bandit settings; however, users are self-interested and cannot be made to follow recommendations. We ask whether exploration can nevertheless be performed in a way that scrupulously respects agents' interests---i.e., by a system that acts as a fiduciary. More formally, we introduce a model in which a recommendation system faces an exploration-exploitation tradeoff under the constraint that it can never recommend any action that it knows yields lower reward in expectation than an agent would achieve if it acted alone. Our main contribution is a positive result: an asymptotically optimal, incentive compatible, and ex-ante individually rational recommendation algorithm.
Gal Bahar, Technion - Israel Institute of Technology
Omer Ben-Porat, Technion - Israel Institute of Technology (corresponding author)
Kevin Leyton-Brown, University of British Columbia, Canada
Moshe Tennenholtz, Technion - Israel Institute of Technology
1 Introduction
Multi-armed bandits (henceforth MABs) [9, 10] is a well-studied problem domain in online learning. In that setting, several arms (i.e., actions) are available to a planner; each arm is associated with an unknown reward distribution, from which rewards are sampled independently each time the arm is pulled. The planner selects arms sequentially, aiming to maximize her sum of rewards. This often involves a tradeoff between exploiting arms that have been observed to yield good rewards and exploring arms that could yield even higher rewards. Many variations of this model exist, including stochastic [1, 21], Bayesian [2, 11], contextual [13, 29], adversarial [3] and non-stationary [8, 23] bandits.
This paper considers a setting motivated by recommender systems. Such systems suggest actions to agents based on a set of current beliefs and assess agents’ experiences to update these beliefs. For instance, in navigation applications (e.g., Waze; Google maps) the system recommends routes to drivers based on beliefs about current traffic congestion. The planner’s objective is to minimize users’ average travel time. The system cannot be sure of the congestion on a road segment that no agents have recently traversed; thus, navigation systems offer the best known route most of the time and explore occasionally. Of course, users are not eager to perform such exploration; they are self-interested in the sense that they care more about minimizing their own travel times than they do about conducting surveillance about traffic conditions for the system.
A recent line of work [22, 26], inspired by the viewpoint of algorithmic mechanism design [27, 28], deals with that challenge by incentivizing exploration, that is, setting up the system in such a way that no user would ever rationally choose to decline an action that was recommended to him. The key reason that it is possible to achieve this property while still performing a sufficient amount of exploration is that the planner has more information than the agents. At each point in time, each agent holds beliefs about the arms’ reward distributions; the planner has the same information, but also knows about all of the arms previously pulled and the rewards obtained in each case. More specifically, Kremer et al. [22] consider a restricted setting and devise an MAB algorithm that is incentive compatible (IC), meaning that whenever the algorithm recommends arm $a_i$ to an agent, the best response of the agent is to select arm $a_i$.
Although this approach explicitly reasons about agents’ incentives, it does not treat agents fairly: agents who are asked to explore receive lower expected rewards. More precisely, in their attempt to reach optimality (in the static setting) or minimize regret (in the online setting), these IC MAB algorithms intentionally provide (a priori) sub-optimal recommendations to some of the agents. In particular, some of the agents could be better off by not using the system and instead following their default arm: the a priori superior arm, which would be every agent’s rational choice in the absence both of knowledge of other agents’ experiences and of a trusted recommendation. Thus, it would be natural for agents to see the recommendations of such IC MAB algorithms as a betrayal of trust; they might ask “why should I trust a recommender that occasionally gives out recommendations it has every reason to believe could make me worse off?”
In this work, we explicitly suggest that a social welfare maximization standpoint might raise societal issues, harming the trust agents put in recommender systems. The central premise of this paper is that explore-and-exploit AI systems should satisfy individual guarantees: guarantees that the system should fulfill for each agent independently of the other agents and their recommendations. At one end of the spectrum are current MAB algorithms, which are successful in maximizing welfare but do not offer the slightest individual guarantee. At the other end is the fiduciary duty: borrowed from law, it requires that the mechanism act in the interest of its clients with all its knowledge. This is the strictest, and strongest, individual guarantee the system could provide. However, if we applied this standard, we would be left only with the mechanism that greedily picks the apparently best arm in each iteration. In some settings, perhaps this is the best that can be achieved; however, note that this mechanism is rarely able to learn anything. It is therefore natural to ask for an approach that enjoys the best of both worlds: maximizing welfare while satisfying individual guarantees.
Our contribution
We explore a novel compromise between these two extreme points, which we call ex-ante individual rationality (EAIR). To motivate it, we consider the benchmark reward of each agent to be that of the default arm: the reward agents would get if the recommender system is unavailable. A mechanism is EAIR if the reward of every recommendation it makes beats that benchmark in expectation, per the mechanism’s knowledge. More technically, a mechanism is EAIR if any probability distribution over arms that it selects has expected reward that is always at least as great as the reward of the default arm, both calculated based on the mechanism’s knowledge (which is more extensive than that of agents). While it is possible for the mechanism to sample a recommendation from a distribution that is a priori inferior to the (realization of the) default arm, the agent receiving the recommendation is nevertheless guaranteed to realize expected reward weakly greater than that offered by the default arm. Satisfying this requirement makes a MAB algorithm more appealing to agents; we foresee that in some domains, such a requirement might be imposed as fairness constraints by authorities.
Algorithmically, we focus on constructing optimal EAIR mechanisms. Our model is a bandit model with $K$ arms and $n$ agents (rounds). Similarly to Kremer et al. [22], we assume that rewards are fixed but initially unknown.
We consider two agent schemes. In the first part of the paper, we assume that agents follow recommendations, as in the classical MAB literature. This is the case if, e.g., agents are oblivious to some of the actions’ desirability, unaware of the entire set of alternatives, or if the cognitive overload of computing expectations is high. The main technical contribution of this paper is an EAIR mechanism, which obtains the highest possible social welfare by any EAIR mechanism up to an additive factor of . Due to our static setting (rewards are realized only once), following the wrong exploration policy for even one agent has detrimental effect on social welfare. The optimality of our mechanism, which we term Fiduciary Explore & Exploit (FEE) and outline as Algorithm 1, follows from a careful construction of the exploration phase. Our analysis uses an intrinsic property of the setting, which is further elaborated in Theorem 1.
Later on, in Section 4, we adopt a different agent scheme, which is fully aligned with the incentivizing exploration literature. We assume that agents are strategic and have (the same) Bayesian prior over the rewards of the arms. In this context, a mechanism is incentive compatible (IC) if each agent’s expected reward is maximized by the recommended action. We provide a positive result in this challenging case as well. Our second technical contribution is Incentive Compatible Fiduciary Explore & Exploit (IC-FEE), which uses FEE as a black box, and is IC, EAIR and asymptotically optimal.
To complement this analysis, we also propose the more demanding concept of ex-post individual rationality (EPIR). The EPIR condition requires that a recommended arm must never be a priori inferior to the default arm given the planner’s knowledge. The EAIR and EPIR requirements differ in the guarantees that they provide to agents and correspondingly allow the system different degrees of freedom in performing exploration. We design an asymptotically optimal IC and EPIR mechanism. Finally, we analyze the social welfare cost of adopting either EAIR or EPIR mechanisms.
Related work
Background on MABs can be found in Cesa-Bianchi & Lugosi [10] and a recent survey [9]. Although many works treat MAB rounds as interacting agents, Kremer et al. [22] is the first work of which we are aware that suggests that vanilla algorithms should be modified to deal with agents due to human nature and incentives. The authors considered two deterministic arms, a prior known both to the agents and the planner, and an arrival order that is common knowledge among all agents, and presented an optimal IC mechanism. Cohen & Mansour [14] extended this optimality result to several arms under further assumptions. This setting has also been extended to regret minimization [26], social networks [4, 5], and heterogeneous agents [12, 19]. All of this literature disallows paying agents; monetary incentives for exploration are discussed in, e.g., [12, 16]. None of this work considers the orthogonal, societal consideration of an individual rationality constraint as we do here.
Our work also contributes to the growing body of work on fairness in Machine Learning [7, 15, 18, 24]. In the context of MABs, some recent work focuses on fairness in the sense of treating arms fairly. In particular, Liu et al. [25] aim at treating similar arms similarly and Joseph et al. [20] demand that a worse arm is never favored over a better one despite a learning algorithm’s uncertainty over the true payoffs. Finally, we note that the EAIR requirement we impose—that agents be guaranteed an expected reward at least as high as that offered by a default arm—is also related to the burgeoning field of safe reinforcement learning [17].
2 Model
Let $A = \{a_1, \dots, a_K\}$ be a set of $K$ arms (actions). Rewards are deterministic but initially unknown: the reward of arm $a_i$ is a random variable $X_i$, and $X_1, \dots, X_K$ are mutually independent. We denote by $R_i$ the observed value of $X_i$. To clarify, rewards are realized only once; hence, once $R_i$ is observed, $X_i = R_i$ for the rest of the execution. Further, we denote by $\mu_i$ the expected value of $X_i$, and assume for notational convenience that $\mu_1 \ge \mu_2 \ge \dots \ge \mu_K$. We also make the simplifying assumption that the rewards are fully supported on the set $\{0, 1, \dots, H\}$, and refer to the continuous case in Section 6.
There are $n$ agents, who arrive sequentially. We denote by $a^l$ the action of the agent arriving at stage $l$. The reward of the agent arriving at stage $l$ is denoted by $R^l$, and is a function of the arm she chooses. For instance, by selecting arm $a_i$ the agent obtains $R^l(a_i) = X_i$. Agents are fully aware of the distribution of $X_1, \dots, X_K$. Each and every agent cares about her own reward, which she wants to maximize.
A mechanism is a recommendation engine that interacts with agents. The input for the mechanism at stage $l$ is the sequence of arms pulled and rewards received by the previous $l-1$ agents. The output of the mechanism is a recommended arm for agent $l$. Formally, a mechanism is a function $M$ that maps every history $h$ to a distribution $M(h) \in \Delta(A)$ over arms; of course, we can also define a deterministic notion that maps histories simply to $A$. The mechanism has a global objective, which is to maximize agents’ social welfare, i.e., the agents’ aggregate reward.
We consider two agent schemes. The first is non-strategic agents, i.e., agents always follow the recommendation. An underlying assumption of classical MAB algorithms, such behavior could be explicit in case the mechanism makes decisions for the agents; or implicit, e.g., agents are unaware of the entire set of alternatives or their desirability, or high cognitive overload is required to compute it. The second agent scheme is strategic agents: the mechanism makes action recommendations, but cannot compel agents to follow these recommendations. In this scheme, we say that a mechanism is incentive compatible (IC) when following its recommendations is a dominant strategy: that is, when given a recommendation, an agent’s best response is to follow her own recommendation. Formally,
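Definition 1 (Incentive Compatibility).
A mechanism $M$ is incentive compatible (IC) if, for every agent $l \in \{1, \dots, n\}$, for every history $h$ and for all actions $a_r, a_i \in A$,
$$\mathbb{E}\left[ R^l(a_r) - R^l(a_i) \mid M(h) = a_r \right] \ge 0. \tag{1}$$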
Unless stated otherwise, we address the non-strategic agents scheme. We handle the other agent scheme in Section 4.
When agents follow the mechanism, we can represent the mechanism’s (expected) social welfare by
$$SW_n(M) = \mathbb{E}\left[ \frac{1}{n} \sum_{l=1}^{n} R^l \right], \tag{2}$$
where $R^l$ is the reward agent $l$ receives. Notice that $R^l$ depends on the randomness of the rewards and, possibly, the randomness of $M$.
Denote the highest possible social welfare under non-strategic agents by OPT. A mechanism is said to be optimal if . A mechanism is asymptotically optimal if, for every “large enough” number of agents greater than some number , it holds that . This definition of approximation is equivalent to sub-linear regret in the MAB literature.
2.1 Individual Guarantees
An individual guarantee is a guarantee that a mechanism can provide to the agents it interacts with, independently of the other agents. In this subsection, we present our main conceptual contribution: a meaningful individual guarantee that allows exploration.
To put our guarantee in the right context, we first present the strictest and the strongest guarantee that could be provided. A mechanism is a delegate if it acts as the agent would have acted had it shared its information with her. Formally, a mechanism $M$ is a delegate if for every agent $l$, every history $h$ and every distribution $p$ over $A$, it holds that $\sum_{a_i \in A} \Pr_{M(h)}(a_i)\, \mathbb{E}[X_i \mid h] \ge \sum_{a_i \in A} p(a_i)\, \mathbb{E}[X_i \mid h]$. Indeed, this definition provides the strongest individual guarantee. It characterizes the greedy mechanism, Greedy, which exploits in every round (according to the information it has). Notably, Greedy performs little exploration, and is therefore likely to yield low social welfare. While sometimes relaxing this strong guarantee is impossible (e.g., banking or health-care), in many situations the planner is willing to relax individual guarantees to favor better social welfare.
The other extreme is to adopt a policy that we term full-exploration. The full-exploration mechanism first explores all arms sequentially, and then exploits the best arm. Clearly, at least for the non-strategic agent scheme, full-exploration is optimal when the number of agents is large enough. Nevertheless, with very high probability, it picks sub-optimal arms for the first $K$ agents, which can be a highly undesirable property.
Our guarantee builds on the popular economic concept of individual rationality. To introduce it, we propose the following thought experiment. Assume that agents have to make decisions without the mechanism. The agents know that $\mu_1 \ge \mu_2 \ge \dots \ge \mu_K$; hence, we shall assume that every agent’s default action is $a_1$. (Footnote 1: As will become apparent later, if agents have different default arms the social welfare can only increase, since more arms could be explored.) The default action is the action each agent would have selected if she did not use the mechanism. We compare the two options: picking the default arm or following the mechanism’s action. If a mechanism guarantees that the latter yields weakly higher reward in expectation according to its knowledge, agents are better off using the mechanism. As a result, an individually rational mechanism should guarantee each agent at least the reward obtained by her default action. The next definition relies on this reasoning.
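Definition 2 (Ex-Ante Individual Rationality).
A mechanism $M$ is ex-ante individually rational (EAIR) if for every agent $l \in \{1, \dots, n\}$, and every history $h$,
$$\sum_{a_i \in A} \Pr_{M(h)}(a_i)\, \mathbb{E}[X_i \mid h] \ge \mathbb{E}[X_1 \mid h]. \tag{3}$$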
The EAIR definition is conditioned on histories, i.e., the mechanism’s knowledge. The right-hand side is what an agent would get, given the knowledge of the mechanism, if she follows the default arm (which is optimal according to her knowledge). The left-hand side is the expected value (over lotteries selected by the mechanism and reward distribution) guaranteed by the mechanism. Due to the mutual independence assumption, we must have $\mathbb{E}[X_i \mid h] = R_i$ if arm $a_i$ was observed under the history $h$ and $\mathbb{E}[X_i \mid h] = \mu_i$ otherwise. An EAIR mechanism must select a portfolio of arms with expected reward never inferior to the reward of the default arm $a_1$.
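The EAIR check described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the representation of a history as a dictionary of observed rewards, and the indexing of the default arm as arm 0 are all assumptions made for this example.

```python
def conditional_mean(arm, prior_means, observed):
    """Expected reward of `arm` given the history.

    By mutual independence, this is the realized reward if the arm was
    already observed, and the prior mean otherwise.
    """
    return observed.get(arm, prior_means[arm])


def is_eair(portfolio, prior_means, observed, default_arm=0, tol=1e-12):
    """Check the EAIR inequality for a single recommendation distribution.

    portfolio   -- dict mapping arm index -> probability (hypothetical format)
    prior_means -- list of prior expected rewards, one per arm
    observed    -- dict mapping arm index -> realized reward (the history)
    """
    lhs = sum(p * conditional_mean(a, prior_means, observed)
              for a, p in portfolio.items())
    rhs = conditional_mean(default_arm, prior_means, observed)
    return lhs >= rhs - tol


# Hypothetical instance: three arms with prior means 2.0, 1.5 and 1.0.
means = [2.0, 1.5, 1.0]
print(is_eair({0: 1.0}, means, {}))                # default arm, empty history
print(is_eair({1: 0.5, 2: 0.5}, means, {0: 1.2}))  # mixed portfolio
print(is_eair({2: 1.0}, means, {0: 1.2}))          # inferior arm on its own
```

In the last two calls, the arm with prior mean 1.0 fails the check on its own once the default arm is observed to yield 1.2, but it can still be reached through a portfolio whose overall expectation clears that benchmark.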
Example.
We now give an example to illustrate our setting and to familiarize the reader with our notation. Consider arms, and ; thus , , and . As always, is the default arm. To satisfy EAIR, a mechanism should recommend to the first agent, since EAIR requires that the expected value of any recommendation should weakly exceed . Let be the history after the first agent. Now, we have three different cases. First, if , we know that and ; therefore, an EAIR mechanism can never explore any other arm, since any distribution over would violate Inequality (3). Second, if , then and , and hence an EAIR mechanism can explore both and .
The third and most interesting case is where , as when . In this case, arm could only be recommended through a portfolio. An EAIR mechanism could select any distribution over that satisfies Inequality (3): any such that . This means that an EAIR mechanism can potentially explore arm , yielding higher expected social welfare overall than simply recommending a non-inferior arm deterministically.
3 Asymptotically Optimal EAIR Mechanism
In this section, we consider the case of non-strategic agents. We present the main technical contribution of this paper: a mechanism that asymptotically optimally balances the explore-exploit tradeoff while satisfying the EAIR property. The mechanism, which we term Fiduciary Explore & Exploit (FEE), is described as Algorithm 1. FEE is an event-based protocol that triggers every time an agent arrives. We now give an overview of FEE, focusing on the case where all agents adopt the recommendation of the mechanism (we treat the other case in Section 4). We explain the algorithm’s exploration phase in Subsection 3.1, describe the overall algorithm in Subsection 3.2, and prove the algorithm’s formal guarantees in Subsection 3.3. We provide a comprehensive example of the way FEE operates in Section F.
FEE is composed of three phases: primary exploration (Lines 1–6), secondary exploration (Lines 7–18), and exploitation (Line 19). During the primary exploration phase, the mechanism compares the default arm to whichever other arms are permitted by the individual rationality constraint. This turns out to be challenging for two reasons. First, the order in which arms are explored matters; tackling them in the wrong order can reduce the set of arms that can be considered overall. Second, it is nontrivial to search in the continuous space of probability distributions over arms. To address this latter issue, we present a key lemma that allows us to use dynamic programming and find the optimal exploration policy in time . Because we expect either to be fixed or to be significantly smaller than , this policy is computationally efficient. Moreover, we note that the optimal exploration policy can be computed offline prior to the agents’ arrival.
The primary exploration phase terminates in one of two scenarios: either the reward of arm is the best that was observed and thus no other arm could be explored (as in our example when , or when and exploring yielded and thus could not be explored), or another arm was found to be superior to : i.e., an arm was observed for which . In the latter case, the mechanism gains the option of conducting a secondary exploration, using arm to investigate all the arms that were not explored in the primary exploration phase. The third and final phase—to which we must proceed directly after the primary exploration phase if that phase does not identify an arm superior to the default arm—is to exploit the most rewarding arm observed.
Remark. In this section we assume that agents are non-strategic and follow the mechanism’s recommendation.
3.1 Primary Exploration Phase
Performing primary exploration optimally requires solving a planning problem; it is a challenging one, because it involves a continuous action space and a number of states exponential in $K$ and $H$. We approach this task as a Goal Markov Decision Process (GMDP) (see, e.g., [6]) that abstracts everything but pure exploration. In our GMDP encoding, all terminal states fall into one of two categories. The first category is histories that lead to pure exploitation of $a_1$, which can arise either because EAIR permits no arm to be explored or because all explored arms yield rewards inferior to the observed $R_1$; the second is histories in which an arm superior to $a_1$ was found. Non-terminal states thus represent histories in which it is still permissible for some arms to be explored. The set of actions in each non-terminal state is the set of distributions over the non-observed arms (i.e., portfolios) corresponding to the history represented in that state, which satisfy the EAIR condition. The transition probabilities encode the probability of choosing each candidate arm from a portfolio; observe that the rewards of each arm are fixed, so this is not a source of additional randomness in our model. GMDP rewards are given in terminal nodes only: either the observed $R_1$ if no superior arm was found, or the expected value of the maximum between the superior reward discovered and the maximal reward of all unobserved arms (since in this case, as we show later on, the mechanism is able to explore all arms w.h.p. during the secondary exploration phase).
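Formally, the GMDP is a tuple $\langle S, \mathcal{A}, P, \mathcal{R} \rangle$, where
- $S$ is a finite set of states. Each state $s$ is a pair $(O, U)$, where $O \subseteq A \times \{0, 1, \dots, H\}$ is the set of arm-reward pairs that have been observed so far, with each arm appearing at most once in $O$ (since rewards from the arms are deterministic): for every arm $a$ and every $c \ne c'$, $|\{(a, c), (a, c')\} \cap O| \le 1$. $U \subseteq A$ is the set of arms not yet explored. The initial state is thus $s_0 = (\emptyset, A)$. For every non-empty set of pairs $O$ we define $\alpha(O)$ to be the reward observed for arm $a_1$, and $\beta(O) = \max_{(a, c) \in O} c$ to be the maximal reward observed. (Footnote 2: Due to the construction, every non-empty $O$ must contain $(a_1, c)$ for some $c$.)
- $\mathcal{A} = \bigcup_{s \in S} A_s$ is an infinite set of actions. For each $s = (O, U)$, $A_s$ is defined as follows:
  1. If $s = s_0$, then $A_s$ contains only the distribution that selects $a_1$ with probability 1: i.e., a deterministic selection of $a_1$.
  2. Else, if $\beta(O) > \alpha(O)$, then $A_s = \emptyset$. This condition implies that we can move to secondary exploration.
  3. Otherwise, $A_s$ is a subset of $\Delta(U)$, such that $p \in A_s$ if and only if
  $$\sum_{a_i \in U} p(a_i)\, \mu_i \ge \alpha(O).$$
  Notice that this resembles the EAIR condition given in Inequality (3). Moreover, the case where none of the remaining arms have strong enough priors to allow exploration falls here as a vacuous case of the above inequality.
  We denote by $S_T$ the set of terminal states, namely $S_T = \{ s \in S : A_s = \emptyset \}$.
- $P$ is the transition probability function. Let $s = (O, U)$, and let $s' = (O', U')$ such that $O' = O \cup \{(a_i, c)\}$ and $U' = U \setminus \{a_i\}$ for some $a_i \in U$ and $c \in \{0, 1, \dots, H\}$. Then, the transition probability from $s$ to $s'$ given an action $p \in A_s$ is defined by $P(s' \mid s, p) = p(a_i) \Pr(X_i = c)$. If $s'$ is some other state that does not meet the conditions above, then let $P(s' \mid s, p) = 0$ for every $p \in A_s$.
- $\mathcal{R}$ is the reward function, defined on terminal states only. For each terminal state $s = (O, U) \in S_T$,
$$\mathcal{R}(s) = \begin{cases} \alpha(O) & \text{if } \alpha(O) = \beta(O), \\ \mathbb{E}\left[ \max\left\{ \beta(O), \max_{a_i \in U} X_i \right\} \right] & \text{if } \beta(O) > \alpha(O). \end{cases}$$
That is, when $a_1$ was the highest-reward arm observed, the reward of a terminal state is $\alpha(O)$; otherwise, it is the expectation of the maximum between $\beta(O)$ and the highest reward of all unobserved arms. The reward depends on unobserved arms since the secondary exploration phase allows us to explore all these arms; hence, their values are also taken into account.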
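A policy $\pi$ is a function from all GMDP histories (sequences of states and actions) and a current state to an action. A policy $\pi$ is valid if for every history $h$ and every non-terminal state $s$, $\pi(h, s) \in A_s$. A policy $\pi$ is stationary if for every two histories $h, h'$ and a state $s$, $\pi(h, s) = \pi(h', s)$. When discussing a stationary policy, we thus neglect its dependency on $h$, writing $\pi(s)$.
Given a policy $\pi$ and a state $s$, we denote by $W(\pi, s)$ the expected reward of $\pi$ when initialized from $s$, which is defined recursively from the terminal states (suppressing the dependence on the history):
$$W(\pi, s) = \begin{cases} \mathcal{R}(s) & \text{if } s \in S_T, \\ \sum_{s' \in S} P(s' \mid s, \pi(s))\, W(\pi, s') & \text{otherwise.} \end{cases}$$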
We now turn to our technical results. The following lemma shows that we can safely focus on stationary policies that effectively operate on a significantly reduced state space.
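Lemma 1.
For every policy $\pi$ there exists a stationary policy $\pi'$ such that (1) $\pi'(s) = \pi'(s')$ for every pair of states $s = (O, U)$ and $s' = (O', U')$ with $U = U'$, $\alpha(O) = \alpha(O')$ and $\beta(O) = \beta(O')$; and (2) for every state $s$, $W(\pi', s) \ge W(\pi, s)$.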
Lemma 1 tells us that there exists an optimal, stationary policy that selects the same action in every pair of states that share the same unobserved set and values and , but are distinguished in the component. Thus, we do not need a set of states whose size depends on the number of possible arm-reward observation histories: all we need to record is and a real value for either and , reducing the number of states to .
We still have one more challenge to overcome: the set of actions available in each state is infinite. Despite that is a convex polytope and thus we can apply Linear Programming, our approach is much more computationally efficient and interpretable. We prove that there exists an optimal “simple” policy, which we denote . Given two indices , we denote by (for ) and by (for ) the distributions over such that
[TABLE]
and if and only if . When is clear from context, we omit it from the superscript.
We are now ready to describe the policy , which we later prove to be optimal. For the initial state , . For every non-terminal state with , such that maximize
[TABLE]
The optimality of $\pi^*$ follows from a property that is formally proven in Theorem 1: any policy $\pi$ that satisfies the conditions of Lemma 1 can be presented as a mixture of policies that solely take actions of the form $p_{ij}$. As a result, we can improve $\pi$ by taking the best such policy from that mixture. We derive $\pi^*$ via dynamic programming, where the base cases are the set of terminal states. For any other state $s$, $\pi^*(s)$ is the best action of the form $p_{ij}$ as defined above, considering all states that are reachable from $s$. While any policy can be encoded as a weighted sum over such “simple” policies, $\pi^*$ is the best one, and hence is optimal.
Theorem 1.
For every valid policy $\pi$ and every state $s$, it holds that $W(\pi^*, s) \ge W(\pi, s)$.
Since our compressed state representation consists of states, the computation of in each stage requires us to consider candidate actions, each of which involves summation of at most summands; thus, can be computed in time.
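The dynamic program over the compressed state space can be sketched as follows. This is an illustration under stated assumptions rather than the paper's Algorithm 1: states are encoded as (unexplored arms, observed reward of the default arm, maximal observed reward), arm 0 plays the role of the default arm, priors are given as hypothetical value-to-probability dictionaries, and the two-arm action is taken to be the mixture whose prior mean exactly equals the default arm's observed reward. The terminal reward enumerates joint outcomes by brute force, so the sketch is only meant for tiny instances.

```python
from functools import lru_cache
from itertools import product


def optimal_exploration_value(priors):
    """Value of the initial state when only one- and two-arm portfolios are used.

    priors[i] maps reward value -> probability for arm i; arm 0 is assumed
    to be the default arm (highest prior mean).
    """
    means = [sum(c * p for c, p in d.items()) for d in priors]

    def expected_max(beta, unexplored):
        # E[max(beta, max of unexplored arms)] for independent discrete rewards.
        total = 0.0
        for combo in product(*(priors[i].items() for i in sorted(unexplored))):
            prob, best = 1.0, beta
            for c, p in combo:
                prob *= p
                best = max(best, c)
            total += prob * best
        return total

    @lru_cache(maxsize=None)
    def value(unexplored, alpha, beta):
        if beta > alpha:
            # Terminal: a superior arm was found, secondary exploration follows.
            return expected_max(beta, unexplored)
        candidates = []
        for i in unexplored:
            if means[i] >= alpha:
                candidates.append({i: 1.0})          # explore arm i outright
            for j in unexplored:
                if means[i] > alpha > means[j]:
                    # Assumed form: mixture of i and j with mean exactly alpha.
                    w = (alpha - means[j]) / (means[i] - means[j])
                    candidates.append({i: w, j: 1.0 - w})
        if not candidates:
            # Terminal: no portfolio over unexplored arms satisfies EAIR.
            return alpha

        def action_value(portfolio):
            return sum(w * p * value(unexplored - {i}, alpha, max(beta, c))
                       for i, w in portfolio.items()
                       for c, p in priors[i].items())

        return max(action_value(a) for a in candidates)

    others = frozenset(range(1, len(priors)))
    # The first agent always receives the default arm.
    return sum(p * value(others, c, c) for c, p in priors[0].items())
```

Memoizing on the triple rather than on full observation histories mirrors the state-space reduction that Lemma 1 licenses; the actual policy would additionally record the maximizing portfolio in each state.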
3.2 Intuitive Description of FEE
We now present the FEE algorithm, stated formally as Algorithm 1. The primary exploration phase (Lines 1–6) is based on the GMDP from the previous subsection. It is composed of computing and then producing recommendations according to its actions, each of which defines a distribution over (at most) two actions. Let denote the terminal state reached by (the primary exploration selects a fresh arm in each stage; hence such a state is reached after at most agents).
We then enter the secondary exploration phase. If then this phase is vacuous: no distribution over the unobserved arms can satisfy the EAIR condition and/or all the observed arms are inferior to arm . On the other hand, if (Line 7), we found an arm with a reward superior to , and can use it to explore all the remaining arms. For every , the mechanism operates as follows. If the probability of yielding a reward greater than is zero, we neglect it (Lines 11–13). Else, if , we recommend . This is manifested in the second condition in Line 15. Otherwise, . In this case, we select a distribution over that satisfies the EAIR condition and explore with the maximal possible probability, which is . As we show formally in the proof of Lemma 2, the probability of exploring in this case is at least , implying that after tries in expectation the algorithm would succeed to explore .
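The "maximal possible probability" in the last case follows directly from the EAIR inequality. A minimal sketch, with hypothetical names: if the default arm yielded `r_default`, a superior arm yielded `r_superior`, and an unexplored arm has prior mean `mu_unexplored` below `r_default`, then mixing the superior arm with weight $1-q$ and the unexplored arm with weight $q$ satisfies EAIR exactly when $(1-q)\, r_{\text{superior}} + q\, \mu \ge r_{\text{default}}$.

```python
def max_exploration_probability(r_default, r_superior, mu_unexplored):
    """Largest probability q of recommending the unexplored arm under EAIR.

    Solves (1 - q) * r_superior + q * mu_unexplored >= r_default for q.
    All argument names are illustrative, not the paper's notation.
    """
    if r_superior <= r_default:
        raise ValueError("requires an observed arm superior to the default arm")
    if mu_unexplored >= r_default:
        # The unexplored arm clears the benchmark on its own.
        return 1.0
    return (r_superior - r_default) / (r_superior - mu_unexplored)


q = max_exploration_probability(r_default=3, r_superior=5, mu_unexplored=1)
# The resulting portfolio meets the benchmark with equality.
print(q, (1 - q) * 5 + q * 1)
```

If rewards are integers bounded by some $H$ (an assumption of this sketch), the numerator is at least 1 and the denominator at most $H$, so each attempt explores the arm with probability at least $1/H$.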
Ultimately (Line 19), FEE recommends the best observed arm to all the remaining agents.
3.3 Algorithmic Guarantees
We begin by arguing that FEE is indeed EAIR.
Proposition 1.
FEE satisfies the EAIR condition.
The proof of Proposition 1 is highly intuitive: the reward of every recommendation FEE makes always exceed in expectation. We now move on to consider the social welfare of FEE. Let denote the highest welfare attained by any EAIR mechanism. First, we show that the expected value of at , denoted by , upper bounds the social welfare of any EAIR mechanism.
Theorem 2.
It holds that .
The proof proceeds by contradiction: given an EAIR mechanism , we construct a series of progressively-easier-to-analyze EAIR mechanisms with non-decreasing social welfare; we modify the final mechanism by granting it oracular capabilities, making it violate the EAIR property and yet preserving reducibility to a policy for the GMDP of Subsection 3.1. We then argue via the optimality of that the oracle mechanism cannot obtain a social welfare greater than . Next, we lower bound the social welfare of FEE.
Lemma 2.
The proof relies mainly on an argument that the primary and secondary explorations will not be too long on average: after agents the mechanism is likely to begin exploiting. Noting that the lower bound of Lemma 2 asymptotically approaches the upper bound of Theorem 2, we conclude that FEE is asymptotically optimal.
4 Incentive Compatibility
In this section, we consider the second and more challenging agent scheme: strategic agents. Our main goal is to show that FEE, which we developed in Section 3.3, can be modified to satisfy IC as well. (Footnote 3: For simplicity, we formulated IC-FEE to satisfy IC in the best response sense: given that all other agents follow their recommendations, it is an agent’s best response to adopt the recommendation as well. However, IC-FEE can be easily modified to offer dominant strategies to agents.) We remark that there are cases in which an IC mechanism cannot explore all arms, regardless of individual rationality constraints. To illustrate, assume that $\Pr(X_1 \ge \mu_2) = 1$, i.e., the reward of arm $a_1$ is always greater than or equal to the expected reward of arm $a_2$. In this case, no agent will ever follow a recommendation for arm $a_2$. Consequently, we shall make the following standard assumption (see, e.g., [26]).
Assumption 1.
For every such that , it holds that .
If Assumption 1 does not hold for some pair , arm would never be explored; hence, we can remove such arms from . We shall also make the simplifying assumption that , as otherwise the problem becomes easier to solve.
Among other factors, the expectation in Inequality (1) is taken over agents’ information on the arrival order. At one extreme, the arrival order could be uniform, i.e., each agent is entirely oblivious about her “place in line.” In this case, as we show in Section E, FEE satisfies IC as is, assuming that there are sufficiently many agents. At the other extreme, which is the more popular in prior work [22, 26], agents have complete information about their rounds. Namely, the agent arriving at time knows that she is the ’th agent. The complete information case is the more demanding one, and an IC mechanism for this case will also be IC under any distributional assumption on the arrival order. Nevertheless, as we demonstrate shortly, it requires more technical work.
We build on the techniques of Mansour et al. [26] and use phases: each phase contains one round of exploration (that is, following FEE) and the other rounds are either exploitation via Greedy (defined in Subsection 2.1) or recommendation of arm . An IC version of FEE, which we term IC-FEE, is outlined as Algorithm 2.
IC-FEE works as follows. It initializes an instance of FEE, uses it rarely in the earlier rounds, and regularly afterward (every time IC-FEE makes a recommendation, it updates FEE). In Line 2, it recommends to the first agent. Recall that employed by FEE is only allowed to pick w.p. 1 in the first round; hence, FEE and IC-FEE coincide on the first recommendation. Then, depending on the value of , it makes recommendations to agents either greedily (maximizing the reward in each round, Line 3) or recommends arm (Line 4). Later, in Line 5, it splits the remaining rounds into phases of size ( will be determined later on). In each such phase , we first ask whether FEE is exploring or exploiting (Line 7). If FEE exploits (Line 19 in Algorithm 1), every agent of every phase from here on will be recommended by FEE. If that is not the case (see the else block starting at Line 9), IC-FEE picks one agent from the agents of this phase uniformly at random, denoted . Then, agent gets the recommendation from FEE. The recommendation policy for the rest of the agents in this phase depends on the observed arms. If IC-FEE already discovered an arm with (Line 11), we let agent exploit using Greedy. Otherwise (Line 12), IC-FEE recommends .
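The phase structure described above can be sketched in Python. This is only an illustrative sketch, not the authors' Algorithm 2: the function name `ic_fee_sources`, the parameter `warmup_len` (the number of agents served by Lines 3 and 4), and the two callbacks that report the state of the inner FEE instance are all hypothetical, and the real mechanism updates FEE after every recommendation rather than once per phase.

```python
import random


def ic_fee_sources(n_agents, warmup_len, phase_len,
                   fee_is_exploiting, better_arm_found, rng=None):
    """Map each agent (1-indexed) to the rule that produces her recommendation.

    warmup_len, fee_is_exploiting and better_arm_found are hypothetical
    stand-ins: the first is the number of agents served by Lines 3-4, the
    other two are callbacks (phase index -> bool) reporting the state of
    the inner FEE instance. phase_len must be at least 1.
    """
    rng = rng or random.Random(0)
    sources = {1: "a1"}  # Line 2: the first agent gets the default arm
    for agent in range(2, min(1 + warmup_len, n_agents) + 1):
        sources[agent] = "greedy-or-a1"  # Lines 3-4, decided by the first reward
    start, phase, exploiting = 2 + warmup_len, 0, False
    while start <= n_agents:
        agents = list(range(start, min(start + phase_len, n_agents + 1)))
        exploiting = exploiting or fee_is_exploiting(phase)
        if exploiting:
            # Once FEE exploits, every later agent is served by FEE.
            for agent in agents:
                sources[agent] = "fee"
        else:
            explorer = rng.choice(agents)  # one uniformly random agent explores
            fallback = "greedy" if better_arm_found(phase) else "a1"
            for agent in agents:
                sources[agent] = "fee" if agent == explorer else fallback
        start += phase_len
        phase += 1
    return sources
```

For example, with 12 agents, a warm-up of 3 agents, and phases of length 4, a FEE instance that starts exploiting in the second phase yields exactly one FEE-driven agent among agents 5 to 8 and FEE-driven recommendations for agents 9 to 12.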
Lines 11 and 12 are also where our mechanism departs from the principles of prior work. For example, in the work of Mansour et al. [26], each phase contains one round of exploration and the rest are exploitation rounds. In our setting, agents that are not exploring might still not exploit. The distinction between Lines 11 and 12 is crucial: exploiting unobserved arms might lead to sub-optimal welfare, since they are the chance to explore arms with expected reward below . We elaborate more in the proof of Theorem 3.
To determine the phase length , we introduce the following quantities and . Due to Assumption 1, there exist and such that for all , it holds that . In words, it says that the reward of every arm is greater than all other arms by at least , w.p. of at least . The following Theorem 3 summarizes the properties of IC-FEE.
Theorem 3.
Let the phase length be . Under Assumption 1, IC-FEE satisfies EAIR and IC. In addition,
5 Further Analysis
Notice that EAIR mechanisms guarantee each agent the value of the default arm, but only in expectation. We now propose a more strict form of individual rationality, ex-post individual rationality (EPIR).
Definition 3 (Ex-Post Individual Rationality).
A mechanism is ex-post individually rational (EPIR) if for every agent , every history , and every arm such that , it holds that
Satisfying EPIR means that the mechanism never recommends an arm that is a priori inferior to arm given the mechanism’s knowledge. It is immediate to see that every EPIR mechanism is also EAIR. EPIR mechanisms are quite conservative, since they can only explore arms that yield expected rewards of at least the value obtained for . We develop an optimal IC/EPIR mechanism in Section D.1.
5.1 Social Welfare Analysis
We now analyze the loss in social welfare due to individual rationality constraints. For simplicity, we consider the case of non-strategic agents. Recall that OPT is the highest possible social welfare, and is its counterpart after imposing EAIR. In addition, let and denote the best asymptotic social welfare (w.r.t. some instance and infinitely many agents) achievable by an EPIR mechanism and a delegate mechanism, respectively. Noticeably, for every instance , it holds that . In the rest of this subsection, we analyze the ratio of two subsequent optimal welfares. We begin by showing that individual guarantees can deteriorate welfare even for the most flexible notion, EAIR.
Proposition 2.
For every , there exists an instance with .
Proposition 2 shows that when and have the same magnitude, the ratio is on the order of , meaning that EAIR mechanisms perform poorly when a large number of different reward values are possible. However, this result describes the worst case; it turns out that optimal EAIR mechanisms have a constant ratio under some reward distributions. For example, as we show in Proposition 7, this ratio is at most if for every and is only slightly better a-priori.
Next, we consider the cost of adopting the stricter EPIR condition rather than EAIR. As Proposition 3 shows, by providing a more strict fiduciary guarantee the social welfare may be harmed by a factor of .
Proposition 3.
For every , there exists an instance with .
Finally, we show that the EPIR guarantee still allows us to significantly improve upon .
Proposition 4.
For every , there exists an instance with .
6 Conclusions and Discussion
This paper introduces a model in which a recommender system must manage an exploration-exploitation tradeoff under the constraint that it may never knowingly make a recommendation that will yield lower reward than any individual agent would achieve if he/she acted without relying on the system.
We see considerable scope for follow-up work. First, from a technical point of view, our algorithmic results are limited to discrete reward distributions. One possible future direction would be to present an algorithm for the continuous case. More conceptually, we see natural extensions of EPIR and EAIR to stochastic settings, either by assuming a prior and requiring the conditions w.r.t. the posterior distribution or by requiring the conditions to hold with high probability. Moreover, we are intrigued by non-stationary settings—where e.g., rewards follow a Markov process—since the planner would be able to sample a priori inferior arms with high probability assuming the rewards change fast enough, thereby reducing regret.
Acknowledgements
We thank the participants of the Computational Data Science seminar at Technion – Israel Institute of Technology and the participants of Young Researcher Workshop on Economics and Computation for their comments and suggestions. Additionally, we thank ICML 2020 anonymous reviewers who provided comments that improved the manuscript. The work of G. Bahar, O. Ben-Porat and M. Tennenholtz is funded by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 740435). The work of K. Leyton-Brown is funded by the NSERC Discovery Grants program, DND/NSERC Discovery Grant Supplement, Facebook Research and Canada CIFAR AI Chair Amii. Part of this work was done while K. Leyton-Brown was a visiting researcher at Technion – Israel Institute of Technology and was partially funded by the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 740435).
Appendix A Omitted Proofs from Subsection 3.1
Proof of Lemma 1.
The proof follows from Propositions 5 and 6 below. ∎
Proposition 5.
For every non-stationary policy , there exists a stationary policy such that for every state , .
Moreover, the following Proposition 6 implies that we can substantially reduce the state space by disregarding the observed part and
Proposition 6.
For every stationary policy there exists a stationary policy such that:
1. for every pair of states with and .
2. for every state s, .
Proof of Proposition 5.
Fix an arbitrary non-stationary policy . We prove the claim by iterating over all states in an increasing order of the number of elements in . We use induction to show that the constructed indeed satisfies the assertion.
For every such that , i.e., . If is terminal, then and . Otherwise, the unique element in is the action that assigns probability 1 to , and by setting we get .
Assume that the assertion holds for every ; namely, that for all with . We now prove the assertion for with . If is a terminal state, then we are done. Else, since and the support of each arm are finite, there exists a finite number of possible histories that lead from to that we will mark as . For every possible history , assigns an action that (can) depend on the history . Let
[TABLE]
breaking ties arbitrarily. We set . Hence we get:
[TABLE]
hence, . This concludes the proof. ∎
Proof of Proposition 6.
The proof is similar to the proof of Proposition 5, and is given for completeness. Fix an arbitrary stationary policy . We prove the claim by iterating over all states in an increasing order of the number of elements in . We use induction to show that the constructed indeed satisfies the assertion.
For every such that , i.e., , if is terminal then . Otherwise, the unique element in is the action that assigns probability 1 to ; hence, by setting we get .
Assume the assertion holds for every ; namely, that for all with . Next, we prove the assertion for with . If is a terminal state, then we are done. Else, since the size of and the support of each arm are finite, there exists only a finite number of states with the same and , which we mark as . For every state , assigns an action . Let
[TABLE]
breaking ties arbitrarily. Next, set . We have that
[TABLE]
∎
Proof of Theorem 1.
Fix an arbitrary policy . We prove the claim by iterating over all states in an increasing order of the number of elements of . We use induction to show that the constructed indeed satisfies the assertion. For convenience, we restate elaborately in Algorithm 3.
For every such that , the claim holds trivially. To see this, recall that if is terminal, ; otherwise, the unique element in is the action that assigns probability 1 to the sole element in . Either way, .
Assume the assertion holds for every ; namely, that for all with . If is a terminal state, then we are done. Else, we shall make use of the following claim, which shows that every action in can be viewed as a weighted sum over the elements of .
Claim 1.
For any and , there exist coefficients such that
- •
,
- •
, and
- •
.
The proof of the claim appears below this proof. In particular, Claim 1 suggests that , which is valid and thus w.p. 1, can be presented as a weighted sum over all pairs . Finally,
[TABLE]
where the last equality follows since and by the definition of given in Equation (7). To sum, the constructed satisfies for every state . ∎
Proof of Claim 1.
To ease readability, we shall use the notation and in this proof. Let be an arbitrary state and be an arbitrary action. Notice that could be described as
[TABLE]
where and for every such that . We now describe a procedure that shifts mass from the set to , while still satisfying the equality in Equation (8). Each time we apply this procedure we decrease the value of one or more elements from and increase one or more elements from by the same quantity. As a result, when it converges (assuming that it does), namely when , we are guaranteed that all the conditions of the claim hold. Importantly, throughout the course of this procedure, the following inequalities hold
[TABLE]
[TABLE]
For the initial set of Equations (9)-(10) trivially hold due to the way we initialize and since implies that
[TABLE]
In each step of the procedure, we use the prime notation to denote the coefficients at the end of that step. The procedure operates as follows:
- If for every , the claim holds.
- Else, if for every such that , , then for every with set and set . Notice that after this change Equations (8)–(10) still hold.
- There exists with and . Consequently, since Equation (9) holds, there must exist such that and . We divide the analysis into three sub-cases, depending on the relation between and .
  1. : we replace , and with , and such that , and . Clearly, after this modification the new coefficients are non-negative. To show that Equation (8) still holds, we need to show that can be decomposed using the new coefficients. Notice that
[TABLE]
Similarly,
[TABLE]
As a result, Equation (8) holds. As for Equation (9), observe that
[TABLE]
hence, Equation (9) holds. Finally, while all other coefficients are left unchanged; thus Equation (10) holds as well.
  2. : the analysis is similar to the previous case and hence omitted.
  3. : the analysis is similar to the first case and hence omitted.
This concludes the proof. ∎
Appendix B Omitted Proofs from Subsection 3.3
Proof of Proposition 1.
We need to show that Inequality (3) holds for every history . Since FEE operates in phases, it would be convenient to divide the arguments into these three phases, according to which phase belongs.
- Exploration phase: the recommendation is based on the action of , the optimal policy of the GMDP in Subsection 3.1. If is the empty history, then it is translated to , and selects w.p. 1. Otherwise, due to Equation (4) the action space of the GMDP is restricted to distributions over the unobserved arms with expectation greater than or equal to the observed value . As a result, in both cases Inequality (3) holds.
- Experience phase: in this phase, FEE is a distribution over two arms, and , with greater than the obtained value of arm . Further, with positive probability, or otherwise arm would have been discarded (Lines 11–13). If, in addition, , then the If statement in Line 15 would select arm with probability 1, satisfying Inequality (3). On the other hand, if , then FEE selects arm w.p. , and with the remaining probability (Lines 14–19); hence the expected value of FEE is
[TABLE]
which is greater or equal to .
- Exploit phase: in this phase FEE is a deterministic selection of one arm, the most rewarding one. Since the value of arm , was observed before (as mentioned for the exploration phase), the arm selected in Line 19, satisfies .
∎
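To make the experience-phase argument concrete, the following sketch computes a mixing probability for which the expected reward of the recommendation is at least the observed value of the default arm. The exact probability used by FEE is given in Algorithm 1 and is not reproduced here; the code assumes it is the largest probability that keeps the mixture at or above the default arm's value. The names `mu_i` (prior mean of the unobserved arm), `x_r` (observed reward of the better arm), and `x_1` (observed reward of the default arm) are illustrative.

```python
from fractions import Fraction


def max_explore_probability(mu_i, x_r, x_1):
    """Largest p in [0, 1] with p * mu_i + (1 - p) * x_r >= x_1.

    Assumes x_r > x_1, i.e., an arm better than the default was observed.
    """
    if not x_r > x_1:
        raise ValueError("x_r must exceed x_1")
    if mu_i >= x_1:
        # The unobserved arm already meets the constraint on its own.
        return Fraction(1)
    # Solve p * mu_i + (1 - p) * x_r = x_1 for p; here mu_i < x_1 < x_r.
    return Fraction(x_r - x_1) / Fraction(x_r - mu_i)


def mixture_value(p, mu_i, x_r):
    """Expected reward of exploring w.p. p and exploiting x_r otherwise."""
    return p * mu_i + (1 - p) * x_r
```

For instance, with `mu_i = 1`, `x_r = 4`, and `x_1 = 2`, the probability is 2/3 and the mixture's expected reward equals 2, so the agent is not worse off in expectation than with the default arm.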
B.1 Optimality
Proof of Theorem 2.
To facilitate the proof, we introduce the following definitions: given a mechanism and a history , we say that is fruitless w.r.t. if gives a positive probability to at least one observed arm , , with , i.e., reward that is at most (notice that it implies that and were observed). In addition, we say that a history is auspicious if an action with reward greater than that of is observed under .
We are ready to begin the proof. Let be an arbitrary mechanism, and for the sake of the proof fix the number of agents, and only consider histories of length at most . The proof contains three steps. In Step 1 we slightly modify , resulting in a new mechanism that attains a social welfare at least as high as that of , and is still EAIR. In Step 2, we modify to use an oracle whenever it reaches an auspicious history. As we show, the resulting mechanism, has an improved social welfare, . Finally, in Step 3 we show that the social welfare of is at most .
Step 1:
In this step we construct a modification of with at least the same social welfare, which is not fruitless on any history . We define a mechanism that receives as a black box and uses it for recommendations. is defined as follows:
1. Let be the empty history. Act as and update accordingly.
2. While the length of is less than :
   - 2.1 Draw . If the reward of was already observed and , recommend and set . Else, act as and update accordingly.
It is straightforward to see that satisfies the EAIR condition, and that .
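The construction in Step 1 can be sketched as a wrapper around a black-box mechanism. This is an illustration under stated assumptions rather than the construction verbatim: `mechanism` is a hypothetical callable that maps a history to a drawn arm, `pull` returns the realized reward of an arm (assumed fixed once realized), and a draw is treated as fruitless when it hits an already observed arm whose reward is at most that of the default arm.

```python
def defruit(mechanism, pull, n_agents, default_arm="a1"):
    """Run `mechanism` as a black box, replacing fruitless draws by the default arm."""
    observed = {}  # arm -> realized reward
    history = []   # list of (arm, reward) pairs shown to the black box
    while len(history) < n_agents:
        arm = mechanism(history)
        fruitless = (
            arm != default_arm
            and arm in observed
            and default_arm in observed
            and observed[arm] <= observed[default_arm]
        )
        if fruitless:
            # Swapping in the default arm can only weakly raise the reward.
            arm = default_arm
        reward = pull(arm)
        observed[arm] = reward
        history.append((arm, reward))
    return history
```

Because every swap replaces a known reward by one that is at least as large, the wrapped mechanism's welfare on any realized instance is at least that of the run it shadows, which mirrors the claim made for the modified mechanism.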
Step 2:
In this step, we present a non-feasible mediator that modifies the way operates on auspicious histories. uses an oracle that hints the best arm.
More concretely, is defined as follows:
1. Let be the empty history. Act as and update accordingly.
2. While is not auspicious:
   - 2.1 Act as and update accordingly.
3. If is auspicious:
   - 3.1 Use an oracle to reveal the best arm, . From here on, recommend to all users.
Notice that is EAIR for every non-auspicious history, but not EAIR in general; for this reason, it is not feasible. Moreover, it holds that .
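A minimal sketch of the oracle-assisted mediator from Step 2 follows. The names are hypothetical: `mechanism` maps a history to an arm, and `rewards` is the full table of realized rewards, which only the oracle could know. The sketch also assumes the wrapped mechanism recommends the default arm first, so comparing against the default arm's reward is meaningful.

```python
def run_with_oracle(mechanism, rewards, n_agents, default_arm="a1"):
    """Follow `mechanism` until the history is auspicious, then play the best arm."""
    best_arm = max(rewards, key=rewards.get)  # what the oracle would reveal
    history = []
    auspicious = False
    while len(history) < n_agents:
        arm = best_arm if auspicious else mechanism(history)
        history.append((arm, rewards[arm]))
        # A history is auspicious once some observed reward beats the default arm.
        auspicious = auspicious or rewards[arm] > rewards[default_arm]
    return history
```

Once the switch happens, every remaining agent receives the maximal reward, which is why this infeasible mediator can only improve on the mechanism it wraps.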
Step 3:
The final step is to claim that the resulting mechanism cannot get more than the optimal value of the GMDP in Section 3. However, the GMDP does not allow selecting , so we need some minor modifications.
This step is structured as follows. First, we formally define a modified version of the GMDP presented in Section 3, with minor modifications. We call the new GMDP Repeated GMDP, or R-GMDP for short, to distinguish between the two. Then, we show that the best achievable value in the R-GMDP is exactly . The final step is mapping obtained in Step 2 to a non-stationary strategy in the R-GMDP, which achieves at least as much as the social welfare of , that is . The claim then follows since the policy constructed using cannot obtain more than .
Consider the following R-GMDP (footnote 4: the crucial difference between R-GMDP and GMDP is in the action space and the transition probabilities, colored in red for readability):
- is a finite set of states. Each state is a pair , where is the set of arm–reward pairs that have been observed so far. is the set of arms not yet explored. The initial state is thus . For every non-empty set of pairs we define to be the reward observed for arm (that can be obtained several times, as we explain shortly), and to be the maximal reward observed.
- is an infinite set of actions. For each , is defined as follows:
  1. If , then : i.e., a deterministic selection of .
  2. Else, if , then .
  3. Otherwise, is a subset of , such that if and only if
[TABLE]
We denote by the set of terminal states, namely .
- is the transition probability function. Let , and let such that and for some $a_{i}\in U\cup\{a_{1}\}$, $c\in[H]^{+}$.
Then, the transition probability from to given an action is defined by
[TABLE]
If is some other state that does not meet the conditions above, then let for every .
- is the reward function, defined on terminal states only. For each terminal state ,
[TABLE]
Next, we prove that there exists an optimal policy for the R-GMDP with a significantly reduced support.
Lemma 3.
For every policy for R-GMDP, there exists a stationary policy such that
1. for every pair of states and with and .
2. For every state s, .
The proof of the lemma is identical to the proof of Lemma 1 and hence omitted. Lemma 3 suggests that we can focus on strategies that distinguish between states based on and solely. The reduced state space does allow self-loops by selecting , without having any effect on the reward. It is thus straightforward to see that an optimal strategy that ignores exists, with a reward of exactly .
Notice that defines a non-stationary policy for the R-GMDP, by mimicking the actions (distributions) selects. When gets to an auspicious history or cannot explore anymore, the policy gets to a terminal state and obtains a reward. Each time directs an agent, that agent gets at most the maximal reward discovered; hence, is less than or equal to the reward obtained by that non-stationary policy , which is at most .
This completes the proof of the theorem. ∎
Proof of Lemma 2.
Let denote the r.v. representing the number of agents in the explore and experience phases, respectively. Notice that the definition of social welfare given in Equation (2) can be interpreted as
[TABLE]
Observe that every agent in the explore and experience phases obtains the reward of arm in expectation. Moreover, every agent in the exploit phase obtains in expectation; hence, Equation (12) can be rearranged as
[TABLE]
To finalize the proof, recall that almost surely since there are arms that could be explored, and on every step in the exploration phase exactly one arm gets explored. Moreover, due to Observation 3 it holds that ; hence,
[TABLE]
∎
Appendix C Omitted Proofs from Section 4
Proof of Theorem 3.
It is immediate to see that IC-FEE is EAIR. Satisfying EAIR follows from mixing FEE, which is EAIR, with Greedy, which satisfies the delegate property and hence also the EAIR constraint, and recommendations of .
Moreover, IC-FEE is asymptotically optimal since, after finitely many agents, its recommendations will coincide with those of FEE, and FEE is asymptotically optimal. While IC-FEE is not exploiting (Line 7), its recommendations coincide with those of FEE at least once per phase. Since the expected exploration time of FEE is (see Lemma 2), IC-FEE explores for rounds in expectation.
Showing that IC-FEE satisfies IC is trickier. We divide the analysis into several parts:
- The first agent gets , which is the a-priori best action.
- Agents either get recommendations from Greedy (Line 3) or are recommended (Line 4). In the former, agents get the best arm known to the mechanism. In the latter, the only new information agents could learn is that ; thus, for every it holds that
[TABLE]
Agents cannot know if they are being recommended by Line 3 or Line 4, but in both cases they are better off with accepting the recommendation; hence, IC holds for agents .
- Agents , and the recommended arm is . This case might seem trivial at first glance, but it is not so innocent. IC-FEE can recommend via Lines 8 and 12. In both cases, we know that is the best among all the explored arms. Nevertheless, there could still be unexplored arms with an expected value greater than . One such scenario is when revealed by the first agent yielded . In this case, IC-FEE recommends agent , assuming that she was not selected to be the exploring agent, arm . Nevertheless, according to IC-FEE’s information at that point, arm is the best arm. Recommending greedily might prevent the mechanism from exploring more arms using the mixture FEE employs, which leads to sub-optimal social welfare.
Nevertheless, we will show that IC holds in this case as well. Fix an agent and some phase , and assume IC-FEE recommended agent arm . Let denote the event indicating that arms were observed just before agent arrives, and for every . Clearly, agent does not know whether occurs or not, but she can compute the occurrence probability. We have that
[TABLE]
In addition,
[TABLE]
This inequality follows immediately if . Otherwise, if , due to the i.i.d. assumption, could only increase conditioning on ; hence,
[TABLE]
We conclude that agents follow IC-FEE when it recommends .
- Agents , and the recommended arm is . Fix an agent and some phase , and assume IC-FEE recommended agent arm . We need to show that for every , it holds that . Due to Assumption 1, there exists such that
[TABLE]
In words, Assumption 1 guarantees that with positive probability , all arms but have a reward that is less than by at least . Denote this event by . If occurs, we are guaranteed that arm will be explored in Line 3. Moreover, denote by the event that agent is the agent selected by IC-FEE to explore in Line 10. We have that
[TABLE]
and the latter is non-negative if .
Overall, we showed that every agent is better off by accepting IC-FEE’s recommendation; hence, IC-FEE is IC. ∎
Appendix D Omitted Proofs and Claims from Section 5
D.1 Ex-Post Individual Rationality
Notice that EAIR mechanisms guarantee each agent the value of the default arm, but only in expectation. We now propose a more strict form of individual rationality, ex-post individual rationality (EPIR).
Definition 4 (Ex-Post Individual Rationality).
A mechanism is ex-post individually rational (EPIR) if for every agent , every value in the support of , every history , and every arm such that , it holds that
Satisfying EPIR means that the mechanism never recommends an arm that is a priori inferior to arm . Noticeably, every EPIR mechanism is also EAIR, yet EPIR mechanisms are quite conservative, since they can only explore arms that yield expected rewards of at least the value obtained for .
An optimal EPIR mechanism is immediate in case of non-strategic agents; we denote by this intuitive mechanism. First, explore arm , and observe . Then, remove all arms with , and name the obtained set . Then, proceed with full-exploration on .
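The mechanism just described can be sketched as follows for non-strategic agents. This is an illustrative sketch, not the authors' pseudocode: the function name, the `means` dictionary of prior expected rewards, and the `pull` callable are assumptions, and the pruning rule removes arms whose prior mean is below the observed reward of the default arm, following the statement that EPIR mechanisms may only explore arms with expected reward at least that value.

```python
def epir_full_exploration(means, pull, n_agents, default_arm="a1"):
    """Return the list of recommended arms, one per agent."""
    x1 = pull(default_arm)          # the first agent explores the default arm
    recommendations = [default_arm]
    observed = {default_arm: x1}
    # Keep only arms whose prior mean is at least the observed default reward.
    candidates = [a for a in means if a != default_arm and means[a] >= x1]
    for arm in candidates:          # full exploration on the surviving arms
        if len(recommendations) == n_agents:
            break
        observed[arm] = pull(arm)
        recommendations.append(arm)
    best = max(observed, key=observed.get)  # exploit the best observed arm
    recommendations += [best] * (n_agents - len(recommendations))
    return recommendations
```

Note that an arm with a low prior mean is never tried, even if its realized reward happens to be the highest; this is the conservatism of EPIR that the welfare comparison in Subsection 5.1 quantifies.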
For the case of strategic agents, is not enough: agents might be reluctant to explore arms with a-priori low rewards. We propose IC-EP-FEE, which is an asymptotically optimal IC and EPIR mechanism. IC-EP-FEE relies on the same technique we use in Section 4 and is outlined in Algorithm 4.
Theorem 4.
Let the phase length be . Under Assumption 1, IC-EP-FEE satisfies EPIR and IC. In addition,
The proof of Theorem 4 is similar to that of Theorem 3, and is hence omitted.
D.2 Omitted Proofs from Subsection 5.1
Proof of Proposition 2.
Let be such that , and for every such that let
[TABLE]
for a small positive constant . Clearly, while for ; hence, . On the other hand,
[TABLE]
Taking to zero, we get that OPT is arbitrarily close to
[TABLE]
Finally, we use the fact that whenever . By setting and , we conclude that ; therefore,
[TABLE]
∎
Proof of Proposition 3.
Let such that
[TABLE]
It holds that
[TABLE]
On the other hand,
[TABLE]
Taking to zero, we get
[TABLE]
∎
Proof of Proposition 4.
Let such that
[TABLE]
For . It holds that . On the other hand,
[TABLE]
thus, . ∎
Proposition 7.
Fix . Let , and let for arbitrarily small . It holds that .
Proof of Proposition 7.
Assume for simplicity that is even. First, by simple probability tricks one can show that
[TABLE]
Second, since for every , any EAIR mechanism must explore first. Notice that ; thus,
[TABLE]
By taking to zero and applying standard manipulations, we obtain
[TABLE]
This term attains for and is monotonically increasing for ; hence, the claim is proven by taking to infinity. ∎
Appendix E Incentive Compatible Mechanism for Strategic Agents and Uniform Arrival
In this section, we consider strategic agents and uniform arrival. Formally, we assume that the agents arrive in a random order, , where is selected uniformly at random from the set of all permutations. We show that FEE satisfies IC as is, assuming that there are sufficiently many agents. We introduce the following quantity . Let . In words, is the probability that arm is superior to all other arms. Clearly, if Assumption 1 holds, for every arm . In addition, let . Lemma 4 implies that if there are agents, then FEE is IC.
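The probability that a given arm is superior to all others can be computed exactly for small discrete instances by enumeration. The sketch below assumes independent arms and interprets "superior" as strictly larger than every other arm; both are assumptions made for illustration, as are the function name and the input format (one reward-to-probability dictionary per arm).

```python
from fractions import Fraction
from itertools import product


def prob_strictly_best(dists):
    """For each arm, the probability that its reward strictly exceeds all others.

    dists: list of dicts mapping reward -> probability, one dict per arm.
    Arms are assumed independent.
    """
    best = [Fraction(0)] * len(dists)
    for outcome in product(*(d.items() for d in dists)):
        values = [value for value, _ in outcome]
        weight = Fraction(1)
        for _, prob in outcome:
            weight *= prob
        top = max(values)
        if values.count(top) == 1:  # ties count for no arm
            best[values.index(top)] += weight
    return best
```

For two arms that are each 0 or 2 with equal probability, every arm is strictly best with probability 1/4, and the remaining mass corresponds to ties.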
Lemma 4.
Under Assumption 1 and uniform arrival, if , then FEE is IC.
Proof of Lemma 4.
To prove the statement, we need to show that whenever an agent is recommended arm , her best response is to select arm . We focus on an arbitrary agent, and present the analysis from her point of view. In addition, if , either she is the first agent to arrive at the system or no better arm was discovered, resulting in being a best response. Otherwise, . We define the following events: let be the event indicating that FEE recommends arm to the agent; indicates whether arm was recommended to some agent; and indicates whether is an optimal arm. All of these events are defined w.r.t. the distribution over histories and the agent arrival distribution. Due to the uniform arrival distribution, the probability of matches the proportion of agents who are recommended arm . We proceed by analyzing the odds of being recommended . Due to the definition of and the way FEE works when it observed a superior arm,
[TABLE]
Next, we present a lemma that gives a large deviation bound on the number of agents needed for the experience phase.
Lemma 5.
Let . The experience phase terminates after agents w.p. of at least .
The proof of Lemma 5 and other claims we use in this lemma appear just after the end of this proof. For simplicity, denote . Conditioning on , arm is either recommended exactly once (in case its reward is observed to be inferior to another arm during the execution), or several times. The latter can only happen if and arm is used by FEE to explore other, unobserved arms. In this case, Lemma 5 implies that would not happen more than times, w.h.p. As a result,
[TABLE]
Moreover,
Observation 1.
For every history such that occur, if FEE already reached the exploit phase (Line 19) under , then .
Due to Observation 1, we also conclude that
[TABLE]
We now analyze the ratio between the probability of arm being optimal and the probability that it is not, given . We have
[TABLE]
Applying the bounds from Equations (15), (16), and (17) to the ratio above, we get
[TABLE]
and by rearranging we obtain
[TABLE]
Next, we bound the expected difference between the reward of arm and that of an arbitrary arm , with . We have
[TABLE]
By plugging the bound obtained in Equation (19) into the expression above, we get
[TABLE]
Ultimately, since
Observation 2.
Let and . If , it holds that
[TABLE]
The proof is completed by combining Observation 2 with Equation (21) to show that for every arm .
∎
Proof of Lemma 5.
Let denote the number of agents receiving recommendations in the experience phase (Lines 16 and 18). The proof is based on two observations: first, we show that is first-order stochastically dominated by an easy-to-analyze random variable. Then, we use a concentration bound to complete the proof.
Observation 3.
For every ,
[TABLE]
Moreover, using Hoeffding’s inequality we have
Claim 2.
Let , , and let . It holds that
[TABLE]
By combining Observation 3 and Claim 2, we get
[TABLE]
This completes the proof of this lemma. ∎
Proof of Observation 1.
To see why Observation 1 holds, recall that if occurs, then FEE revealed . Moreover, reaching Line 19 suggests that the experience phase is over; therefore, the rewards of all arms are revealed. Finally, since holds, FEE will pick it with probability 1. ∎
Proof of Observation 2.
First, notice that and ; thus,
[TABLE]
It suffices to show that the right-hand side of Equation (22) is greater or equal to . Now,
[TABLE]
Inserting the values of and , we argue that the statement holds as long as
[TABLE]
To conclude the proof, recall that ; hence
[TABLE]
thus, Equation (24) holds. ∎
Proof of Claim 2.
First, observe that
[TABLE]
Next, notice that ; thus,
[TABLE]
By using the multiplicative version of the Chernoff Bound, we get that
[TABLE]
Recall that ; therefore,
[TABLE]
∎
Proof of Observation 3.
The exploration phase of FEE is based on . Once reaches a terminal state, there are two options:
- The terminal state exhibits . In this case, the condition of the If statement in Line 7 is false, and there is no need for experience. Consequently, w.p. 1 and the statement holds.
- The terminal state exhibits . In this case, FEE enters the While loop in Line 8. In each iteration of the While loop, either the size of decreases by 1 (Lines 12 and 16), or stays the same (Line 18). The statement in Line 18 will only execute if the arm selected in Line 10 satisfies , otherwise the If condition in Line 15 would execute; hence, the probability of executing Line 18 is bounded by
[TABLE]
This applies for every iteration of the While loop. Recall that there are at most arms needed to be explored, and hence the statement holds.
∎
E.1 The Full Exploration Mechanism
Proposition 8.
Under Assumption 1 and uniform arrival, full-exploration is IC and asymptotically optimal.
Proof of Proposition 8.
Asymptotic optimality is straightforward. The proof of being IC goes along the lines of Theorem 4 and is hence omitted. ∎
Appendix F Elaborated Example of FEE
In this section, we provide an elaborated example of the way FEE operates. Consider arms, and ; thus , , , and . As always, is the default arm. Let us assume for the sake of this example that , but these values are not known to the algorithm. We illustrate in Figure 1, obtained from a simple Python program.
Nodes with a square frame are associated with states of the GMDP. The leaves are terminal states, and the intermediate nodes are non-terminal. Blue circled nodes are auxiliary and separate the values that the newly observed arm can take. The outgoing edges from each non-terminal white node are the transition probabilities. For instance, in , the outgoing edges are and , indicating that the action taken in is .
The colored leaves represent terminal states. Green leaves are terminal states where the policy revealed an arm with a value greater than , i.e., (see Line 7 in FEE). Yellow leaves are terminal states in which reveals all the rewards, but those are less than or equal to . The red node refers to the terminal state in which were explored and were less than or equal to , and was not explored. Notice that and are associated with the same state, and since is stationary, their sub-trees are identical. Per our assumption on , the GMDP will reach one of the leaves in , depending on the coin flips. To illustrate, we assume that reached and explain the trajectory.
The root of the tree, , denotes the initial state . Due to the construction of the optimal policy , it will always explore in the first round (level 0 of the tree in Figure 1); thus, FEE recommends the first agent , and observes that (recall we assume the rewards are according to ). The GMDP then transitions to . At , picks . FEE then draws coins (Line 4), which are realized with (since we assume the leaf was realized eventually), and selects for the second agent. The value of is then observed, and the GMDP moves to . picks , FEE draws coins (Line 4), which are realized with , and selects for the third agent. The value of is realized, and the GMDP reaches , which is a terminal state. FEE exits the while loop in Line 3. FEE then enters the if statement of Line 7, since it observes that . At this point, the set of unobserved arms is , and so FEE enters the while loop in Line 8. In Line 9, it sets , followed by setting in the subsequent line. Since there is a positive probability that , FEE skips the if block in Line 11.
Then, in Line 14, FEE draws . Since , the second condition of the if block in Line 15 does not hold; hence, the only way to enter the if block in Line 15 is by having . If , FEE moves to Line 18, recommends to the fourth agent, and starts another iteration of the while loop in Line 8. With probability 1, after finitely many agents, FEE will draw . Then, it will recommend in Line 16, and observe . In Line 17, becomes the empty set. FEE will then exit the while loop in Line 8, move to Line 19, and every subsequent agent will exploit—FEE will recommend from then on.
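The claim that exploration happens with probability 1 after finitely many agents can be made explicit under one simplifying assumption: each iteration of the while loop in Line 8 triggers exploration independently with the same probability $q > 0$. The symbol $q$ is introduced here only for illustration and does not appear in the paper. A short derivation under that assumption:

```latex
\Pr[\text{no exploration in the first } t \text{ iterations}]
  = (1-q)^{t} \xrightarrow[t \to \infty]{} 0,
\qquad
\mathbb{E}[\text{iterations until exploration}]
  = \sum_{t \ge 1} t \, q \, (1-q)^{t-1} = \frac{1}{q}.
```

The number of agents who receive the known arm before exploration occurs is therefore finite almost surely, with expectation $1/q - 1$.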
