A Dominant Strategy Truthful, Deterministic Multi-Armed Bandit Mechanism with Logarithmic Regret

Divya Padmanabhan; Satyanath Bhat; Prabuchandran K.J.; Shirish Shevade; and Y. Narahari

arXiv:1703.00632 · cs.GT · June 1, 2020

TL;DR

This paper introduces a deterministic multi-armed bandit mechanism, together with a new notion of $Δ$-regret, that achieves logarithmic $Δ$-regret in sponsored search auctions by exploiting the fact that the separation between agents' rewards typically does not depend on the time horizon.

Contribution

It proposes a new $Δ$-regret framework and a deterministic, incentive-compatible MAB mechanism that attains $O(\log T)$ $Δ$-regret, whereas existing deterministic truthful mechanisms necessarily incur $Ω(T^{2/3})$ regret.

Findings

(1) Achieves $Δ$-regret of $O(\log T)$ in sponsored search auctions.

(2) Extends the results from single-slot to multiple-slot auctions.

(3) Provides a deterministic, incentive-compatible mechanism.

Abstract

Stochastic multi-armed bandit (MAB) mechanisms are widely used in sponsored search auctions, crowdsourcing, online procurement, etc. Existing stochastic MAB mechanisms with a deterministic payment rule, proposed in the literature, necessarily suffer a regret of $Ω(T^{2/3})$, where $T$ is the number of time steps. This happens because the existing mechanisms consider the worst-case scenario where the means of the agents' stochastic rewards are separated by a very small amount that depends on $T$. We make and exploit the crucial observation that in most scenarios, the separation between the agents' rewards is rarely a function of $T$. Moreover, when the rewards of the arms are arbitrarily close, the regret contributed by such sub-optimal arms is minimal. Our idea is to allow the center to indicate the resolution, $Δ$, with which the agents must be distinguished.…

Tables

Table 1: Comparison of our results with the state of the art

| | Babaioff2014 | Our work |
|---|---|---|
| Loss studied | Regret | $Δ$-regret |
| Additional parameters | None | $Δ$: tolerance specified by the planner |
| Mechanism properties | DSIC, deterministic, exploration separated, $O(T^{2/3})$ exploration rounds | DSIC, deterministic, exploration separated, $O(\log T)$ exploration rounds |
| Upper bound on loss | $O(T^{2/3})$ | $O(\log T)$ |
| Lower bound on loss | $Ω(T^{2/3})$ | $Ω(\log T)$ |

Table 2: Notations for the single slot SSA setting

| Symbol | Description |
|---|---|
| $K$, $[K]$ | No. of agents and agent set |
| $μ_{i}$ | CTR of agent $i$ |
| $θ_{i}$ | Valuation of agent $i$ for each click |
| $W_{i}$ | Social welfare when agent $i$ is allocated |
| $ρ_{i}(t)$ | Click realization of agent $i$ at time $t$ |
| $θ_{max}$ | Maximum valuation over all agents = $\max_{i} θ_{i}$ |
| $b_{i}$ | Bid of agent $i$ |
| $b$ | Bid profile of all agents |
| $b_{-i}$ | Bid profile of all agents except agent $i$ |
| $N_{i,t}$ | No. of times agent $i$ has been selected till time $t$ |
| $\mathcal{A}(b, ρ, t)$ | Allocation at time $t$ for bid profile $b$ and click realization $ρ$ |
| $i_{*}$ | Agent with maximum social welfare. Ideally $i_{*}$ must be allocated at every time step |
| $W_{*}$ | Social welfare when agent $i_{*}$ is allocated |
| $Δ$ | Input parameter by center to indicate the level at which the agents must be distinguished |
| $S_{Δ}$ | Set of agents whose social welfare is less than $Δ$ away from $i_{*}$. These agents do not contribute to $Δ$-regret. |
| $\hat{μ}_{i,t}^{+}$ | UCB index corresponding to $μ_{i}$ at time $t$ |
| $\hat{μ}_{i,t}^{-}$ | LCB index corresponding to $μ_{i}$ at time $t$ |
| $\hat{μ}_{i,t}$ | Empirical CTR of agent $i$ estimated from samples up to time $t$ |
| $P_{i}^{t}$ | Payment charged to agent $i$ if he is allocated a slot at time $t$ and he gets a click |

Table 3: Additional notations for multi-slot SSA

| Symbol | Description |
|---|---|
| $M$ | No. of slots |
| $[M]$ | Set of $M$ slots = $\{1, \dots, M\}$ |
| $λ_{m}$ | Prominence (probability with which a user observes an ad at slot $m+1$ given that he has observed the ad at slot $m$) |
| $Γ_{m}$ | Probability that an ad at slot $m$ is observed |
| $W_{i,m}$ | Social welfare when agent $i$ is allocated slot $m$ |
| $M_{i,t}^{(m)}$ | No. of times agent $i$ has been allotted slot $m$ till time $t$ |
| $N_{i,t}$ | No. of times agent $i$ has been selected till time $t$ over all slots |
| $K^{(m)}$ | Optimal agent for slot $m$ |
| $W_{*,m}$ | Social welfare when agent $K^{(m)}$ is allocated slot $m$ |
| $S_{Δ,m}$ | Set of agents whose social welfare is less than $Δ$ away from $K^{(m)}$. These agents do not contribute to $Δ$-regret when allocated slot $m$. |

Equations

$$\Delta\text{-regret} = \sum_{t=1}^{T} (W_{*} - W_{I_{t}})\, \mathbbm{1}\left[I_{t} \in [K] \setminus S_{\Delta}\right]$$

$$u_{i}(b_{i}, b_{-i}, \rho, t; \theta_{i}) = \left(\theta_{i} - P_{i}^{t}(b, \rho)\right) \mathcal{A}_{i}(b_{i}, b_{-i}, \rho, t)\, \rho_{i}(t)$$

$$W_{i,t}^{+} = \hat{\mu}_{i,t}\theta_{i} + \epsilon_{i,t}\theta_{i} = \hat{\mu}_{i,t}\theta_{i} + \sqrt{\frac{2\theta_{i}^{2}\log T}{N_{i,t}}}$$

$$W_{i,t}^{-} = \hat{\mu}_{i,t}\theta_{i} - \epsilon_{i,t}\theta_{i} = \hat{\mu}_{i,t}\theta_{i} - \sqrt{\frac{2\theta_{i}^{2}\log T}{N_{i,t}}}$$

$$\Delta^{2} > \frac{8\theta_{max}^{2}\log T}{N_{i,t}} \geq \frac{8\theta_{i}^{2}\log T}{N_{i,t}} \geq 4\left[\frac{2\theta_{i}^{2}\log T}{N_{i,t}}\right]$$

$$\begin{aligned} P(G) &= 1 - P\Big(\bigcup_{t}\bigcup_{i} B_{i,t}\Big) \geq 1 - \sum_{t}\sum_{i\in[K]} P(B_{i,t}) \\ &\geq 1 - \sum_{t}\sum_{i\in[K]} 2T^{-4} \geq 1 - \frac{2}{T^{2}} \end{aligned}$$

$$W_{i_{*},t}^{+} \ldots$$

$$P(W_{i,t}^{+} > \ldots) \leq \frac{1}{2}P(B_{i,t}) + \frac{1}{2}P(B_{i_{*},t}) \leq 2T^{-4}$$

$$P(W_{i_{*},t}^{+} \ldots$$

$$\begin{aligned} \mathbb{E}[\Delta\text{-regret} \mid G] &= \mathbb{E}\Big[\sum_{t=1}^{T} (W_{*} - W_{I_{t}})\, \mathbbm{1}\left[I_{t} \in [K] \setminus S_{\Delta}\right] \,\Big|\, \forall t, \forall i\;\; W_{i} \in [W_{i,t}^{-}, W_{i,t}^{+}]\Big] \\ &= \mathbb{E}\Big[\sum_{t=1}^{T} (W_{*} - W_{I_{t}})\, \mathbbm{1}\left[I_{t} \in [K] \setminus S_{\Delta}\right] \,\Big|\, W_{I_{t}} \in [W_{I_{t},t}^{-}, W_{I_{t},t}^{+}]\Big] \\ &\leq \frac{8K\theta_{max}^{3}\log T}{\Delta^{2}} \end{aligned}$$

$$\mathbb{E}[\Delta\text{-regret} \mid G^{c}] \leq T\theta_{max}$$

$$\begin{aligned} \mathbb{E}[\Delta\text{-regret}] &\leq \frac{8K\theta_{max}^{3}\log T}{\Delta^{2}} \cdot 1 + T\theta_{max} \cdot \frac{2}{T^{2}} \\ &\leq \frac{8K\theta_{max}^{3}\log T}{\Delta^{2}} + 2 \end{aligned}$$

$$\liminf_{T\to\infty} \frac{\mathbb{E}[\Delta\text{-regret}]}{\log T} \geq \sum_{i \notin S_{\Delta}} \frac{\Delta_{i}}{\mathrm{kl}(\mu_{i}, \mu^{*} + \Delta)}$$

$$\mathrm{kl}(\mu_{2}, \mu_{2}') \leq (1+\epsilon)\, \mathrm{kl}(\mu_{2}, \mu_{1} + \Delta)$$

$$\widetilde{\mathrm{kl}}_{s} = \sum_{t=1}^{s} \log \frac{\mu_{2}\rho_{2}^{t} + (1-\mu_{2})(1-\rho_{2}^{t})}{\mu_{2}'\rho_{2}^{t} + (1-\mu_{2}')(1-\rho_{2}^{t})}$$

$$C_{T} = \mathbbm{1}\left\{N_{2,T} < \frac{(1-\epsilon)\log T}{\mathrm{kl}(\mu_{2}, \mu_{2}')} \text{ and } \widetilde{\mathrm{kl}}_{N_{2,T}} \leq (1-\epsilon/2)\log T\right\}$$

$$P_{\mu_{2}'}(C_{T} = 1) = \mathbb{E}_{\mu_{2}}\left[C_{T}\exp\left(-\widetilde{\mathrm{kl}}_{N_{2,T}}\right)\right] \geq \exp\left(-(1-\epsilon/2)\log T\right) \times P_{\mu_{2}}(C_{T} = 1)$$

$$P_{\mu_{2}}(C_{T} = 1) \leq T^{1-\epsilon/2}\, \frac{\mathbb{E}_{\mu_{2}'}[T - N_{2,T}]}{T - f_{T}} \to 0$$

$$P_{\mu_{2}}(C_{T} = 1) \geq P_{\mu_{2}}\left(N_{2,T} < f_{T} \text{ and } \frac{\mathrm{kl}(\mu_{2}, \mu_{2}')}{(1-\epsilon)\log T} \max_{s \leq f_{T}} \widetilde{\mathrm{kl}}_{s} \leq \frac{\mathrm{kl}(\mu_{2}, \mu_{2}')}{(1-\epsilon)}(1-\epsilon/2)\right)$$

$$\lim_{T\to\infty} P_{\mu_{2}}\left(\frac{\mathrm{kl}(\mu_{2}, \mu_{2}')}{(1-\epsilon)\log T} \max_{s \leq f_{T}} \widetilde{\mathrm{kl}}_{s} \leq \frac{\mathrm{kl}(\mu_{2}, \mu_{2}')}{(1-\epsilon)}(1-\epsilon/2)\right) = 1$$

$$\mathbb{E}_{\mu_{2}}[N_{2,T}] \geq P_{\mu_{2}}(N_{2,T} \geq f_{T})\, f_{T} = \frac{(1-\epsilon)\log T}{\mathrm{kl}(\mu_{2}, \mu_{2}')} \geq \frac{1-\epsilon}{1+\epsilon}\, \frac{\log T}{\mathrm{kl}(\mu_{2}, \mu_{1} + \Delta)}$$

$$\Gamma_{m} = \begin{cases} 1 & \text{if } m = 1 \\ \prod_{s=1}^{m-1} \lambda_{s} & \text{if } 2 \leq m \leq M \\ 0 & \text{if } m > M \end{cases}$$

$$W_{i,m} = \Gamma_{m}\mu_{i}\theta_{i}$$

$$S_{\Delta,m} = \{i \in [K] : W_{K^{(m)},m} - W_{i,m} < \Delta\}$$

$$\Delta\text{-regret} = \sum_{t=1}^{T}\sum_{m=1}^{M} (W_{*,m} - W_{I_{t,m},m})\, \mathbbm{1}\left[I_{t,m} \in [K] \setminus S_{\Delta,m}\right]$$

Taxonomy

Topics: Advanced Bandit Algorithms Research · Auction Theory and Applications · Optimization and Search Problems

Full text

† Singapore University of Technology and Design, Singapore

∗ National University of Singapore

‡ Indian Institute of Science (IISc), Bangalore

Dominant Strategy Truthful, Deterministic Multi-armed Bandit Mechanisms with Logarithmic Regret for Sponsored Search Auctions

Divya Padmanabhan†

Satyanath Bhat∗

Prabuchandran K. J.‡

Shirish Shevade‡

Y. Narahari‡

(Submitted: December 2018, Revised: May 2020)

Abstract

Stochastic multi-armed bandit (MAB) mechanisms are widely used in sponsored search auctions, crowdsourcing, online procurement, etc. Existing stochastic MAB mechanisms with a deterministic payment rule, proposed in the literature, necessarily suffer a regret of $\Omega(T^{2/3})$, where $T$ is the number of time steps. This happens because the existing mechanisms consider the worst-case scenario where the means of the agents’ stochastic rewards are separated by a very small amount that depends on $T$. We make and exploit the crucial observation that in most scenarios, the separation between the agents’ rewards is rarely a function of $T$. Moreover, when the rewards of the arms are arbitrarily close, the regret contributed by such sub-optimal arms is minimal. Our idea is to allow the center to indicate the resolution, $\Delta$, with which the agents must be distinguished. This immediately leads us to introduce the notion of $\Delta$-regret. Using sponsored search auctions as a concrete example (the same idea applies to other applications as well), we propose a dominant strategy incentive compatible (DSIC) and individually rational (IR), deterministic MAB mechanism, based on ideas from the Upper Confidence Bound (UCB) family of MAB algorithms. Remarkably, the proposed mechanism $\Delta$-UCB achieves a $\Delta$-regret of $O(\log T)$ for the case of sponsored search auctions. We first establish the results for single slot sponsored search auctions and then non-trivially extend the results to the case where multiple slots are to be allocated.

Keywords: Multi-armed bandit mechanism, DSIC, Deterministic

1 Introduction

Multi-armed bandit (MAB) algorithms Bubeck2012a are now widely used to model and solve problems where decisions must be made sequentially at every time step and there is an exploration-exploitation dilemma. This dilemma is the tradeoff that the planner faces in deciding whether to explore arms that may yield higher rewards in the future or exploit the arms that have already yielded high rewards in the past. If the rewards are generated from fixed distributions with unknown parameters, the setting goes by the name stochastic MAB Bubeck2012a. Popular algorithms in the stochastic MAB setting include Upper Confidence Bound (UCB) based algorithms UCBAuer2002 and Thompson Sampling based algorithms jmlrShipraThompson. These algorithms incur $O(\log T)$ regret, where $T$ is the total number of time steps. MAB algorithms are well studied, with several variants Kleinberg2010; Bubeck2012b; Kapoor2018; chen2013combinatorial and applications Santiago2017; PadmanabhanIJCNN16; scott2010modern; dirkx2018optimizing.

When the arms are controlled by strategic agents, we need to tackle additional challenges. Mechanism design NARAHARI2014 ; NISAN2007 ; Nisan2007jair has been applied in this context, leading to stochastic MAB mechanisms Liu2017 . The design of such mechanisms requires ideas from online learning as well as mechanism design, both of which are increasingly gaining importance in the field of artificial intelligence. An immediate application of stochastic MAB mechanisms is in sponsored search auctions (SSA). In SSA, there are several advertisers who wish to display their ads along with the search results generated in response to a query from an internet user. In the standard model, an advertiser has only one ad to display. We use the terms agent, ad, and advertiser interchangeably. There are two components that are of interest to the planner or the search engine, (1) stochastic component: click through rate (CTR) of the ads or the probability that a displayed ad receives a click (2) strategic component: valuation of the agent for every click that the agent’s ad receives. The search engine would seek to allocate a slot to an ad which has the maximum social welfare (product of click through rate and valuation). However neither the CTRs nor the valuations of the agents are known. This calls for a learning algorithm to learn the stochastic component (CTR) as well as a mechanism to elicit the strategic component (valuation). This problem could become much harder as the agents may manipulate the learning process Babaioff2014 ; Jain2016 to gain higher utilities.

For single slot SSA, it is known that any deterministic MAB mechanism (that is, a MAB mechanism with a deterministic allocation and payment rule) suffers a regret Bubeck2012a; Feldman2014 of $\Omega(T^{2/3})$ Babaioff2014. Furthermore, there exists a deterministic MAB mechanism whose regret matches this lower bound Babaioff2014 and which also satisfies ex-post truthfulness, the strongest notion of truthfulness (a posteriori to the clicks). When a more relaxed notion of truthfulness is targeted (truthfulness in expectation over the clicks), the regret guarantee improves to $O(T^{1/2})$ babaioff2010truthful. Truthfulness in expectation has also been achieved in Gonen2007; Pavlov2009. The regret can be further improved when randomized mechanisms are used; in fact, the regret in this space is $O(\log T)$ babaioff2010truthful; Jain2018. However, the high variance of the payments, which is inevitable in randomized mechanisms, is a serious deterrent to their use. Towards reducing the variance, Ghalme2017 propose a MAB mechanism using Thompson sampling jmlrShipraThompson. However, the notion of truthfulness achieved there is ‘within period DSIC’ and holds only with high probability. Thus, again, only a weaker notion of truthfulness is achieved compared to ex-post truthfulness.

In this work, we observe that the characterization provided by Babaioff et al. Babaioff2014 targets the worst case scenario. In particular, in the lower bound proof of regret of $\Omega(T^{2/3})$ , they consider an example scenario where the actual separation, $\bar{\Delta}$ , between the expected rewards of the arms is a function of $T$ . We note that when a similar example ( $\bar{\Delta}=T^{-1}$ ) is used with the popular UCB algorithm UCBAuer2002 , the number of pulls of sub-optimal arms could be linear, even in the non-strategic case. Hence, a dependence of $\bar{\Delta}$ on $T$ is severely restrictive for the case when the rewards are stochastic, even when the arms are non-strategic. We make the observation that $\bar{\Delta}$ is in most situations independent of $T$ . This motivates our main idea in this paper, which is to provide the planner an option to specify a parameter $\Delta$ , which is the tolerance or distinguishing level for sub-optimal arms. The understanding is that any arm that is within $\Delta$ from the best arm will not cause any additional regret to the planner. For example, the best arm may yield expected reward of $6.000$ while a sub-optimal arm may yield a very close expected reward of $5.999$ . The planner is typically indifferent to such small differences. Traditional exploration-separated schemes end up spending a huge number of exploration rounds in order to distinguish between these two closely separated arms.

**Setting the value of $\Delta$: An Example**

The value of $\Delta$ is set by the central planner depending on how well he would like to distinguish between the arms. For example, consider the case where there are two agents. Agent $1$ has a CTR $\mu_{1}=0.8$ and a valuation for every click $\theta_{1}=5$ units. Agent $2$ has a CTR $\mu_{2}=0.3999$ and a valuation for every click $\theta_{2}=10$. Agent 1 is the preferred agent, as his expected social welfare is $\mu_{1}\theta_{1}=4$ while the expected social welfare for agent 2 is $\mu_{2}\theta_{2}=3.999$. The actual separation between the agents is then $\bar{\Delta}=4-3.999=0.001$. But the planner may be indifferent to such a small difference of $0.001$ in expected social welfare, and would therefore be satisfied with selecting either of the agents. Hence, he should set the parameter $\Delta$ to any value greater than $0.001$.
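The two-agent example can be checked numerically. The following is a minimal sketch, not part of the paper: the function names `social_welfare` and `indistinguishable_set` are hypothetical, agents are indexed from 0, and the two values of $\Delta$ tried at the end are illustrative choices on either side of the actual separation of $0.001$.

```python
def social_welfare(ctr, valuation):
    """Expected social welfare W_i = mu_i * theta_i of allocating the slot to agent i."""
    return ctr * valuation


def indistinguishable_set(ctrs, valuations, delta):
    """Agents whose welfare is less than delta below the best welfare (0-based indices)."""
    welfare = [social_welfare(m, v) for m, v in zip(ctrs, valuations)]
    best = max(welfare)
    return {i for i, w in enumerate(welfare) if best - w < delta}


# Values from the example: agent 1 -> index 0, agent 2 -> index 1.
ctrs = [0.8, 0.3999]
valuations = [5.0, 10.0]

# A tolerance above the separation of 0.001 treats both agents as equally good.
coarse = indistinguishable_set(ctrs, valuations, delta=0.01)
# A tolerance below the separation singles out agent 1 (index 0).
fine = indistinguishable_set(ctrs, valuations, delta=0.0005)
print(coarse, fine)
```

With the coarser tolerance the planner is content with either agent, which is exactly the situation in which exploration rounds spent separating the two agents would be wasted.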

This notion of $\Delta$ tolerance will require an appropriate definition of regret, which we call $\Delta$ -regret. Focussing on $\Delta$ -regret instead of the usual notion of regret helps us to reduce the number of exploration rounds significantly from $O(T^{2/3})$ to $O(\log T)$ . We propose an exploration separated mechanism based on UCB, which achieves a $\Delta$ -regret of $O(\log T)$ . This mechanism can be readily applied in several settings such as SSA, crowdsourcing, and online procurement. For the rest of the paper, however, we use SSA as a running example.

Contributions:

(1) We make the crucial observation that in most MAB scenarios, the separation between the agents’ rewards is rarely a function of $T$ (the number of time steps). Moreover, in the case that the rewards of the arms are arbitrarily close, the regret contributed by such sub-optimal arms is negligible. We exploit this observation to allow the center to specify the resolution, $\Delta$ , with which the agents must be distinguished. We introduce the notion of $\Delta$ -Regret to formalize this regret.

(2) Using sponsored search auctions as a concrete example, we propose a dominant strategy incentive compatible (DSIC) and individually rational (IR) MAB mechanism with a deterministic allocation and payment rule, based on ideas from the UCB family of MAB algorithms. The proposed mechanism $\Delta$-UCB achieves a $\Delta$-regret of $O(\log T)$ for the case of single slot sponsored search auctions. The truthfulness achieved by $\Delta$-UCB is a posteriori to the click realizations and is the strongest form of truthfulness. A loss of $O(\log T)$ would not have been possible under the traditional notion of regret. In particular, the number of exploration rounds in $\Delta$-UCB is $O(\log T)$, as opposed to the $O(T^{2/3})$ rounds that were so far mandatory for ensuring a truthful, deterministic mechanism. The planner is thus relieved of this huge number of exploration rounds. We also show that a lower bound on the $\Delta$-regret suffered by any mechanism is $\Omega(\log T)$.

(3) We non-trivially extend the above results to the case where multiple slots are to be allocated. Here again, our mechanism is DSIC, IR, and achieves a $\Delta$ -regret that is $O(\log T)$ .

Our results are generic to stochastic MAB mechanisms and can be applied to other popular applications such as crowdsourcing and online procurement.

2 Relevant Work

In the area of MAB mechanisms, a lot of work has been done in sponsored search auctions. Babaioff et al. Babaioff2014 provide a characterization of truthful MAB mechanisms, wherein the objective is to maximize social welfare. They introduce the notion of influential rounds, which are the rounds where the parameters of the reward distributions (CTRs) are learnt. One of the characterizations of truthful deterministic mechanisms is that the allocation must be exploration separated, that is, in such influential rounds, the allocation must not depend on the bids of the agents. The allocation is also required to be pointwise monotone. One of the main results of their paper is that any truthful, deterministic MAB mechanism incurs a regret of $\Omega(T^{2/3})$. In particular, their analysis is adversarial in nature: the gap between the best and the second best arm is chosen, as if by an adversary, to be proportional to $T^{-1/3}$. Such a choice ensures a huge regret for any truthful, deterministic mechanism. They also provide a mechanism which incurs a matching upper bound regret of $O(T^{2/3})$. Devanur et al. Devanur2009 concurrently provide similar bounds on the regret when the objective is revenue maximization rather than social welfare maximization.

All the above results pertain to the setting of single slot auctions where there is a single slot for which the agents compete. In the generalization of this setting multiple slots are reserved for ads. This setting is more challenging as every slot is not identical and some slots are more prominent than the others. MAB mechanisms have also been extended to the multiple slot setting Gatti2012 in line with the characterization in Babaioff2014 . Hence, a similar regret of $O(T^{2/3})$ on the social welfare has been attained here as well. Similar results are also stated in the characterisation provided in Akash2012 .

MAB mechanisms have also been proposed in the context of crowdsourcing Biswas2015. Some of these mechanisms incur a regret of $O(\log T)$. This is possible because of the specific nature of the problem at hand. In particular, Bhat et al. BhatAAMAS16 look at divisible tasks. Jain et al. JainAAMAS16 look at deterministic mechanisms where a block of tasks is allocated to each agent and provide a weaker notion of truthfulness.

The lower bound of $\Omega(T^{2/3})$ on both the social welfare regret and the revenue regret has influenced subsequent research to follow similar assumptions and thereby obtain a similar regret. However, we show in this work that it is indeed possible to design a deterministic mechanism which attains logarithmic $\Delta$-regret and is also truthful in the dominant strategy incentive compatible (DSIC) sense Myerson1991. DSIC, of course, is the most preferred form of truthfulness NARAHARI2014. This work opens up the possibility for a planner to move away from the worst-case scenario to a more realistic scenario. We enable the planner to specify a resolution parameter for distinguishing the arms, introduce the notion of $\Delta$-regret, and thereafter propose a mechanism that ensures that the number of exploration rounds, and hence the regret suffered, is only $O(\log T)$ instead of the expensive $\Omega(T^{2/3})$ incurred by the current state of the art. We summarize the contrast between our work and the state of the art in Table 1.

3 The Model: Single Slot SSA

We now describe our SSA setting. For ease of reference, our notations are provided in Table 2. Let $K$ be the number of agents or arms. We denote the set of arms by $[K]$. Each of the $K$ arms, when pulled, gives rewards from distributions with unknown parameters. We assume here that the form of the distributions is known but their parameters are unknown. In SSA, the rewards of the arms correspond to clicks. The clicks for the advertisements are assumed to be generated from Bernoulli distributions with parameters $\mu_{1},\mu_{2},\ldots,\mu_{K}$, where $\mu_{i}$ is the CTR, or the probability that advertisement $i$ receives a click once observed. The means $\mu_{1},\ldots,\mu_{K}$ are unknown.

A click realization $\rho$ represents the click information of every agent at all rounds, that is, $\rho_{i}(t)=1$ if agent $i$ received a click in round $t$ . In a round $t$ , only the click information of the allocated agent is revealed after the completion of the round. Click information of all other unallocated agents is never known to the planner.

The agents also have their valuations for each click they receive. We work in the ‘pay per click’ setting where the agent pays the search engine for each click received. Let the true valuation of agent $i$ for a click be $\theta_{i}$. $\theta_{i}$ is the private type of agent $i$ and is never known to the planner. However, the agent is asked to bid his valuation. Let the bid of agent $i$ be $b_{i}$. We denote by the vector $b=(b_{1},\ldots,b_{K})$ the bid profile of all the agents. The central planner wants to ensure that the agents bid their true valuations, that is, $b_{i}$ must be equal to $\theta_{i}$. Assume that there is a single slot which must be allocated to one of the $K$ agents. We denote by $W_{i}$ the social welfare when agent $i$ is allocated the slot, that is, $W_{i}=\mu_{i}\theta_{i}$. The social welfare is the expected value that agent $i$ derives each time his ad is displayed: his valuation per click multiplied by the probability of a click. If the CTRs of the agents as well as their valuations were known, the planner would select the arm with the maximum social welfare $\mu_{i}\theta_{i}$. However, neither $\mu_{i}$ nor $\theta_{i}$ is known to the planner. Assume $\theta_{max}$ is the maximum valuation that any agent can have and is common knowledge. The central planner wants to allocate the single slot to one of the ads in such a way that the net social welfare of the allocation is maximized.

A mechanism $\mathcal{M}=\left\langle\mathcal{A},P\right\rangle$ is a tuple containing an allocation rule $\mathcal{A}$ and a payment rule $P$ . At every time step or round $t$ , the allocation rule acts on a bid profile $b$ of the agents as well as click realization $\rho$ and allocates the slot to one of the $K$ agents, say $i$ . Then $\mathcal{A}(b,\rho,t)=i$ . Alternatively we denote the indicator variable $\mathcal{A}_{i}(b,\rho,t)=\mathbbm{1}[\mathcal{A}(b,\rho,t)=i]$ . The payment rule $P^{t}=(P_{1}^{t},P_{2}^{t},\ldots,P_{K}^{t})$ , where $P_{i}^{t}(b,\rho)$ is the payment to be made by agent $i$ at time $t$ upon receiving a click, when the bids are $b$ and for click realization $\rho$ . As stated earlier $\rho_{i}(t)$ of the allocated agent alone is observed. Also note that the allocation as well as payments in each round $t$ only depends on the click histories till that round.

Let $i_{*}$ be the arm with the largest social welfare, that is, $i_{*}=\operatorname*{arg\,max}\limits_{i\in[K]}W_{i}$. We denote the corresponding social welfare as $W_{*}=\max_{i\in[K]}W_{i}$. We denote by $I_{t}$ the agent chosen at time $t$, as a shorthand for $\mathcal{A}(b,\rho,t)$. For any given $\Delta>0$, define the set $S_{\Delta}=\{i\in[K]:W_{*}-W_{i}<\Delta\}$. $S_{\Delta}$ is the set of all agents whose social welfare is less than $\Delta$ away from that of the best arm $i_{*}$. These arms are therefore indistinguishable from the best arm for the center and they contribute zero to the regret. Note that $\Delta$ is a parameter that the center fixes based on the amount in dollars he is willing to trade off for choosing sub-optimal arms, given that he has only a fixed time horizon $T$ at his disposal. To capture this revised and more practical notion of regret, we introduce the metric $\Delta$-regret. Formally,

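$$\Delta\text{-regret}=\sum_{t=1}^{T}(W_{*}-W_{I_{t}})\,\mathbbm{1}\left[I_{t}\in[K]\setminus S_{\Delta}\right]$$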

The center may not want to invest a huge number of exploration rounds ($\Omega(T^{2/3})$ in the state of the art) to perfectly distinguish arms that are arbitrarily close. Often, the planner may instead be willing to allocate arms that are less than $\Delta$ away from the best arm. The center therefore suffers a regret only when an agent whose social welfare is at least $\Delta$ away from $W_{*}$ is chosen. $\Delta$-regret captures this loss.

The goal of our mechanism is to select agents at every round $t$ to minimize the $\Delta$ -regret.
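As an illustration of the metric, the sketch below accumulates the welfare gap $W_{*}-W_{I_{t}}$ only over rounds in which the chosen agent lies outside $S_{\Delta}$. It is not taken from the paper: the function name `delta_regret`, the welfare values, and the allocation sequence are all hypothetical, and the true welfares are assumed known purely for the purpose of evaluating the metric.

```python
def delta_regret(welfare, chosen, delta):
    """Sum of (W_* - W_{I_t}) over rounds t where the chosen agent I_t is not in S_Delta.

    welfare: true social welfare W_i of each agent (known here only for evaluation)
    chosen:  sequence of allocated agent indices I_1, ..., I_T (0-based)
    delta:   tolerance; agents with W_* - W_i < delta incur no regret
    """
    best = max(welfare)
    total = 0.0
    for i in chosen:
        gap = best - welfare[i]
        if gap >= delta:  # agent i is outside S_Delta
            total += gap
    return total


# Hypothetical instance: agent 1 is within 0.001 of the best, agent 2 is far away.
welfare = [4.0, 3.999, 2.0]
chosen = [0, 1, 2, 1, 2]

tolerant = delta_regret(welfare, chosen, delta=0.01)   # only agent 2 is charged
strict = delta_regret(welfare, chosen, delta=1e-12)    # effectively the usual regret
print(tolerant, strict)
```

The two results differ only by the contribution of the near-optimal agent, which is the part of the loss the planner has declared himself indifferent to.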

4 Our Mechanism: $\Delta$ -UCB

We are now ready to describe our mechanism $\Delta$ -UCB. The idea in $\Delta$ -UCB is to explore all the arms in a round-robin fashion for a fixed number of rounds. The number of exploration rounds is fixed based on the desired $\Delta$ , specified by the planner. At the end of exploration, with high probability, we are guaranteed that the arms not in $S_{\Delta}$ are well separated from the best arm $i_{*}$ with respect to their social welfare estimates. In the exploration rounds, agents need not pay and these rounds are free.

Further on, for all the remaining rounds, the best arm as per the UCB estimate of social welfare is chosen. In these exploitation rounds, the chosen agent pays an amount for each click he receives. The amount to be paid by the agent is fixed based on a variant of the well-known Vickrey-Clarke-Groves (VCG) scheme vickrey1961counterspeculation known as weighted VCG NISAN2007. Note that no learning takes place in these rounds and the UCB and LCB indices do not change thereafter. We present our mechanism in Algorithm 1.
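The description above can be turned into a small simulation. The sketch below is an assumption-laden reading rather than the paper's Algorithm 1: the per-arm exploration length is taken as the smallest $N$ with $\Delta^{2}>8\theta_{max}^{2}\log T/N$, the UCB index is taken as $\hat{\mu}_{i}+\sqrt{2\log T/N}$, and the per-click payment is the weighted second price $\hat{\mu}_{j}^{+}b_{j}/\hat{\mu}_{i}^{+}$ that appears in the truthfulness proof. All function names and the numeric instance are hypothetical, and the true CTRs are used only to simulate clicks.

```python
import math
import random


def exploration_rounds_per_arm(T, delta, theta_max):
    """Smallest integer N with delta^2 > 8 * theta_max^2 * log(T) / N (assumed rule)."""
    return math.floor(8 * theta_max ** 2 * math.log(T) / delta ** 2) + 1


def explore(true_ctrs, n_per_arm, rng):
    """Round-robin exploration: bids are ignored and nobody pays. Returns empirical CTRs."""
    clicks = [0] * len(true_ctrs)
    for _ in range(n_per_arm):
        for i, mu in enumerate(true_ctrs):
            clicks[i] += 1 if rng.random() < mu else 0
    return [c / n_per_arm for c in clicks]


def ucb_indices(mu_hat, n_per_arm, T):
    """UCB index of each CTR: empirical mean plus sqrt(2 log T / N) (assumed form)."""
    bonus = math.sqrt(2 * math.log(T) / n_per_arm)
    return [m + bonus for m in mu_hat]


def exploit(bids, mu_plus):
    """Pick the agent with the largest mu_plus * bid; charge a weighted second price per click."""
    scores = [m * b for m, b in zip(mu_plus, bids)]
    winner = max(range(len(bids)), key=scores.__getitem__)
    runner_up = max(s for j, s in enumerate(scores) if j != winner)
    return winner, runner_up / mu_plus[winner]


# Hypothetical instance with two agents.
T, delta, theta_max = 10 ** 6, 1.0, 10.0
bids = [5.0, 10.0]
true_ctrs = [0.8, 0.2]  # hidden from the mechanism; used only to simulate clicks
rng = random.Random(0)

n = exploration_rounds_per_arm(T, delta, theta_max)
mu_plus = ucb_indices(explore(true_ctrs, n, rng), n, T)
winner, payment = exploit(bids, mu_plus)
print(n, winner, payment)
```

Because the indices are frozen after exploration, the winner and the per-click price stay the same in every exploitation round, so the remaining rounds need not be simulated one by one.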

4.1 Properties of $\Delta$ -UCB

Next we discuss the properties satisfied by $\Delta$ -UCB regarding truthfulness and regret. Before that, we state a few useful definitions which will help in understanding the notion of truthfulness.

At any time step, every agent obtains some utility by participating in the mechanism. This utility is a function of his bid, his valuation, the bids of the other agents, and his click realization. Let $\Theta_{i}$ denote the space of bids of agent $i$. $b_{-i}=(b_{1},\ldots,b_{i-1},b_{i+1},\ldots,b_{K})$ is the bid profile containing the bids of all agents except agent $i$. Let $\Theta_{-i}$ denote the space of bids of all agents other than agent $i$. Therefore $\Theta_{-i}=\Theta_{1}\times\ldots\times\Theta_{i-1}\times\Theta_{i+1}\times\ldots\times\Theta_{K}$. We denote by $u_{i}(b_{i},b_{-i},\rho,t;\theta_{i})$ the utility to agent $i$ at time $t$ when his bid is $b_{i}$, his valuation is $\theta_{i}$, the bid profile of the remaining agents is $b_{-i}$, and the click realization is $\rho$. All agents are assumed to be rational and are interested in maximizing their own utilities.

In our setting the utility to an agent $i$ is computed as,

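$$u_{i}(b_{i},b_{-i},\rho,t;\theta_{i})=\left(\theta_{i}-P_{i}^{t}(b,\rho)\right)\mathcal{A}_{i}(b_{i},b_{-i},\rho,t)\,\rho_{i}(t)$$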

The idea behind the computation of the utility is as follows. If an agent $i$ does not receive an allocation (that is, $\mathcal{A}_{i}(b_{i},b_{-i},\rho,t)=0$ ), his utility is also zero. He gets a non-zero utility only if he receives an allocation. If he receives an allocation and also a click ( $\rho_{i}(t)=1$ ), then his utility is the difference between his valuation for the click and the amount he has to pay to the search engine ( $\theta_{i}-P_{i}^{t}(b,\rho)$ ). If he does not receive a click ( $\rho_{i}(t)=0$ ), his utility is zero.
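The three cases just described (no allocation, allocation with a click, allocation without a click) follow directly from the product form $(\theta_{i}-P_{i}^{t})\,\mathcal{A}_{i}\,\rho_{i}(t)$. A minimal sketch, with a hypothetical function name and illustrative numbers:

```python
def utility(valuation, payment, allocated, clicked):
    """Per-round utility: (theta_i - P_i^t) if agent i is allocated AND clicked, else 0."""
    return (valuation - payment) * int(allocated) * int(clicked)


# Illustrative values: valuation 5 per click, payment 3 per click.
u_click = utility(5.0, 3.0, allocated=True, clicked=True)        # allocated and clicked
u_no_click = utility(5.0, 3.0, allocated=True, clicked=False)    # allocated, no click
u_not_alloc = utility(5.0, 3.0, allocated=False, clicked=True)   # not allocated
print(u_click, u_no_click, u_not_alloc)
```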

Definition 1

Dominant Strategy Incentive Compatible (DSIC) Babaioff2014 : A mechanism $M=\left\langle\mathcal{A},P\right\rangle$ is said to be dominant strategy incentive compatible if $\forall i\in[K],\forall b_{i}\in\Theta_{i}$ , $\forall b_{-i}\in\Theta_{-i},\forall\rho,\forall t,u_{i}(\theta_{i},b_{-i},\rho,t;\theta_{i})\geq u_{i}(b_{i},b_{-i},\rho,t;\theta_{i})$ .

Note that in the above definition, truthfulness is demanded a posteriori, even after the click realization Gatti2012 . Hence it is the strongest notion of truthfulness. Weaker forms of truthfulness include those that take an expectation over click realizations.

Definition 2

Individually Rational (IR): A mechanism $M=\left\langle\mathcal{A},P\right\rangle$ is said to be individually rational if $\forall i\in[K]$ , $\forall b_{-i}\in\Theta_{-i},\forall\rho,\forall t,u_{i}(\theta_{i},b_{-i},\rho,t;\theta_{i})\geq 0$ .

Theorem 4.1

$\Delta$ -UCB mechanism is dominant strategy incentive compatible (DSIC) and individually rational (IR).

Proof

We analyze the scenarios where an agent $i$ bids his true valuation and receives an allocation and also when he does not. We show that in both these scenarios, bidding his true valuation $\theta_{i}$ is indeed a best response strategy. We only need to consider the exploitation rounds because in the exploration rounds, every agent is allocated a fixed number of rounds independent of his bids and these rounds are also free for agents.

Case 1: $\mathcal{A}_{i}(\theta_{i},b_{-i},\rho,t)=1$

This implies that when the agent bids his true valuation, he gets an allocation. Therefore $\widehat{\mu}_{i,t}^{+}\theta_{i}>\widehat{\mu}_{l,t}^{+}b_{l}$ for all the other agents $l$ . In particular, let agent $j$ be such that $j=\operatorname*{arg\,max}_{l\in[K]\setminus\{i\}}\widehat{\mu}_{l,t}^{+}b_{l}$ . The amount to be paid by agent $i$ is $P_{i}^{t}(\theta_{i},b_{-i},\rho)=\widehat{\mu}_{j,t}^{+}b_{j}/\widehat{\mu}_{i,t}^{+}$ . If he receives a click then $u_{i}(\theta_{i},b_{-i},\rho,t;\theta_{i})=\theta_{i}-\widehat{\mu}_{j,t}^{+}b_{j}/\widehat{\mu}_{i,t}^{+}>0$ .

Overbid: If agent $i$ bids a value $b_{i}>\theta_{i}$ , he continues to receive an allocation and his payment is still the same, $P_{i}^{t}(b_{i},b_{-i},\rho)=\widehat{\mu}_{j,t}^{+}b_{j}/\widehat{\mu}_{i,t}^{+}$ . Therefore his utility continues to be $u_{i}(b_{i},b_{-i},\rho,t;\theta_{i})=\theta_{i}-\widehat{\mu}_{j,t}^{+}b_{j}/\widehat{\mu}_{i,t}^{+}=u_{i}(\theta_{i},b_{-i},\rho,t;\theta_{i})$ . Therefore he does not benefit from an overbid.

Underbid: Suppose agent $i$ bids a value $b_{i}<\theta_{i}$ .

Case a: If $b_{i}$ is such that $\widehat{\mu}_{i,t}^{+}b_{i}<\widehat{\mu}_{j,t}^{+}b_{j}$ , then he fails to get an allocation as $\mathcal{A}(b_{i},b_{-i},\rho,t)=j\neq i$ . Then the utility to agent $i$ is $u_{i}(b_{i},b_{-i},\rho,t;\theta_{i})=0<u_{i}(\theta_{i},b_{-i},\rho,t;\theta_{i})$ . Therefore he clearly loses his utility by such an underbid.

Case b: Suppose $b_{i}$ is such that $\widehat{\mu}_{i,t}^{+}\theta_{i}>\widehat{\mu}_{i,t}^{+}b_{i}>\widehat{\mu}_{j,t}^{+}b_{j}$ . That is agent $i$ bids in such a way that he wins the allocation even with an underbid. Then, if he gets a click, the amount he must pay to the center is $P_{i}^{t}(b_{i},b_{-i},\rho)=\widehat{\mu}_{j,t}^{+}b_{j}/\widehat{\mu}_{i,t}^{+}$ . Therefore his utility $u_{i}(b_{i},b_{-i},\rho,t;\theta_{i})=\theta_{i}-\widehat{\mu}_{j,t}^{+}b_{j}/\widehat{\mu}_{i,t}^{+}=u_{i}(\theta_{i},b_{-i},\rho,t;\theta_{i})$ . He obtains the same utility as a truthful bid and there is no benefit from such an underbid.

Case 2: $\mathcal{A}_{i}(\theta_{i},b_{-i},\rho,t)=0$

This implies that when the agent bids his true valuation, he does not get an allocation. Suppose agent $j$ wins the allocation. $\mathcal{A}(\theta_{i},b_{-i},\rho,t)=j$ and $\widehat{\mu}_{i,t}^{+}\theta_{i}<\widehat{\mu}_{j,t}^{+}b_{j}$ .

Truthful bid: Since agent $i$ does not win an allocation with a truthful bid, his utility $u_{i}(\theta_{i},b_{-i},\rho,t;\theta_{i})=0$

Overbid: Suppose agent $i$ bids in such a way that $b_{i}>\theta_{i}$ . We have two sub-cases here.

Case a: If $b_{i}$ is such that $\widehat{\mu}_{i,t}^{+}\theta_{i}<\widehat{\mu}_{j,t}^{+}b_{j}<\widehat{\mu}_{i,t}^{+}b_{i}$ , then agent $i$ wins the allocation. So, $\mathcal{A}_{i}(b_{i},b_{-i},\rho,t)=1$ . If he gets a click, he now has to make a payment $P_{i}^{t}(b_{i},b_{-i},\rho)=\widehat{\mu}_{j,t}^{+}b_{j}/\widehat{\mu}_{i,t}^{+}$ . Now his utility is $u_{i}(b_{i},b_{-i},\rho,t;\theta_{i})=\theta_{i}-\widehat{\mu}_{j,t}^{+}b_{j}/\widehat{\mu}_{i,t}^{+}<0$ . In particular, $u_{i}(b_{i},b_{-i},\rho,t;\theta_{i})<u_{i}(\theta_{i},b_{-i},\rho,t;\theta_{i})=0$ . Therefore, such an overbid is clearly disadvantageous compared to a truthful bid.

Case b: Suppose $\widehat{\mu}_{i,t}^{+}\theta_{i}<\widehat{\mu}_{i,t}^{+}b_{i}<\widehat{\mu}_{j,t}^{+}b_{j}$ . The overbid by agent $i$ is not sufficient to make him win the allocation and agent $j$ wins the allocation, $\mathcal{A}(b_{i},b_{-i},\rho,t)=j$ . The utility of agent $i$ , $u_{i}(b_{i},b_{-i},\rho,t;\theta_{i})=0=u_{i}(\theta_{i},b_{-i},\rho,t;\theta_{i})$ . Therefore there is no advantage for agent $i$ by this case of overbid.

Underbid: If agent $i$ bids in such a way that $b_{i}<\theta_{i}$ , he continues to lose the allocation and therefore his utility, $u_{i}(b_{i},b_{-i},\rho,t;\theta_{i})=0=u_{i}(\theta_{i},b_{-i},\rho,t;\theta_{i})$ . Since, the utility by an underbid remains the same as a truthful bid, there is clearly no advantage in underbidding.

All the above cases show that our mechanism is DSIC a posteriori to the click realizations. Also, in each of the above cases, note that the utility of an agent $i$ under a truthful bid satisfies $u_{i}(\theta_{i},b_{-i},\rho,t;\theta_{i})\geq 0$ . Therefore, by truthful bidding he never gets a negative utility. This proves that our mechanism is individually rational.
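The case analysis in the proof can be checked numerically for one exploitation round. The sketch below is illustrative only: the function names are hypothetical, the UCB indices are taken as given numbers, and ties are broken against the agent under study, which is an assumption of this sketch and not something the proof specifies.

```python
def exploitation_utility(i, bids, ucb, valuation, clicked=True):
    """Utility of agent i in a single exploitation round.

    bids[l] and ucb[l] are the bid and the CTR upper-confidence index of
    agent l. The winner maximises ucb[l] * bids[l] and pays, per click,
    (best competing product) / (own UCB index), as in the proof above.
    """
    scores = [u * b for u, b in zip(ucb, bids)]
    best_other = max(s for l, s in enumerate(scores) if l != i)
    # Assumption: a tie is resolved against agent i.
    if scores[i] <= best_other or not clicked:
        return 0.0
    return valuation - best_other / ucb[i]


def truthful_is_best(i, valuation, other_bids, ucb, candidate_bids):
    """Return True if no candidate bid beats bidding the true valuation."""
    def utility_of(bid):
        bids = other_bids[:i] + [bid] + other_bids[i:]
        return exploitation_utility(i, bids, ucb, valuation)

    truthful = utility_of(valuation)
    return all(utility_of(b) <= truthful + 1e-12 for b in candidate_bids)
```

Sweeping a grid of bids for a winning agent (Case 1) and for a losing agent (Case 2) should never produce a bid that does strictly better than the true valuation, and the truthful utility should never be negative.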

We next discuss the regret incurred by $\Delta$ -UCB. We note that the regret analysis we provide differs in spirit from the worst case analysis in Babaioff2014 . The number of exploration rounds in Babaioff2014 is required to be $\Omega(T^{2/3})$ since the separation between the best and second best arm is fixed in an adversarial manner in their analysis. Our analysis does not resort to any adversarial arguments.

In order to prove our $\Delta$ -regret results, we will first need to prove several other lemmas.

Lemma 1

Social Welfare UCB index: For an agent $i$ , we define the social welfare UCB indices as,

[TABLE]

Then, $\forall t\;P\left(\left\{\omega:W_{i}\notin[\widehat{W}_{i,t}^{-}(\omega),\widehat{W}_{i,t}^{+}(\omega)]\right\}\right)\leq 2T^{-4}$ .

Proof

Let $\widehat{\mu}_{i,t}^{+}$ and $\widehat{\mu}_{i,t}^{-}$ denote the UCB and LCB indices for the estimate $\widehat{\mu}_{i}$ . Then the events $\{\omega:\mu_{i}\notin[\widehat{\mu}_{i,t}^{-}(\omega),\widehat{\mu}_{i,t}^{+}(\omega)]\}$ and $\{\omega:W_{i}\notin[\widehat{W}_{i,t}^{-}(\omega),\widehat{W}_{i,t}^{+}(\omega)]\}$ are identical. So, $P(W_{i}\notin[\widehat{W}_{i,t}^{-},\widehat{W}_{i,t}^{+}])=P(\mu_{i}\notin[\widehat{\mu}_{i,t}^{-},\widehat{\mu}_{i,t}^{+}])$ . An application of Hoeffding's inequality [hoeffding1963probability ] gives $P(\mu_{i}\notin[\widehat{\mu}_{i,t}^{-},\widehat{\mu}_{i,t}^{+}])\leq 2\exp(-2N_{i,t}\epsilon_{i,t}^{2})$ . As per the mechanism, $\epsilon_{i,t}=\sqrt{2\log T/N_{i,t}}$ . So,

$P(\mu_{i}\notin[\widehat{\mu}_{i,t}^{-},\widehat{\mu}_{i,t}^{+}])\leq 2\exp(-2N_{i,t}\times 2\log T/N_{i,t})=2T^{-4}$ .
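The substitution in the last step can be verified numerically. The following sketch uses hypothetical function names; it simply plugs the confidence radius $\epsilon_{i,t}=\sqrt{2\log T/N_{i,t}}$ into the two-sided Hoeffding bound and compares the result with $2T^{-4}$.

```python
import math


def confidence_radius(horizon: int, pulls: int) -> float:
    """epsilon_{i,t} = sqrt(2 log T / N_{i,t}) as used by the mechanism."""
    return math.sqrt(2.0 * math.log(horizon) / pulls)


def hoeffding_two_sided(pulls: int, radius: float) -> float:
    """Two-sided Hoeffding bound 2 exp(-2 N eps^2) for [0, 1]-valued samples."""
    return 2.0 * math.exp(-2.0 * pulls * radius ** 2)
```

Note that the number of pulls cancels, so the bound equals $2T^{-4}$ regardless of how many samples the arm has received.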

Lemma 2

Suppose at time step $t$ , $N_{i,t}>\frac{8\theta_{max}^{2}\log T}{\Delta^{2}}\;\forall i\in[K]$ . Then $\forall i\in[K]$ , $2\epsilon_{i,t}\theta_{i}<\Delta$ .

Proof

Given that $N_{i,t}>\frac{8\theta_{max}^{2}\log T}{\Delta^{2}}$ . Therefore,

[TABLE]

Taking square roots on both sides of the above inequality yields $\Delta>2\epsilon_{i,t}\theta_{i}$ , thereby proving the lemma.
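As a numerical illustration of Lemma 2, the sketch below computes the sample threshold from the lemma statement and the interval width $2\epsilon_{i,t}\theta_{i}$ . The function names and the example parameter values are assumptions chosen only for illustration.

```python
import math


def exploration_threshold(theta_max: float, delta: float, horizon: int) -> float:
    """The bound 8 theta_max^2 log T / Delta^2 from the lemma statement."""
    return 8.0 * theta_max ** 2 * math.log(horizon) / delta ** 2


def interval_width(theta: float, pulls: int, horizon: int) -> float:
    """2 * epsilon * theta with epsilon = sqrt(2 log T / N)."""
    return 2.0 * theta * math.sqrt(2.0 * math.log(horizon) / pulls)
```

With any number of pulls strictly above the threshold, the width stays below $\Delta$ for every valuation up to $\theta_{max}$ ; at exactly the threshold and $\theta_{i}=\theta_{max}$ the width equals $\Delta$ , which is why the inequality in the lemma is strict.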

Lemma 3

Suppose $K\ll T$ . For an agent $i$ and time step $t$ , let $B_{i,t}$ be the event $B_{i,t}=\{\omega:W_{i}\notin[\widehat{W}_{i,t}^{-},\widehat{W}_{i,t}^{+}]\}$ . Define the event $G=\bigcap\limits_{t}\bigcap\limits_{i\in[K]}B_{i,t}^{c}$ , where $B_{i,t}^{c}$ is the complement of $B_{i,t}$ . Then $P(G)\geq 1-\frac{2}{T^{2}}$ .

Proof

From Lemma 1, the probability of the ‘bad’ event, $P(B_{i,t})\leq 2T^{-4}$ .

[TABLE]

The last statement follows by summing over all rounds and using the fact that $K\ll T$ .

Theorem 4.2

Suppose at time step $t$ , $N_{j,t}>\frac{8\theta_{max}^{2}\log T}{\Delta^{2}}\;\forall j\in[K]$ . Then $\forall i\in[K]\setminus S_{\Delta}$ , $\widehat{W}_{i_{*},t}^{+}>\widehat{W}_{i,t}^{+}$ with high probability (at least $1-2/T^{4}$ ).

Proof: In Theorem 4.1, we have shown that $\Delta$ -UCB is DSIC. Therefore, all the agents bid their valuations truthfully, $b_{i}=\theta_{i}\;\forall i\in[K]$ . Suppose in exploitation round $t$ , a sub-optimal arm $i$ is pulled. Therefore, $\widehat{W}_{i,t}^{+}\geq\widehat{W}_{i_{*},t}^{+}$ . Then one of the following three conditions must have happened.

Condition 1: $W_{i}<\widehat{W}_{i,t}^{-}$ . This condition implies a drastic overestimate of the sub-optimal arm $i$ so that the true social welfare $W_{i}$ is even below the LCB index $\widehat{W}_{i,t}^{-}$ . Figure 1 shows this case.

Condition 2: $W_{*}>\widehat{W}_{i_{*},t}^{+}$ . This implies an underestimate of the optimal arm so that the true social welfare $W_{*}$ lies above even the UCB index $\widehat{W}_{i_{*},t}^{+}$ .

Condition 3: $W_{*}-W_{i}<2\epsilon_{i,t}\theta_{i}$ . This implies an overlap in the confidence intervals of the optimal and sub-optimal arms. Even though Conditions 1 and 2 are false, the UCB of sub-optimal arm $i$ is still greater than the UCB of the optimal arm $i_{*}$ .

From Figure 3, $W_{*}-W_{i}\leq\widehat{W}_{i,t}^{+}-\widehat{W}_{i,t}^{-}\leq\;2\epsilon_{i,t}\theta_{i}$

If all the three conditions above were false, then,

[TABLE]

This implies that $\widehat{W}_{i_{*},t}^{+}>\widehat{W}_{i,t}^{+}$ , leading to a contradiction.
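The chain of inequalities behind this contradiction can be sanity-checked by random sampling. The sketch below is only an illustration under stated assumptions: it takes the confidence interval of arm $i$ to have width exactly $2\epsilon_{i,t}\theta_{i}$ , draws arbitrary numbers that make all three conditions false, and checks that the UCB of the optimal arm is then never below the UCB of arm $i$ . The function name and the sampling ranges are hypothetical.

```python
import random


def no_counterexample_found(trials: int = 10_000, seed: int = 0) -> bool:
    """Search for a violation of UCB_* >= UCB_i when Conditions 1-3 are false."""
    rng = random.Random(seed)
    for _ in range(trials):
        w_i = rng.random()                    # true welfare of arm i
        width = 0.5 * rng.random()            # stands for 2 * eps_i * theta_i
        lcb_i = w_i - width * rng.random()    # Condition 1 false: W_i >= LCB_i
        ucb_i = lcb_i + width                 # assumed: UCB_i = LCB_i + width
        w_star = w_i + width + rng.random()   # Condition 3 false: gap >= width
        ucb_star = w_star + rng.random()      # Condition 2 false: W_* <= UCB_*
        if ucb_star < ucb_i:
            return False
    return True
```

The check mirrors the argument $\widehat{W}_{i_{*},t}^{+}\geq W_{*}\geq W_{i}+2\epsilon_{i,t}\theta_{i}\geq\widehat{W}_{i,t}^{-}+2\epsilon_{i,t}\theta_{i}=\widehat{W}_{i,t}^{+}$ , so no counterexample should ever be produced.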

As per the statement of the theorem, $N_{i,t}>\frac{8\theta_{max}^{2}\log T}{\Delta^{2}}$ . Therefore by Lemma 2, $2\epsilon_{i,t}\theta_{i}<\Delta$ . For $i\in[K]\setminus S_{\Delta}$ , $W_{*}-W_{i}>\Delta>2\epsilon_{i,t}\theta_{i}$ . So Condition 3 above does not hold. Hence, if the sub-optimal arm $i$ has been pulled, the only possibilities are Conditions 1 and 2.

[TABLE]

thereby completing the proof.

We are now ready to state our main result on the incurred regret.

Theorem 4.3

If the $\Delta$ -UCB mechanism is executed for a total time horizon of $T$ rounds, it achieves an expected $\Delta$ -regret of $O(\log T)$ .

Proof

The main idea in the proof is to compute the $\Delta$ -regret conditional on two events - $G$ and $G^{c}$ and then to find a bound for these two conditional expectations.

[TABLE]

The last step comes from the fact that Conditions 1 and 2 in the proof of Theorem 4.2 are eliminated as we are given that the event $G$ has occurred. After exploration rounds, $N_{i,t}\geq 8K\theta_{max}^{2}\log T/\Delta^{2}$ . From Theorem 4.2, no $\Delta$ -regret occurs during exploitation since $G$ is true. Therefore the regret is only incurred during the exploration rounds.

We now compute $\mathbb{E}\left[\Delta\text{-regret}|G^{c}\right]$ .

[TABLE]

But $P(G^{c})=1-P(G)<\frac{2}{T^{2}}$ from Lemma 3.

Putting all the steps together,

[TABLE]

The second term is less than 2 as $\theta_{max}\ll T$ . This completes the proof.

A consequence of the above theorem is that even if an adversary chooses an arbitrarily small gap between the best and second best arm, the planner need not worry: if the gap is less than his tolerance $\Delta$ , no loss is incurred, as opposed to the $\Omega(T^{2/3})$ loss in Babaioff2014 .

4.2 A Lower Bound for $\Delta$ -regret

We will now discuss a lower bound for the $\Delta$ -regret incurred by our approach. In particular, we will provide the lower bound for the case where $\theta_{i}=1$ for all $i$ and is known. The proof will follow along the lines of the lower bound proof in Bubeck2012a . The same lower bound will naturally apply to the general strategic version as well, since our proposed mechanism $\Delta$ -UCB is truthful and achieves a matching upper bound.

Let $kl(p,q)$ denote the KL divergence between the distributions Bernoulli( $p$ ) and Bernoulli( $q$ ). Then $kl(p,q)=p\log\frac{p}{q}+(1-p)\log\frac{1-p}{1-q}$ .
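For concreteness, the Bernoulli KL divergence can be computed as below. This is a small sketch with a hypothetical function name; it uses natural logarithms and the usual convention $0\log 0=0$ , and it assumes $0<q<1$ .

```python
import math


def bernoulli_kl(p: float, q: float) -> float:
    """KL divergence between Bernoulli(p) and Bernoulli(q), natural log.

    Uses the convention 0 * log(0 / x) = 0 and requires 0 < q < 1.
    """
    def term(a: float, b: float) -> float:
        return 0.0 if a == 0.0 else a * math.log(a / b)

    return term(p, q) + term(1.0 - p, 1.0 - q)
```

The divergence is zero when the two means coincide and grows as $q$ moves away from $p$ , which is the quantity the lower bound argument tracks when agent 2's mean is shifted from $\mu_{2}$ to $\mu^{\prime}_{2}$ .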

Theorem 4.4

Consider the setting where $\theta_{i}=1\forall i\in[K].$ Suppose an algorithm satisfies $\mathbb{E}[N_{i,t}]=o(t^{a})$ for any set of Bernoulli reward distributions and for all arms $i\notin S_{\Delta}$ and $a>0$ . Then for any set of Bernoulli reward distributions we have,

[TABLE]

where $\mu^{*}=\max_{j\in[K]}\mu_{j}$ and $\Delta_{i}=\mu^{*}-\mu_{i}$ for all $i\in[K]$ .

Proof

We will provide the proof for the case of two agents. The proof for the case $K>2$ follows analogously. Assume that $\mu_{2}\leq\mu_{1}\leq 1$ and $\mu_{1}-\mu_{2}>\Delta$ . Therefore agent $1$ is optimal and agent 2 does not belong to $S_{\Delta}$ . For any $\epsilon>0$ , due to the continuity of $kl(\mu_{2},x)$ , we can find $\mu^{\prime}_{2}\in(\mu_{1}+\Delta,1)$ such that

[TABLE]

This configuration then corresponds to an alternate setting where the mean of agent 2 is $\mu^{\prime}_{2}$ . In this alternate setting, $\mu^{\prime}_{2}-\mu_{1}>\Delta$ and agent 2 is the unique optimal agent. For $s\in\{1,\ldots,T\}$ , let,

[TABLE]

It can be verified that $\lim_{t\rightarrow\infty}\mathbb{E}[\tilde{kl}_{t}]/t=kl(\mu_{2},\mu^{\prime}_{2})$ (where the expectation is taken over $\rho_{2}^{t}$ ) and therefore $\tilde{kl}_{t}$ serves as an un-normalized estimate for $kl(\mu_{2},\mu^{\prime}_{2})$ .

Let $C_{T}$ denote the following random variable,

[TABLE]

One may verify that $\mathbb{P}_{\mu^{\prime}_{2}}(C_{T}=1)=\mathbb{E}_{\mu_{2}}[C_{T}\exp(-\tilde{kl}_{N_{2,T}})]$ by applying a change of measure. We will now show that $\mathbb{P}_{\mu_{2}}(C_{T}=1)\rightarrow 0$ as $T\rightarrow\infty$ . This is due to the following:

[TABLE]

Therefore, setting $f_{T}=\frac{(1-\epsilon)\log T}{kl(\mu_{2},\mu^{\prime}_{2})}$ , and applying Markov inequality we get,

[TABLE]

The last step arises as a consequence of $T-N_{2,T}=N_{1,T}$ and agent 1 is sub-optimal for the setting where agent 2 has the mean reward of $\mu^{\prime}_{2}$ .

We will finally show that $\mathbb{P}_{\mu_{2}}(N_{2,T}<f_{T})\rightarrow 0$ .

[TABLE]

Note that $kl(\mu_{2},\mu^{\prime}_{2})>0$ and $\frac{1-\epsilon/2}{1-\epsilon}\geq 1$ . Therefore by an application of the strong law of large numbers, we have

[TABLE]

Since $\mathbb{P}_{\mu_{2}}(C_{T}=1)\rightarrow 0$ , we must have $\mathbb{P}_{\mu_{2}}(N_{2,T}<f_{T})\rightarrow 0$ as well. Applying Markov inequality again, we get,

[TABLE]

The last step is obtained by applying Equation 8. This completes the proof. Note the key difference between our proof and Bubeck2012a lies in Equation 8. Our RHS in Equation 8 is necessary to ensure that in the alternate scenario agent $1$ is sub-optimal.

Remark 1

The lower bound for the expected $\Delta$ -regret in Theorem 4.4 is quite similar to the lower bound for the regret of the UCB algorithm in Bubeck2012a . The difference is that the KL divergence term in the bound is also a function of the parameter $\Delta$ . Intuitively, instead of considering the KL divergence $kl(\mu_{i},\mu^{*})$ , we give an allowance of $\Delta$ for the optimal agent.

5 Extension to Multi-Slot SSA

In the previous sections, we assumed that there was a single slot for which the advertisers were competing. We now look at a more general setting where there are $M$ slots to be allocated to the $K$ agents. As before, each advertiser has exactly one ad for display and the CTR for advertisement $i$ is denoted by $\mu_{i}$ . Recall that in the case of single slot auctions, the CTR exactly denoted the probability with which an ad received a click. However in the generalized setting of multi-slot auctions, an additional parameter comes into play while computing the click probability due to which the problem becomes much harder gatti2015truthful .

Each position or slot $m$ is associated with a parameter $\lambda_{m}$ called ‘prominence’. $\lambda_{m}$ denotes the probability with which a user observes an ad at slot $m+1$ given he has observed the ad at slot $m$ . In order to understand the need for this parameter, a useful scenario to imagine is the listing of web-pages in Google for a query. There are two phases that one can think of once the listing of pages or results have appeared.

Phase 1: This is the phase where a user scans through the pages listed. A page listed higher up in the ranking (say second from the top) has more chances of being observed by a user rather than a page that is far below in the ranking (say fifth from the top). $\lambda_{4}$ , for instance, denotes the probability that a user observes the fifth page, given he has observed the fourth page. Coming back to sponsored ads, we assume that $\lambda_{0}=1$ , that is, the ad listed in the first slot is surely observed. We denote by $\Gamma_{m}$ the probability that an ad at slot $m$ is observed. $\Gamma_{m}$ is computed as,

[TABLE]

This modeling assumption for $\Gamma_{m}$ is known as position dependent cascade model.
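The cascade structure can be illustrated with a short computation. The sketch below assumes that the observation probability is the running product of the prominence parameters, $\Gamma_{m}=\prod_{s=0}^{m-1}\lambda_{s}$ with $\lambda_{0}=1$ , which is how the verbal description above reads; the function name and the list layout are hypothetical.

```python
from itertools import accumulate
from operator import mul


def observation_probabilities(prominence):
    """Return [Gamma_1, ..., Gamma_M] from [lambda_0, ..., lambda_{M-1}].

    Assumption of this sketch: Gamma_m is the product of lambda_s over
    s < m, and lambda_0 = 1 so that the first slot is surely observed.
    """
    return list(accumulate(prominence, mul))
```

For example, with $\lambda_{0}=1$ and $\lambda_{1}=\lambda_{2}=0.5$ , the three slots are observed with probabilities 1, 0.5 and 0.25, so the observation probability never increases with the slot index.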

Phase 2: After having scanned through the list, the user decides to click one or more of the shown ads. In the multi-slot setting Gatti2012 , it is assumed that multiple ads in a listing may receive clicks. The probability that ad $i$ receives a click when shown at slot $m$ is $\Gamma_{m}\mu_{i}$ .

We assume that $\lambda_{m}$ , $m=1,\ldots,M$ are known to the planner a-priori. The problem of learning these parameters along with the CTR $\mu$ is much harder in the presence of strategic agents. Therefore, in this section, we work with the assumption that the $\lambda$ s and hence $\Gamma$ s are known. In Section 6.2, we give pointers for design of mechanisms where the $\Gamma$ s are unknown.

The above modeling assumptions follow standard conventions gatti2015truthful . In the multi-slot setting, slots are allocated to multiple agents at every time step. We denote by $\mathcal{A}(b,\rho,t)\subset\{1,\ldots,K\}$ the allocation at time $t$ for bids $b$ and click realization $\rho$ . The cardinality of the allocated set is $|\mathcal{A}(b,\rho,t)|=M$ . We also use the notation $\mathcal{A}_{i}(b,\rho,t)=m$ to denote that agent $i$ is allocated slot $m$ at time $t$ for the bid profile $b$ and click realization $\rho$ . If an agent $i$ is not allocated any of the $M$ slots at time $t$ , we say $\mathcal{A}_{i}(b,\rho,t)=0$ .

We denote by $W_{i,m}$ the social welfare of agent $i$ , when he is given slot $m$ . $W_{i,m}$ is the expected valuation that agent $i$ receives when he is given slot $m$ and is computed as,

[TABLE]

For ease of reference, the additional relevant parameters for the multi-slot setting are provided in Table 3.

Having described the multi-slot setting, we now analyze the scenario from the view point of the search engine or central planner. In the ideal scenario, the planner would like to allot the ads exactly to the top $M$ agents with the largest social welfare. This use case has been studied in the literature Gatti2012 and exploration separated mechanisms with regret of $O(T^{2/3})$ have been proposed. Various possible allocations are explored for $O(T^{2/3})$ time steps for every agent after which the allocation algorithm is guaranteed to converge to the ideal allocation with high probability. As in the single slot case, $O(T^{2/3})$ exploration rounds are required to distinguish all the agents perfectly from each other, when there are agents whose social welfare values are arbitrarily close.

However, a much more practical problem of interest is to study and design mechanisms when the search engine is indifferent to a gap of $\Delta$ in social welfare for every slot. We observe that in cases where the agents are well-separated, $O(T^{2/3})$ exploration rounds are not required. In fact, $O(\log T)$ exploration rounds are sufficient to converge to an allocation that is well within the requirements of the search engine.

Having explained the problem, we now formalize the notion of separatedness in this setting. Let $K^{(1)},\ldots,K^{(M)}\in[K]$ be the best $M$ agents in terms of their single slot social welfare values, that is, $\mu_{K^{(1)}}\theta_{K^{(1)}}>\mu_{K^{(2)}}\theta_{K^{(2)}}>\ldots>\mu_{K^{(M)}}\theta_{K^{(M)}}$ . Let $W_{*,m}=W_{K^{(m)},m}$ . The ideal solution would be to allocate agent $K^{(m)}$ the slot $m$ . This allocation would yield the largest social welfare but in the worst case, when the agents’ social welfares are separated by a function of $T$ , converging to this optimal allocation would require $O(T^{2/3})$ exploration rounds Gatti2012 . Instead, for a prescribed value of $\Delta$ fixed by the search engine, define the set,

[TABLE]

$S_{\Delta,m}$ is the set of all agents whose social welfare is at most $\Delta$ away from the agent $K^{(m)}$ (who should have ideally been given slot $m$ ). The planner is indifferent to the regret contributed by the agents in $S_{\Delta,m}$ , if any of them are allotted slot $m$ . Hence we define the multi-slot $\Delta$ -regret metric as,

[TABLE]
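The sets $S_{\Delta,m}$ described above can be computed directly when the true parameters are known, which is useful for simulation. The sketch below is an illustration under two assumptions drawn from the surrounding text: the slot welfare is $W_{i,m}=\Gamma_{m}\mu_{i}\theta_{i}$ , and agent $i$ belongs to $S_{\Delta,m}$ when $W_{K^{(m)},m}-W_{i,m}<\Delta$ . The function name and the 0-based indexing are hypothetical.

```python
def delta_tolerant_sets(ctrs, valuations, gammas, delta):
    """Return [S_{Delta,1}, ..., S_{Delta,M}] as sets of 0-based agent indices.

    Assumptions of this sketch: W_{i,m} = Gamma_m * mu_i * theta_i, and
    agent i is in S_{Delta,m} when W_{K^(m),m} - W_{i,m} < Delta.
    """
    base = [mu * th for mu, th in zip(ctrs, valuations)]
    # ranking[m] is the index of the agent K^(m+1) in the paper's notation.
    ranking = sorted(range(len(base)), key=lambda i: base[i], reverse=True)
    sets = []
    for m, gamma in enumerate(gammas):
        ideal = gamma * base[ranking[m]]
        sets.append({i for i, w in enumerate(base) if ideal - gamma * w < delta})
    return sets
```

Every agent ranked above $K^{(m)}$ falls into $S_{\Delta,m}$ automatically because its welfare gap is negative, so each set contains at least $m$ agents.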

The $\Delta$ -UCB mechanism for the multi-slot SSA is given in Algorithm 2.

We analyze the regret and truthfulness of Algorithm 2. The lemmas and theorems for establishing the results for the multi-slot setting are similar to the single slot setting, however there are subtle differences in proving many of the results. We will highlight them as and when necessary.

Theorem 5.1

In the multi-slot setting, $\Delta$ -UCB is Dominant Strategy Incentive Compatible (DSIC) and Individually Rational (IR).

Proof

The mechanism is an implementation of the weighted VCG scheme (with the weights for each agent $w_{i}=\mu_{i}^{+}/\mu_{i}$ ) and is hence DSIC and IR.

Lemma 4

For an agent $i$ and slot $m$ , the click through rate UCB indices for agent $i$ ,

[TABLE]

satisfy $P(\mu_{i}\notin[\widehat{\mu}_{i,t}^{-},\widehat{\mu}_{i,t}^{+}])\leq 2T^{-4}\;\forall t$ .

Proof

At every time step, we observe samples $\rho_{I_{t,m}}(t),m=1,\ldots,M$ corresponding to the clicks of the allocated ads. These samples also encompass slot specific information which must be accounted for in the computation of empirical mean as well as UCB index for $\mu_{i}$ . For an agent $i$ , let the random variable $C_{i,m}$ denote whether ad $i$ receives a click at slot $m$ . Therefore $C_{i,m}$ is a Bernoulli random variable with bias $\Gamma_{m}\mu_{i}$ .

We obtain a sample $\rho_{i}(.)$ of $C_{i,m}$ when ad $i$ is allocated slot $m$ . However, it is the samples from $C_{i,m}/\Gamma_{m}$ that give us an unbiased estimator for $\mu_{i}$ . Therefore, the random variable of interest is the scaled Bernoulli random variable,

[TABLE]

$D_{i,m}$ is bounded in $[0,1/\Gamma_{m}]$ and $\mathbb{E}[D_{i,m}]$ is $\mu_{i}$ . Also,

[TABLE]

Consider the scenario where, for an ad $i$ , a single sample click is available from each slot. Let $X_{i,m}$ denote this sample of $C_{i,m}$ . Assume $X_{i,m}$ are all independent and $\widehat{\mu}_{i}=1/M\sum_{m=1}^{M}X_{i,m}/\Gamma_{m}$ . $\mathbb{E}[\widehat{\mu_{i}}]=\mu_{i}$ . Now,

[TABLE]

In order to tighten the above bound on the right hand side, one must find appropriate $\lambda$ which minimizes $\exp(\sum_{m=1}^{M}\frac{\lambda^{2}}{8\Gamma_{m}^{2}}-\lambda M\epsilon)$ . Setting $\lambda=\lambda^{*}=4M\epsilon/\eta$ where $\eta=\sum_{m=1}^{M}1/\Gamma_{m}^{2}$ achieves the minimum value. Therefore,

[TABLE]

In order to obtain a $\delta$ confidence on $P(\widehat{\mu_{i}}-\mu_{i}>\epsilon)$ , $\epsilon$ must be set so that $\exp(-2M^{2}\epsilon^{2}/\eta)=\delta=T^{-4}$ . Therefore, $\epsilon=\sqrt{\sum\limits_{m=1}^{M}\left(\frac{1}{\Gamma_{m}^{2}}\right)\frac{2\log T}{M^{2}}}$ . In the above analysis we assumed that from each slot, one sample was available. When we have a total of $N_{i,t}$ independent samples for ad $i$ at time $t$ , with $M_{i,t}^{(m)}$ of them coming from slot $m$ , we get $\eta=\sum_{m=1}^{M}M_{i,t}^{(m)}/\Gamma_{m}^{2}$ and therefore $\epsilon_{i,t}=\sqrt{\left(\sum\limits_{m^{\prime}=1}^{M}\frac{M_{i,t}^{(m^{\prime})}}{\Gamma_{m^{\prime}}^{2}}\right)\frac{2\log T}{N_{i,t}^{2}}}$ , completing the proof.
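The final expression for $\epsilon_{i,t}$ can be evaluated as follows. This is a minimal sketch with a hypothetical function name; the per-slot sample counts and the values of $\Gamma_{m}$ are supplied as plain lists.

```python
import math


def multislot_radius(slot_counts, gammas, horizon):
    """Confidence radius epsilon_{i,t} for the multi-slot CTR estimate.

    slot_counts[m] : number of samples of ad i obtained from slot m
    gammas[m]      : observation probability Gamma of slot m
    Implements sqrt(eta * 2 log T / N^2) with eta = sum_m count_m / Gamma_m^2.
    """
    total = sum(slot_counts)
    eta = sum(c / g ** 2 for c, g in zip(slot_counts, gammas))
    return math.sqrt(eta * 2.0 * math.log(horizon)) / total
```

When all samples come from the first slot with $\Gamma_{1}=1$ , this reduces to the single-slot radius $\sqrt{2\log T/N_{i,t}}$ . Samples from less prominent slots carry a larger scaling factor $1/\Gamma_{m}$ and therefore widen the interval.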

A noteworthy feature of our estimates is the following. An allocation of an ad $i$ in a slot $m$ yields a sample for the computation of not only $\widehat{W}_{i,m,t}$ , but also $\widehat{W}_{i,m^{\prime},t}$ for all slots $m^{\prime}\in\{1,\ldots,M\}$ . This is because $\Gamma_{m}$ is known to the planner a-priori. Consequently, $N_{i,t}$ , the number of allocations that ad $i$ receives till time $t$ , counts the allocations of agent $i$ across all the slots.
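The pooling of samples across slots can be sketched as follows. The code is illustrative: the function names are hypothetical, clicks are rescaled by $1/\Gamma_{m}$ as in the proof of Lemma 4, and the slot welfare is assumed to be $\Gamma_{m}\widehat{\mu}_{i}b_{i}$ in line with the definition of $W_{i,m}$ as an expected valuation.

```python
def pooled_ctr_estimate(samples, gammas):
    """Estimate mu_i from clicks observed in any slot.

    samples : list of (slot_index, click) pairs with click in {0, 1}
    gammas  : gammas[slot_index] is the known observation probability
    Each click is divided by Gamma of its slot so that it is unbiased for mu_i.
    """
    scaled = [click / gammas[slot] for slot, click in samples]
    return sum(scaled) / len(scaled)


def slot_welfare_estimates(ctr_estimate, bid, gammas):
    """Welfare estimate for every slot from a single pooled CTR estimate."""
    return [g * ctr_estimate * bid for g in gammas]
```

A single click record therefore updates the welfare estimates of the ad for all slots at once, because only the known factor $\Gamma_{m}$ differs between them.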

Lemma 5

For an agent $i$ and slot $m$ , the social welfare UCB indices for agent $i$ ,

[TABLE]

satisfy $P(W_{i,m}\notin[\widehat{W}_{i,m,t}^{-},\widehat{W}_{i,m,t}^{+}])\leq 2T^{-4}\;\forall t$ .

Proof

The proof idea is similar to Lemma 1.

Lemma 6

Suppose at time step $t$ , $N_{j,t}>\frac{8\theta_{max}^{2}\log T}{\Delta^{2}}\;\forall j\in[K]$ . Then $\forall i\in[K]$ and $\forall m\in[M]$ , $2\epsilon_{i,m,t}<\Delta.$

Proof

The proof is similar to Lemma 2.

Lemma 7

For an agent $i$ , slot $m$ and time $t$ , let $B_{i,m,t}$ be the event $B_{i,m,t}=\{\omega:W_{i,m}\notin[\widehat{W}_{i,m,t}^{-}(\omega),\widehat{W}_{i,m,t}^{+}(\omega)]\}$ . Define the event $G=\bigcap\limits_{t}\bigcap\limits_{i}\bigcap\limits_{m}B_{i,m,t}^{c}$ . Then $P(G)\geq 1-\frac{2}{T^{2}}$ .

Proof: The proof has some subtle differences from Lemma 3 because in the multi-slot extension, the events $B_{i,m,t}$ are not independent across the slots.

Observation: If an element $\omega$ from the set of outcomes is such that $\omega\in B_{i,m,t}$ , then $\omega\in B_{i,m^{\prime},t}\;\forall m^{\prime}\in[M]$ . This is because, for any two slots $m$ and $m^{\prime}$ ,

[TABLE]

Therefore $P(\bigcup_{m}B_{i,m,t})=P(B_{i,1,t})$ . From Lemma 5,

$P(\bigcup_{m}B_{i,m,t})=P(B_{i,1,t})\leq 2T^{-4}$ . Hence,

[TABLE]

Theorem 5.2

Suppose at time $t$ , $N_{j,t}>8\theta_{max}^{2}\log T/\Delta^{2}\;\forall j\in[K]$ . Then $\forall m\in[M],\forall i\in[K]\setminus S_{\Delta,m}$ , $\widehat{W}_{K^{(m)},m,t}^{+}>\widehat{W}_{i,m,t}^{+}$ with high probability (at least $1-2/T^{4}$ ).

Proof: Suppose at time $t$ , where $N_{j,t}>8\theta_{max}^{2}\log T/\Delta^{2}\;\forall j\in[K],$ there exists some $m\in[M]$ such that $\widehat{W}_{K^{(m)},m,t}^{+}<\widehat{W}_{i,m,t}^{+}$ . (Note that this statement does not arise from any assumptions on the allocation, for instance, that agent $i$ is given slot $m$ . This is the major difference from Theorem 4.2.) But the relation between the true social welfare values of these agents is $W_{K^{(m)},m}>W_{i,m}$ . Then one of the following three conditions must have occurred, as in the proof of Theorem 4.2.

Condition 1: $W_{i,m}<\widehat{W}_{i,m,t}^{-}$ . This condition implies a drastic overestimate of the sub-optimal arm $i$ so that the true mean social welfare $W_{i,m}$ is even below the LCB index $\widehat{W}_{i,m,t}^{-}$ . The figure below captures this condition.

Condition 2: $W_{K^{(m)},m}>\widehat{W}_{K^{(m)},m,t}^{+}$ . This implies an underestimate of the optimal arm so that the true mean social welfare $W_{K^{(m)},m}$ lies above even the UCB index $\widehat{W}_{K^{(m)},m,t}^{+}$ . See Figure 5 below.

Condition 3: $W_{K^{(m)},m}-W_{i,m}<2\epsilon_{i,m,t}$ . This implies an overlap in the confidence intervals of the optimal and sub-optimal arms. Even if Conditions 1 and 2 are false, the UCB of sub-optimal arm $i$ is still greater than the UCB of the optimal arm $K^{(m)}$ .

From the figure, $W_{K^{(m)},m}-W_{i,m}\leq\widehat{W}_{i,m,t}^{+}-\widehat{W}_{i,m,t}^{-}\leq\;2\epsilon_{i,m,t}$ . If all the three conditions above were false, then,

[TABLE]

As per the statement of the theorem, $N_{i,t}>8\theta_{max}^{2}\log T/\Delta^{2}$ . Therefore by Lemma 6, $2\epsilon_{i,m,t}<\Delta$ . For agent $i\in[K]\setminus S_{\Delta,m}$ , $W_{K^{(m)},m}-W_{i,m}>\Delta>2\epsilon_{i,m,t}$ . Therefore, Condition 3 above does not hold true. So,

[TABLE]

Theorem 5.3

If the $\Delta$ -UCB mechanism is executed in the multiple slot scenario for a total time horizon of $T$ rounds, it achieves an expected $\Delta$ -regret of $O(\log T)$ .

Proof

The proof idea has some subtle differences from the proof of Theorem 4.3. As before, we first compute the expected $\Delta$ -regret conditional on $G$ . For the exploration rounds, the mechanism obtains a regret of $\xi=\frac{8MK\theta_{max}^{3}\log T}{\Delta^{2}}$ .

[TABLE]

We will now show that the second term above evaluates to zero. For any $m$ , the cardinality of $S_{\Delta,m}$ is at least $m$ . This is because for all $K^{(j)}$ above $m$ in the ranking of agents ( $j<m$ ), $W_{K^{(m)},m}-W_{K^{(j)},m}<0<\Delta$ as $W_{K^{(j)},m}>W_{K^{(m)},m}$ . Therefore there are at least $m-1$ agents in $S_{\Delta,m}$ . Also $K^{(m)}\in S_{\Delta,m}$ as $W_{K^{(m)},m}-W_{K^{(m)},m}=0<\Delta$ . Therefore $\forall j\in\{1,\ldots,m\},K^{(j)}\in S_{\Delta,m}$ . While allocating slot $m$ , at least one of the agents in $S_{\Delta,m}$ must be free. This is by the pigeonhole principle. Now if the allocated agent for slot $m$ , $I_{t,m}\in[K]\setminus S_{\Delta,m}$ , one of the following two cases occur.

Case 1: The ideal agents $K^{(1)},\ldots,K^{(m-1)}$ for all the previous slots $\{1,\ldots,m-1\}$ have already been allocated before the allocation of slot $m$ . This means that $K^{(m)}$ has not been allocated yet. Also, $\widehat{W}_{(I_{t,m}),m,\gamma}^{+}>\widehat{W}_{K^{(m)},m,\gamma}^{+}$ . Since $G$ is true and $t>\gamma$ , the above event cannot occur (by Theorem 5.2).

Case 2: The agent $K^{(m)}$ has already been allocated to some other slot before the allocation of slot $m$ has begun. Therefore there is some agent $K^{(j)},j<m$ with a larger social welfare value, who has still not been allocated. That is, $W_{K^{(j)},m}>W_{K^{(m)},m}>W_{(I_{t,m}),m}$ . We are given that $I_{t,m}\notin S_{\Delta,m}$ , from which we can deduce that $I_{t,m}\notin S_{\Delta,j}$ . This is because,

[TABLE]

The last line in the above implications is true as $\Gamma_{j}>\Gamma_{m}$ . But $\widehat{W}_{K^{(j)},m,\gamma}^{+}<\widehat{W}_{(I_{t,m}),m,\gamma}^{+}$ . Then the inequality $\widehat{W}_{K^{(j)},j,\gamma}^{+}<\widehat{W}_{(I_{t,m}),j,\gamma}^{+}$ is also true due to the way the slot specific UCB indices are computed. From Theorem 5.2 for slot $j$ , we find that $\widehat{W}_{K^{(j)},j,\gamma}^{+}>\widehat{W}_{(I_{t,m}),j,\gamma}^{+}$ . Again this cannot happen as $G$ is true and $t>\gamma$ . Therefore we get that $\mathbb{E}\left[\Delta\text{-regret}|G\right]\leq\xi$ .

Also, $P(G^{c})=1-P(G)<\frac{2}{T^{2}}$ from Lemma 7.

Putting all the steps together,

[TABLE]

The simplification in the second line is because $\mathbb{E}\left[\Delta\text{-regret}|G^{c}\right]$ $\leq TM\theta_{max}$ . In the last line we use the fact that $M\ll T$ . This completes the proof.

6 Extensions to Other Variants of Multi-slot SSA

In this section, we look at other variants in the multi-slot SSA setting and discuss how our mechanism can be adapted to such settings.

6.1 Position and Ad Dependent Cascade Model

We have explained our algorithm and performed the analysis for the position dependent cascade model for SSA where the $\Gamma_{m}$ function is characterized by Equation 11 and is known to the planner a-priori. A more general model would be one where the function $\Gamma_{m}$ may also depend on the ad displayed at position $m$ . Our model can also be used in such scenarios and the same analysis will hold.

6.2 Handling the Case of Unknown $\Gamma_{m}$

We have assumed that the functions $\Gamma_{m}$ s are known to the planner a-priori. Now suppose that the $\Gamma_{m}$ s are required to be learnt. The same allocation scheme as in Algorithm 2 may be used. However the computation of the proposed payment scheme in Algorithm 2 is not feasible as the payments use $\Gamma_{m}$ s, which are unknown.

In order to handle such a scenario, we must obtain estimates for $\Gamma$ first. It is known that the parameter for the first slot is $\Gamma_{1}=1$ . Only $\Gamma_{2},\ldots,\Gamma_{M}$ need to be estimated. We will first describe a mechanism which relies on an arbitrary learning algorithm to provide estimates $\widehat{\Gamma}_{2},\ldots,\widehat{\Gamma}_{M}$ . Thereafter we will remark on the possible learning schemes.

Proposition 1

Suppose we have a learning scheme that gives us estimates $\widehat{\Gamma}_{2},\ldots,\widehat{\Gamma}_{M}$ such that $\widehat{\Gamma}_{2}\geq\widehat{\Gamma}_{3}\geq\ldots\geq\widehat{\Gamma}_{M}$ and $0\leq\widehat{\Gamma}_{m}\leq 1$ for $m=2,\ldots,M$. Let $\widehat{\Gamma}_{1}=1$.

We propose a weighted VCG mechanism [NISAN2007], which is known to be DSIC truthful and IR. Suppose the private valuation of agent $i$ for a click is $\theta_{i}$. Let $x\in\{0,1\}^{K\times M}$ be an outcome of the allocation, where $x_{im}=1$ if ad $i$ is allotted slot $m$ and $x_{im}=0$ otherwise. The valuation function of agent $i$ in this case is,

[TABLE]

Define a weight vector $w_{i}\in\mathbb{R}^{M}$ for every agent $i$, with one weight for each slot $m$, given by $w_{i,m}=\frac{\widehat{\mu}_{i}^{+}\widehat{\Gamma}_{m}}{\mu_{i}\Gamma_{m}}$. Here $\widehat{\mu}_{i}^{+}$ is the UCB index corresponding to the CTR of ad $i$, computed after the fixed number of exploration rounds as in Algorithm 2. In this scenario, however, the UCB index is constructed using samples of the clicks from allocations in the first slot alone.
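To see what these weights accomplish, suppose (an assumption made here, read off from the form of the weights rather than stated explicitly in the text) that agent $i$ obtains an expected value of $\theta_{i}\mu_{i}\Gamma_{m}$ when allotted slot $m$. Multiplying by the weight cancels the unknown quantities:

$$w_{i,m}\,\theta_{i}\mu_{i}\Gamma_{m}=\frac{\widehat{\mu}_{i}^{+}\widehat{\Gamma}_{m}}{\mu_{i}\Gamma_{m}}\,\theta_{i}\mu_{i}\Gamma_{m}=\theta_{i}\,\widehat{\mu}_{i}^{+}\,\widehat{\Gamma}_{m}.$$

Under this assumption, the weighted objective depends only on the valuations (or bids), the UCB indices and the estimates $\widehat{\Gamma}_{m}$, all of which are available to the planner.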

Our weighted VCG mechanism is described in Figure 7. The mechanism uses the allocation,

[TABLE]

Note, however, that this allocation rule reduces to the allocation used in Algorithm 2, because the estimates $\widehat{\Gamma}_{m}$ are monotonically non-increasing in $m$. The allocation $A^{*}(b_{i},b_{-i})$ is obtained as follows: we sort the agents by $\widehat{\mu}_{i}^{+}b_{i}$ and allocate the slots to the best $M$ agents. The allocation rule is therefore independent of the $\Gamma_{m}$'s and is equivalent to,

[TABLE]
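The sorting procedure described above can be illustrated with a minimal Python sketch. The function names, the toy numbers and the brute-force check are illustrative assumptions, not part of the mechanism's specification; the check assumes the weighted objective is $\sum_{m}\widehat{\mu}_{i_{m}}^{+}b_{i_{m}}\widehat{\Gamma}_{m}$ and only confirms, on a small instance, that sorting by $\widehat{\mu}_{i}^{+}b_{i}$ maximises it when the estimates are non-increasing.

```python
from itertools import permutations


def allocate_by_sorting(ucb, bids, num_slots):
    """Rank agents by ucb[i] * bids[i]; slot m goes to the m-th ranked agent."""
    scores = [u * b for u, b in zip(ucb, bids)]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:num_slots]


def allocate_by_brute_force(ucb, bids, gamma_hat):
    """Maximise sum_m ucb[i_m] * bids[i_m] * gamma_hat[m] over all assignments.

    Exponential in the number of agents; intended only as a check on toy inputs.
    """
    def weighted_welfare(assignment):
        return sum(ucb[i] * bids[i] * g for i, g in zip(assignment, gamma_hat))

    candidates = permutations(range(len(bids)), len(gamma_hat))
    return list(max(candidates, key=weighted_welfare))


# Toy instance (hypothetical numbers): 4 agents, 2 slots.
ucb = [0.30, 0.50, 0.20, 0.40]   # UCB indices on the CTRs
bids = [4.0, 2.0, 3.0, 1.0]      # reported valuations per click
gamma_hat = [1.0, 0.6]           # non-increasing estimates, first one fixed to 1

print(allocate_by_sorting(ucb, bids, len(gamma_hat)))
print(allocate_by_brute_force(ucb, bids, gamma_hat))
```

Both calls select the same agents in the same slot order on this instance, and the sorted allocation never reads `gamma_hat`, which is the sense in which the rule is independent of the estimates.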

The expected payment to be made by agent $i$ when allocated a slot $m^{\prime}$ is,

[TABLE]

The above is the externality-based payment prescribed by weighted VCG. However, since we adopt the pay-per-click scheme,

[TABLE]

Therefore, the computation of the payments is now feasible. The weighted VCG scheme described above is DSIC truthful and IR; the proof follows that of the standard weighted VCG scheme, with the weights as defined above. We now remark on the $\Delta$-regret of the mechanism.
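As an illustration of why the payments become computable, the following Python sketch evaluates a per-click externality payment using only bids, UCB indices and the estimates $\widehat{\Gamma}_{m}$. It is a reconstruction under stated assumptions, not the paper's exact payment rule: it assumes the weighted value of agent $i$ in slot $m$ is $b_{i}\widehat{\mu}_{i}^{+}\widehat{\Gamma}_{m}$, that removing an agent moves every lower-ranked agent up one slot, and that dividing the expected payment by the click probability cancels the unknown $\mu_{i}\Gamma_{m}$. The function name and the numbers are hypothetical.

```python
def per_click_payments(ucb, bids, gamma_hat):
    """Per-click externality payments for the agents ranked into the slots.

    Assumes every allotted agent has ucb > 0 and a slot with gamma_hat > 0.
    Returns a dict mapping agent index to payment per click.
    """
    scores = [u * b for u, b in zip(ucb, bids)]
    ranked = sorted(range(len(bids)), key=lambda i: scores[i], reverse=True)
    num_slots = len(gamma_hat)
    gamma_pad = list(gamma_hat) + [0.0]  # estimate "below" the last slot is 0
    payments = {}
    for pos in range(min(num_slots, len(ranked))):
        agent = ranked[pos]
        externality = 0.0
        # Each lower-ranked agent would move from slot m + 1 up to slot m.
        for m in range(pos, num_slots):
            if m + 1 < len(ranked):
                below = ranked[m + 1]
                externality += (gamma_pad[m] - gamma_pad[m + 1]) * scores[below]
        payments[agent] = externality / (ucb[agent] * gamma_hat[pos])
    return payments


# Toy instance (hypothetical numbers): 4 agents, 2 slots.
ucb = [0.30, 0.50, 0.20, 0.40]
bids = [4.0, 2.0, 3.0, 1.0]
gamma_hat = [1.0, 0.6]

payments = per_click_payments(ucb, bids, gamma_hat)
for agent, price in payments.items():
    print(agent, round(price, 4), "bid:", bids[agent])
```

In this toy instance each allotted agent is charged no more than its bid per click, which is consistent with individual rationality under truthful bidding.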

6.2.1 Remarks on Learning $\widehat{\Gamma}_{m}$ and Computation of $\Delta$-regret

In the above mechanism we have assumed that the estimates $\widehat{\Gamma}_{m}$ satisfy the conditions of Proposition 1. The allocation scheme described above ultimately does not rely on these estimates, although the weights $w_{i,m}$ use them. The mechanism therefore uses the estimates only in the payment rule. We now make an important observation.

Observation: When any set of estimates $\{\widehat{\Gamma}_{m}\}$, $m=1,\ldots,M$, satisfying the conditions of Proposition 1 is used in the mechanism above, the mechanism is DSIC truthful, IR and suffers only logarithmic $\Delta$-regret.

The reason is that the mechanism is an instance of the weighted VCG mechanism and is therefore DSIC truthful and IR for any estimates of the $\Gamma_{m}$'s. The $\Delta$-regret in social welfare is determined by the allocation rule, which is identical to the rule used when the $\Gamma_{m}$'s are known and is independent of the estimates. Note that it is now possible to minimise the regret in payments by choosing estimates $\widehat{\Gamma}_{m}$ that maximise the payments while satisfying the constraints in Proposition 1. This leads to a constrained optimization problem which can be solved. However, the current work focuses on minimising $\Delta$-regret in social welfare, and the problem of minimising regret in payments remains open.

7 Conclusion

We have studied a more practical use case of MAB mechanisms, where the planner can specify a tolerance level $\Delta$ for sub-optimal arms. Existing MAB mechanisms in the literature target the worst-case scenario where the arms are arbitrarily close. They therefore prescribe a large number of exploration rounds ($\Omega(T^{2/3})$) to perfectly distinguish the arms. However, the planner may not want to perfectly distinguish arms that are arbitrarily close; often, the planner may instead be willing to allocate arms that are at most $\Delta$ away from the best arm. The state of the art does not offer the planner this flexibility. To provide it, we have, for the first time, introduced a new notion of regret called $\Delta$-regret. When arms that are less than $\Delta$ away from the best arm are selected, the $\Delta$-regret incurred is zero. Only arms more than $\Delta$ away from the best arm contribute to the $\Delta$-regret.
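The notion of $\Delta$-regret summarised above can be made concrete with a small Python sketch. It assumes, for illustration only, that each arm has a known expected per-round value, that an arm more than $\Delta$ away from the best arm contributes its full gap in every round it is selected, and that a gap of exactly $\Delta$ contributes nothing; the function name and numbers are hypothetical.

```python
def delta_regret(arm_values, pulls, delta):
    """Sum the gaps of the selected arms, counting only gaps larger than delta."""
    best = max(arm_values)
    total = 0.0
    for arm in pulls:
        gap = best - arm_values[arm]
        if gap > delta:  # arms within delta of the best arm cost nothing
            total += gap
    return total


# Toy instance: arm 1 is within delta of the best arm, arm 2 is not.
values = [0.9, 0.85, 0.5]
print(delta_regret(values, [0, 1, 2, 2], delta=0.1))
```

Here only the two selections of the third arm contribute, each with a gap of about 0.4, whereas ordinary regret would also charge the selection of the second arm.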

From the above perspective, we have revisited the application of MAB mechanisms in sponsored search auctions. First we analysed the single slot SSA setting and proposed a deterministic, exploration-separated MAB mechanism called $\Delta$-UCB. We showed that $\Delta$-UCB is DSIC truthful, IR and achieves a $\Delta$-regret of $O(\log T)$. Next we studied the more challenging setting of multi-slot SSA. In particular, we adopted the cascade model and adapted $\Delta$-UCB to this setting, first with the assumption that the prominence parameters are known. Here too, we have shown that the mechanism is DSIC truthful, IR and achieves a $\Delta$-regret of $O(\log T)$. Finally, we adapted the mechanism to the general multi-slot SSA setting where neither the CTRs nor the prominences are known. Here too, our deterministic, exploration-separated mechanism is DSIC truthful, IR and suffers a $\Delta$-regret of $O(\log T)$. The other mechanisms in the literature for this setting do not obtain all of these desirable properties: they either compromise on truthfulness, satisfying a weaker notion (truthfulness in expectation), or resort to randomness in the mechanism.

Our results are generic and apply equally well to several other applications where MAB mechanisms have been used.

Bibliography


[1] Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In COLT, pages 39.1–39.26, 2012.
[2] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
[3] Moshe Babaioff, Robert D. Kleinberg, and Aleksandrs Slivkins. Truthful mechanisms with implicit payment computation. In Proceedings of the Eleventh ACM Conference on Electronic Commerce (EC’10), pages 43–52. ACM, 2010.
[4] Moshe Babaioff, Yogeshwer Sharma, and Aleksandrs Slivkins. Characterizing truthful multi-armed bandit mechanisms. SIAM Journal on Computing, 43(1):194–230, 2014.
[5] Satyanath Bhat, Divya Padmanabhan, Shweta Jain, and Yadati Narahari. A truthful mechanism with biparameter learning for online crowdsourcing: (extended abstract). In Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems (AAMAS’16), Singapore, May 9-13, 2016, pages 1385–1386, 2016.
[6] Arpita Biswas, Shweta Jain, Debmalya Mandal, and Y. Narahari. A truthful budget feasible multi-armed bandit mechanism for crowdsourcing time critical tasks. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems (AAMAS’15), pages 1101–1109, 2015.
[7] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
[8] Sébastien Bubeck, Nicolò Cesa-Bianchi, and Gábor Lugosi. Bandits with heavy tail. IEEE Transactions on Information Theory, 59(11):7711–7717, 2013.
