Adaptive Hedging under Delayed Feedback

Alexander Korotin; Vladimir V'yugin; Evgeny Burnaev

arXiv:1902.10433·cs.LG·June 25, 2019


TL;DR

This paper introduces a new adaptive hedging algorithm for online expert weight allocation that effectively handles delayed feedback, extending classical algorithms and providing theoretical regret bounds in adversarial settings.

Contribution

We develop the General Hedging algorithm $\mathcal{G}$ based on exponential reweighing, extending the classical Hedge and Fixed Share algorithms to delayed feedback scenarios.

Findings

1. The algorithm $\mathcal{G}$ achieves adversarial loss bounds under delay.

2. It extends the classical Hedge and Fixed Share algorithms to delayed feedback.

3. It provides loss bounds for both countable and continuous expert sets, and regret bounds for the countable case.

Abstract

The article is devoted to investigating the application of hedging strategies to online expert weight allocation under delayed feedback. As the main result, we develop the General Hedging algorithm $G$ based on the exponential reweighing of experts' losses. We build the artificial probabilistic framework and use it to prove the adversarial loss bounds for the algorithm $G$ in the delayed feedback setting. The designed algorithm $G$ can be applied to both countable and continuous sets of experts. We also show how algorithm $G$ extends classical Hedge (Multiplicative Weights) and adaptive Fixed Share algorithms to the delayed feedback and derive their regret bounds for the delayed setting by using our main result.

Equations

$\bm{w}_{t}=\{w_{t}^{n}\text{ for }n\in\mathcal{N}\}\in\Delta(\mathcal{N}).$

$h_{t}=\sum_{n\in\mathcal{N}}l_{t}^{n}\cdot w_{t}^{n}=\langle\bm{w}_{t},\bm{l}_{t}\rangle,$

$R_{T}=H_{T}-\min_{n\in\mathcal{N}}L_{T}^{n}.$ (1)

$p(\bm{l}_{t}|n_{t})=p(l_{t}^{n_{t}}|n_{t})=\frac{e^{-\eta l_{t}^{n_{t}}}}{Z},$ (2)

$p(\bm{l}_{t}|n_{t})=p(l_{t}^{n_{t}}|n_{t})\times\bigg[\prod_{n\neq n_{t}}p(l_{t}^{n}|n_{t})\bigg]=\frac{e^{-\eta l_{t}^{n_{t}}}}{Z}\times\frac{1}{H^{N-1}},$ (3)

$L_{t}^{N_{t}}=\sum_{\tau=1}^{t}l_{\tau}^{n_{\tau}}.$

$\bm{L}_{t}=(\bm{l}_{1},\dots,\bm{l}_{t}),\quad\bm{L}_{\mathcal{D}_{t}}=\{\bm{l}_{\tau}\text{ for }\tau\in\mathcal{D}_{t}\},\quad\bm{L}_{d\mathcal{D}_{t}}=\{\bm{l}_{\tau}\text{ for }\tau\in d\mathcal{D}_{t}\}.$

$p(N_{T},\bm{L}_{T})=p(N_{T})\cdot p(\bm{L}_{T}|N_{T})=\bigg[p_{0}(n_{1})\prod_{t=2}^{T}p(n_{t}|N_{t-1})\bigg]\cdot\bigg[\prod_{t=1}^{T}p(\bm{l}_{t}|n_{t})\bigg].$

$H_{T}\leq-\frac{1}{\eta}\ln\bigg[\mathbb{E}_{p(N_{T})}\big[e^{-\eta L_{T}^{N_{T}}}\big]\bigg]+\eta\frac{H^{2}}{8}T+\eta\bigg[\frac{H^{2}\cdot\sum_{t=1}^{T}D_{t}}{4}\bigg].$ (5)

$R_{T}(N_{T}^{*})=H_{T}-L_{T}^{N_{T}^{*}}\leq-\frac{1}{\eta}\ln p(N_{T}^{*})+\eta\frac{H^{2}}{8}T+\eta\bigg[\frac{H^{2}\cdot\sum_{t=1}^{T}D_{t}}{4}\bigg].$ (6)

$-\frac{1}{\eta}\ln\bigg[\mathbb{E}_{p(N_{T})}\big[e^{-\eta L_{T}^{N_{T}}}\big]\bigg]\leq-\frac{1}{\eta}\ln\bigg[p(N_{T}^{*})e^{-\eta L_{T}^{N_{T}^{*}}}\bigg]=L_{T}^{N_{T}^{*}}-\frac{1}{\eta}\ln p(N_{T}^{*}),$

$w_{t}^{n}\propto p_{0}(n)\cdot e^{-\eta L_{\mathcal{D}_{t}}^{n}}$

$R_{T}(n)=H_{T}-L_{T}^{n}\leq-\frac{1}{\eta}\ln p_{0}(n)+\eta\frac{H^{2}}{8}T+\eta\bigg[\frac{H^{2}\cdot\sum_{t=1}^{T}D_{t}}{4}\bigg].$

$\eta\propto\frac{1}{H\sqrt{T+\sum_{t=1}^{T}D_{t}}}\quad\text{or}\quad\eta\propto\frac{1}{H\sqrt{T(1+\mathbb{E}D)}}$

$p(n_{t}|n_{t-1})=\alpha_{t}p_{0}(n_{t})+(1-\alpha_{t})\cdot\mathbb{I}_{[n_{t}=n_{t-1}]}$

$w_{t}^{n}=p(n_{t}=n|\bm{L}_{\mathcal{D}_{t-1}})=p(n_{t}=n|\bm{L}_{t-1}).$

$w_{t}^{n}=p(n_{t}=n|\bm{L}_{t-1})=\sum_{n^{\prime}\in\mathcal{N}}\bigg[p(n_{t}=n|n_{t-1}=n^{\prime})\cdot\underbrace{p(n_{t-1}=n^{\prime}|\bm{L}_{t-1})}_{u_{t-1}^{n^{\prime}}}\bigg].$

$\bm{w}_{t}=(1-\alpha_{t})\cdot\bm{u}_{t-1}+\alpha_{t}\cdot p_{0}.$

$u_{t}^{n}=p(n_{t}=n|\bm{L}_{t})=\frac{p(\bm{L}_{t}|n_{t}=n)\cdot p(n_{t}=n)}{p(\bm{L}_{t})}=\frac{p(\bm{l}_{t}|n_{t}=n)\cdot p(\bm{L}_{t-1}|n_{t}=n)\cdot p(n_{t}=n)}{p(\bm{L}_{t})}=\underbrace{p(n_{t}=n|\bm{L}_{t-1})}_{w_{t}^{n}}\cdot\underbrace{p(\bm{l}_{t}|n_{t}=n)}_{\exp(-\eta\cdot l_{t}^{n})}\cdot\bigg[\frac{p(\bm{L}_{t-1})}{p(\bm{L}_{t})}\bigg].$

$\bm{u}_{t}\propto\bm{w}_{t}\cdot\exp(-\eta\cdot\bm{l}_{t}).$

$\mathcal{D}_{t}^{\tau}=\{t^{\prime}\mid(t^{\prime}+D_{t^{\prime}}\leq t)\land(t^{\prime}\leq\tau)\}.$

$\bm{w}_{t}=(1-\alpha_{t})\cdot\bm{u}_{t-1}+\alpha_{t}\cdot p_{0}.$

$v^{n_{\tau}}=p(n_{\tau}|\bm{L}_{\mathcal{D}_{t}^{\tau-1}})=\sum_{n_{\tau-1}\in\mathcal{N}}p(n_{\tau-1}|\bm{L}_{\mathcal{D}_{t}^{\tau-1}})\cdot p(n_{\tau}|n_{\tau-1})=\sum_{n_{\tau-1}\in\mathcal{N}}\bigg[p(n_{\tau-1}|\bm{L}_{\mathcal{D}_{t}^{\tau-1}})\cdot\big[\alpha_{\tau}\cdot p_{0}(n_{\tau})+(1-\alpha_{\tau})\cdot\mathbb{I}_{[n_{\tau}=n_{\tau-1}]}\big]\bigg]=\underbrace{\bigg[\sum_{n_{\tau-1}\in\mathcal{N}}p(n_{\tau-1}|\bm{L}_{\mathcal{D}_{t}^{\tau-1}})\bigg]}_{\text{Sums to }1}\cdot\alpha_{\tau}\cdot p_{0}(n_{\tau})+(1-\alpha_{\tau})\cdot u_{\tau-1}^{n_{\tau}}=\alpha_{\tau}\cdot p_{0}(n_{\tau})+(1-\alpha_{\tau})\cdot u_{\tau-1}^{n_{\tau}}.$

$\bm{v}=(1-\alpha_{\tau})\cdot\bm{u}_{\tau-1}+\alpha_{\tau}\cdot p_{0}.$

$u_{\tau}^{n_{\tau}}=p(n_{\tau}|\bm{L}_{\mathcal{D}_{t}^{\tau}})\propto p(\bm{L}_{\mathcal{D}_{t}^{\tau}}|n_{\tau})\cdot p(n_{\tau})=p(\bm{L}_{\mathcal{D}_{t}^{\tau-1}}|n_{\tau})\cdot p(\bm{l}_{\tau}|n_{\tau})\cdot p(n_{\tau})\propto p(n_{\tau}|\bm{L}_{\mathcal{D}_{t}^{\tau-1}})\cdot p(\bm{l}_{\tau}|n_{\tau})\propto p(n_{\tau}|\bm{L}_{\mathcal{D}_{t}^{\tau-1}})\cdot\exp(-\eta l_{\tau}^{n_{\tau}})=v^{n_{\tau}}\cdot\exp(-\eta l_{\tau}^{n_{\tau}}).$

$\bm{u}_{\tau}\propto\bm{v}\cdot\exp(-\eta\cdot\bm{l}_{\tau}).$

Full text

Adaptive Hedging under Delayed Feedback

Alexander Korotin, Vladimir V’yugin, Evgeny Burnaev

Skolkovo Institute of Science and Technology

Abstract

The article is devoted to investigating the application of hedging strategies to online expert weight allocation under delayed feedback. As the main result we develop the General Hedging algorithm $\mathcal{G}$ based on the exponential reweighing of experts’ losses. We build the artificial probabilistic framework and use it to prove the adversarial loss bounds for the algorithm $\mathcal{G}$ in the delayed feedback setting. The designed algorithm $\mathcal{G}$ can be applied to both countable and continuous sets of experts. We also show how algorithm $\mathcal{G}$ extends classical Hedge (Multiplicative Weights) and adaptive Fixed Share algorithms to the delayed feedback and derive their regret bounds for the delayed setting by using our main result.

Keywords: hedging, decision-theoretic online learning, experts problem, delayed feedback, adaptive algorithms, non-replicating algorithms, adversarial setting.

Journal: Neurocomputing

1 Introduction

We consider the Decision-Theoretic Online Learning (DTOL) framework [1, 2, 3, 4, 5], which is closely related to the paradigm of prediction with expert advice [6, 7, 8, 9, 1, 10, 11, 12]. At every step ${t=1,\dots,T}$ of the game, a master algorithm has to choose the weight allocation for a given pool of expert strategies (experts). We call this problem the experts problem. We investigate the adversarial case, i.e., no assumptions are made about the nature of the data (stochastic, deterministic, etc.).

The performance of the master algorithm is measured by the regret over the entire game. The regret $R_{T}$ is the difference between the cumulative loss of the online algorithm and the loss of some given comparator. A typical comparator is the best fixed expert in the pool or the best fixed convex linear combination of experts. The goal of the algorithm is to minimize the regret, i.e., $R_{T}\rightarrow\min$ .

In classical online learning, the loss of the algorithm’s decision at step $t$ is revealed at the end of that same step. In contrast, we consider learning with delayed feedback: at each step $t$ of the game the algorithm makes a decision, and its loss is revealed only at the end of step $t+D_{t}$ (where $D_{t}\geq 0$ is some delay).

A wide range of algorithms exists for the non-delayed scenario ( $D_{t}\equiv 0$ ). Almost all of them exploit the follow-the-best-expert idea: the better an expert performed in the past, the higher the relative weight assigned to it. The pure Follow the Leader (FTL) strategy is well known to perform well in the stochastic setting [5], but it may be inefficient when the data is generated by an adversary (see the discussion in [13, 14]). (Footnote 1: The pure Follow the Leader strategy is known to be minimax optimal in the simplest stochastic setting, where experts’ losses are i.i.d. across experts and time steps.)

The Follow the Perturbed Leader (FTPL) algorithm [15] adds random noise to the expert evaluation process, which prevents overfitting in the adversarial setting. For example, exponential [15], random-walk [3], and dropout [16] noise have been shown to achieve low expected regret for the experts problem.

Follow the Regularized Leader (FTRL) is a powerful algorithm from the online convex optimization framework [18], [13]. Using a linear loss function on a simplex makes it applicable to the experts problem. Quadratic regularization leads to the Online Gradient Descent (OGD) algorithm [13], while entropic regularization yields the Exponential Weights algorithm, also known as Hedge [2]. (Footnote 2: Equivalently, Online Mirror Descent, see [17].)

The idea of multiplicative weight updates (MW) of the Hedge algorithm is used in many subsequent algorithms (MW2 [4], Variation-MW [19], Optimistic-MW [20], AEG-Path and AMEG-Path [21] and other algorithms [22, 23]). The main goal of such algorithms is to obtain a first- or second-order regret bound (e.g. in terms of the best expert’s loss) or to achieve an improvement on easy data. Also, some Hedge-based algorithms (AdaHedge [24], Flip-Flop [14]) are designed to be parameter-free.

Almost all of the described algorithms provide $O(\sqrt{T})$ adversarial regret guarantees w.r.t. the best expert in the pool. Note that this bound is minimax optimal up to a multiplicative factor because $\Omega(\sqrt{T})$ is known to be the lower bound [6]. (Footnote 3: More precisely, the lower bound is $\Omega(\sqrt{T\ln N})$ , where $N$ is the number of experts in the finite pool.)

An important variant of the experts problem is to develop an adaptive master algorithm. Such an algorithm has to track the shifts (switches) of the best expert and achieve low tracking regret with respect to shifting sequences of experts. There are many meta-approaches, such as restarts [25, 26] or specialist experts [27], to create adaptive algorithms from non-adaptive ones. However, the most recognizable approach is to use the Fixed Share extension of Hedge [28, 29, 25, 11]. (Footnote 4: Sometimes in online learning the term adaptive means that the algorithm dynamically changes its learning rate during the game. That is not the meaning used here.)

When it comes to the delayed feedback setting, many of the above described non-delayed algorithms do not have theoretical guarantees of performance or do not even have a modification for the delayed feedback setting.

There exist several meta-algorithms that produce a delayed-feedback version of a basic non-delayed algorithm [30, 31, 32, 33]. The roots of the meta-approach lie in the work [30], whose authors studied the setting with a fixed, known feedback delay $D$ . They proved that the optimal (non-adaptive) algorithm is to run $D+1$ independent versions of the optimal non-delayed algorithm on $D+1$ disjoint time grids $GR_{d}=\{t\mid t\equiv d\pmod{D+1}\}$ for $1\leq d\leq D+1$ . Thus, the optimal worst-case adversarial regret is $(D+1)\cdot\Omega(\sqrt{\frac{T}{D+1}})=\Omega(\sqrt{T(1+D)})$ . The described meta-approach was extended to unknown and dynamic feedback delays in [33]. The meta-algorithm BOLD (Black-box Online Learning with Delays) proposed there also runs independent copies of the basic algorithm on disjoint time lines.
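The time-grid construction can be illustrated with a short sketch. It assumes a fixed, known delay and only shows how steps are assigned to learners; the helper names `grid_index` and `split_into_grids` are hypothetical, and residues are labelled 0..D rather than 1..D+1.

```python
def grid_index(t: int, delay: int) -> int:
    """Return which of the delay + 1 independent learners handles step t (t >= 1)."""
    return t % (delay + 1)


def split_into_grids(horizon: int, delay: int) -> dict[int, list[int]]:
    """Partition steps 1..horizon into delay + 1 disjoint time grids."""
    grids: dict[int, list[int]] = {d: [] for d in range(delay + 1)}
    for t in range(1, horizon + 1):
        grids[grid_index(t, delay)].append(t)
    return grids


grids = split_into_grids(horizon=10, delay=2)
for d, steps in grids.items():
    print(d, steps)
```

Consecutive steps on the same grid are $D+1$ apart, so the feedback for step $t$ (which arrives at the end of step $t+D$ ) is already available when the same learner acts again at step $t+D+1$ . Each learner therefore sees an ordinary non-delayed game, but on only a $\frac{1}{D+1}$ fraction of the data.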

We call algorithms obtained by meta-approaches (such as BOLD) replicated algorithms. Whereas replicating is simple and in some cases is theoretically optimal, it has several obvious practical drawbacks. Firstly, it uses only part of the observed data at every step of the game. Secondly, separate replicating learning processes generated by the meta-algorithm do not even interact.

Non-adaptive algorithms based on FTRL and FTPL have several non-replicated adaptations for delayed feedback setting. The most straightforward ones are Delayed OGD [34], Delayed FTPL and FTRL [35] and FTRL with Memory [36]. For the fixed and known feedback delay $D$ their best regret bound is $O(\sqrt{T(1+D)})$ , which is optimal.

In this work, we aim to create an adaptive non-replicated algorithm for the delayed feedback setting. We base our research on the Hedge algorithm (and its adaptive extension Fixed Share), which is the state-of-the-art basis for many existing algorithms. To achieve this goal, we develop a general probabilistic framework for Hedge-based algorithms. Using this framework, we propose the General Hedging Algorithm $\mathcal{G}$ and prove its loss bounds for both the delayed and non-delayed cases. As a corollary of the main result, we show how the classical non-delayed Hedge and Fixed Share algorithms (as special cases of $\mathcal{G}$ ) can be extended to the delayed feedback setting and what regret bounds they have.

The main contributions of this paper are:

1. Developing the General Hedging algorithm $\mathcal{G}$ for the delayed feedback scenario which is applicable to both non-delayed and delayed online settings. Proving the algorithm’s loss bound (and regret bound, for the case of a countable set of experts) in a general form.

2. Developing (for a finite number of experts) non-replicated versions of basic Hedge [2] and adaptive Fixed Share [28] algorithms (as special cases of algorithm $\mathcal{G}$ ) for the delayed feedback scenario as well as deriving their regret bounds.

The General Hedging algorithm $\mathcal{G}$ which we develop is motivated by the paper [10]. In that work the authors considered the special case of the prediction with experts’ advice with the logarithmic loss function. For the traditional non-delayed scenario ( $D_{t}\equiv 0$ ) they developed the Bayesian Merging Algorithm for mixing (averaging) experts’ predictions. Their algorithm is based on the natural graphical model (similar to the one in Figure 1 of Section 3) implied by the probabilistic origin of the logarithmic loss function.

In contrast to [10], we consider the decision-theoretic online learning scenario (hedging), which is more general than prediction with experts’ advice. At the same time we investigate both non-delayed and delayed feedback settings. We build the artificial probabilistic framework for arbitrary bounded losses by using the entropification transform (loss exponentiation, see e.g. [37, 38]), state the General Hedging algorithm $\mathcal{G}$ and prove its loss bound. (Footnote 5: The hedging scenario assumes that the learner has access only to the losses of experts, while in prediction with experts’ advice the learner knows the experts’ predictions and observes true outcomes (the losses are computed by using the known loss function). Prediction with experts’ advice can be reduced to hedging by discarding the experts’ predictions and using only the computed losses of the experts.)

The article is structured as follows:

In Section 2 we give preliminary notions, describe the notation and the setting of the game of the delayed feedback experts’ weights allocation.

In Section 3 we describe the developed probabilistic framework, the main algorithm $\mathcal{G}$ , and formulate the main Theorem 1 about its loss bound. In Section 5 we prove the main theorem.

In Section 4 we provide the examples of the application of algorithm $\mathcal{G}$ : Delayed Hedge in Subsection 4.1, Delayed Fixed Share in Subsection 4.2.

In Section 6 we conduct extensive computational experiments and discuss the results in detail.

In Appendix A we provide the necessary mathematical background.

2 Preliminaries

We use bold font to denote vectors (e.g. $\bm{w}\in\mathbb{R}^{M}$ for some integer $M$ ). In most cases, superscript is used for indexing elements of a vector (e.g. ${(w^{1},\dots,w^{N})=\bm{w}}$ ). Subscript is always used to indicate time (e.g. $l_{t},R_{T},w_{t}^{n}$ ).

We consider the online game of delayed hedging of a (finite or infinite) pool of experts. We use $\mathcal{N}$ to denote the pool and $n\in\mathcal{N}$ as an index of an expert. In this paper $\mathcal{N}$ is either a discrete set (e.g. ${\mathcal{N}=\{1,2,\dots,N\}}$ ) or a continuous subset of Euclidean space (e.g. ${\mathcal{N}=\mathbb{R}^{M}}$ ). By $\Delta(\mathcal{N})$ for a discrete (continuous) set $\mathcal{N}$ we denote all discrete (continuous) probability distributions on $\mathcal{N}$ .

For convenience, we do all calculations in the paper assuming that $\mathcal{N}$ is a discrete countable set. All the results also hold true for continuous $\mathcal{N}$ , but sums over $n$ (e.g. ${\sum_{n\in\mathcal{N}}[\ldots]}$ ) should be replaced with the corresponding integrals (e.g. ${\int_{n\in\mathcal{N}}[\ldots]\cdot dn}$ ).

At each integer time step ${t=1,2,\dots,T}$ of the game the master (hedging) algorithm has to assign the weights $w^{n}_{t}$ to all experts ${n\in\mathcal{N}}$ so that

$\bm{w}_{t}=\{w_{t}^{n}\text{ for }n\in\mathcal{N}\}\in\Delta(\mathcal{N}).$

At the end of step $t+D_{t}$ (for integer $D_{t}\geq 0$ ) the experts reveal their losses ${\bm{l}_{t}=\{l_{t}^{n}\text{ for }n\in\mathcal{N}\}}$ for step $t$ . The loss of the algorithm’s decision at step $t$ is

$h_{t}=\sum_{n\in\mathcal{N}}l_{t}^{n}\cdot w_{t}^{n}=\langle\bm{w}_{t},\bm{l}_{t}\rangle,$

i.e., the average experts’ loss w.r.t. $\bm{w}_{t}$ .

The sequence $D_{1},D_{2},\dots,D_{T}$ is called the sequence of delays. For simplicity, we assume that ${t+D_{t}\leq T}$ for all $t=1,2,\dots,T$ . In particular, $D_{T}=0$ . We denote the set of all time indices of the losses revealed before the end of the step $t$ by ${\mathcal{D}_{t}=\{\tau|\tau+D_{\tau}\leq t\}}$ . Also, we denote ${d\mathcal{D}_{t}=\mathcal{D}_{t}\setminus\mathcal{D}_{t-1}}$ .
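As a small illustration of these definitions, the sets $\mathcal{D}_{t}$ and $d\mathcal{D}_{t}$ can be computed directly from a delay sequence. The function names and the example delays below are assumptions made for this sketch only.

```python
def revealed_set(delays: list[int], t: int) -> set[int]:
    """D_t: steps tau (1-based) whose losses are revealed by the end of step t."""
    return {tau for tau, d in enumerate(delays, start=1) if tau + d <= t}


def newly_revealed(delays: list[int], t: int) -> set[int]:
    """dD_t = D_t minus D_{t-1}: steps whose losses arrive exactly at the end of step t."""
    return revealed_set(delays, t) - revealed_set(delays, t - 1)


delays = [2, 0, 1, 1, 0]  # D_1..D_5, chosen so that t + D_t <= T = 5
for t in range(1, len(delays) + 1):
    print(t, sorted(revealed_set(delays, t)), sorted(newly_revealed(delays, t)))
```

With these delays nothing is known after step 1, the loss of step 2 arrives before the loss of step 1, and the last step reveals the losses of steps 4 and 5 together.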

There are many scenarios on how the sequence $D_{t}$ is chosen (randomly, adversarially) and whether it is known to the learner in advance or not (see e.g. [30, 31, 32, 39, 33]). Yet, we do not specify the particular scenario, and consider the game in the general form.

In this work we assume that all the losses are bounded: ${l_{t}^{n}\in[0,H]}$ for all ${t=1,\dots,T}$ and ${n\in\mathcal{N}}$ . This is a common assumption in online learning (see [13, 18] or any other survey on online learning). The game setting is described by the following Protocol 1.

We use $H_{T}=\sum_{t=1}^{T}h_{t}$ and $L_{T}^{n}=\sum_{t=1}^{T}l_{t}^{n}$ to denote the cumulative (total) loss of the algorithm and expert $n\in\mathcal{N}$ .

The performance of the algorithm is measured by the (cumulative) regret. The regret is the difference between the cumulative loss of the algorithm and the cumulative loss of some given comparator. A typical approach is to compete with the best expert in the pool. The cumulative regret with respect to the best expert is

$R_{T}=H_{T}-\min_{n\in\mathcal{N}}L_{T}^{n}.$ (1)

The goal of the algorithm is to minimize the regret, i.e., ${R_{T}\rightarrow\min}$ . In order to theoretically guarantee algorithm’s performance, some upper bound is usually proved for the cumulative regret ${R_{T}\leq f(T)}$ .

In the basic setting (1), sub-linear upper bound $f(T)$ for the regret leads to the asymptotic performance of the algorithm equal to the performance of the best expert. More precisely, we have $\lim_{T\rightarrow\infty}\frac{R_{T}}{T}=0$ .
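The quantities of this section can be computed as follows for a finite pool. This is only a sketch: the function name `cumulative_regret` and the toy loss matrix are assumptions, and NumPy arrays of shape (T, N) are used for weights and losses.

```python
import numpy as np


def cumulative_regret(weights: np.ndarray, losses: np.ndarray) -> float:
    """R_T = H_T - min_n L_T^n for (T, N) arrays of weights and losses."""
    step_losses = np.sum(weights * losses, axis=1)   # h_t = <w_t, l_t>
    algorithm_loss = step_losses.sum()               # H_T
    expert_losses = losses.sum(axis=0)               # L_T^n for each expert n
    return float(algorithm_loss - expert_losses.min())


losses = np.array([[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
uniform = np.full_like(losses, 0.5)
print(cumulative_regret(uniform, losses))
```

In this toy game the uniform allocation suffers $H_{T}=1.5$ while the best expert suffers $1$ , so the regret is $0.5$ .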

3 Generalized Hedging Algorithm

In this section we describe the generalization $\mathcal{G}$ of the classical hedging algorithm based on exponential reweighing of experts’ losses. The basic algorithm was introduced by [2].

We investigate the adversarial case, i.e., no assumptions (stochastic, functional, etc.) are made about the nature of data (experts’ losses). However, it turns out that in this case it is convenient to develop algorithms using some probabilistic framework.

3.1 Probabilistic Framework

Recall that $\bm{l}_{t}=\{l_{t}^{n}\text{ for }n\in\mathcal{N}\}$ is a dictionary of experts’ losses at the step $t$ . The framework that we build implies that data is generated by some probabilistic model with hidden states. The graphical model is shown in Figure 1.

We suppose that there is some hidden sequence of experts $n_{t}\in\mathcal{N}$ (for ${t=1,2,\dots,T}$ ) that generates the experts’ losses $\bm{l}_{t}$ . The hidden expert $n_{t}$ at step $t$ is called the active expert. The conditional probability to observe the vector $\bm{l}_{t}$ of experts’ losses at the step $t$ is

$p(\bm{l}_{t}|n_{t})=p(l_{t}^{n_{t}}|n_{t})=\frac{e^{-\eta l_{t}^{n_{t}}}}{Z},$ (2)

where $\eta>0$ is some fixed learning rate and $Z=\int_{l\in[0,H]}e^{-\eta l}dl$ is the normalizing constant. Constant $Z$ is independent of both $n_{t}$ and $t$ . The idea of conditional probability (2) is to assume that if at the step $t$ expert $n_{t}\in\mathcal{N}$ is active, then the loss vector $\bm{l}_{t}=(l_{t}^{1},l_{t}^{2},...,l_{t}^{N})$ is not completely random, i.e. loss $l_{t}^{n_{t}}$ is random, while all the other components are deterministic (e.g. given by nature).

(Footnote 6: Another definition of conditional probability is also possible. All the elements $(l_{t}^{1},l_{t}^{2},...,l_{t}^{N})$ can be considered as independent random variables. If expert $n_{t}\in\mathcal{N}$ is active, the probability of observing $l_{t}^{n_{t}}$ is equal to the current right-hand side of equation (2). All the other losses are i.i.d. uniform variables on $[0,H]$ . For the case of finite $\mathcal{N}$ the formula (2) is replaced by

$p(\bm{l}_{t}|n_{t})=p(l_{t}^{n_{t}}|n_{t})\times\bigg[\prod_{n\neq n_{t}}p(l_{t}^{n}|n_{t})\bigg]=\frac{e^{-\eta l_{t}^{n_{t}}}}{Z}\times\frac{1}{H^{N-1}},$ (3)

i.e. it has an additional factor of $H^{N-1}$ in the denominator. However, for an infinite number of experts this approach requires a more detailed specification of probabilities in terms of measures, because the denominator becomes infinite. In Section 5 we will see that the exact value of the normalization constant $Z$ is important neither for the algorithm, nor for its regret bound. Thus, for convenience it is reasonable to consider the model (2).)
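For reference, the normalizing constant $Z$ in (2) has a simple closed form. The evaluation below is a routine integral added here for illustration and is not taken from the source.

```latex
Z = \int_{0}^{H} e^{-\eta l}\, dl
  = \left[ -\frac{1}{\eta} e^{-\eta l} \right]_{0}^{H}
  = \frac{1 - e^{-\eta H}}{\eta}.
```

The result depends only on $\eta$ and $H$ , which agrees with the statement that $Z$ is independent of both $n_{t}$ and $t$ .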

For the first active expert $n_{1}$ some known prior distribution is given: $p(n_{1})=p_{0}(n_{1})$ . The sequence $(n_{1},\dots,n_{T})$ of active experts is generated step by step. For $t\in\{1,\dots,T-1\}$ each $n_{t+1}$ is sampled from some known distribution $p(n_{t+1}|N_{t})$ , where $N_{t}=(n_{1},\dots,n_{t})$ . Thus, active expert $n_{t+1}$ depends on the previous experts $N_{t}$ . (Footnote 7: In case $p(n_{t+1}|N_{t})=p(n_{t+1}|n_{t})$ , we obtain a traditional Hidden Markov Process: the hidden state at step $t+1$ depends only on the previous hidden state at step $t$ .)

For every sequence of experts $N_{t}=(n_{1},n_{2},\dots,n_{t})$ we denote the cumulative loss of the sequence by

$L_{t}^{N_{t}}=\sum_{\tau=1}^{t}l_{\tau}^{n_{\tau}}.$

For all $t$ we define the following lists of loss vectors:

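$\bm{L}_{t}=(\bm{l}_{1},\dots,\bm{l}_{t}),\quad\bm{L}_{\mathcal{D}_{t}}=\{\bm{l}_{\tau}\text{ for }\tau\in\mathcal{D}_{t}\},\quad\bm{L}_{d\mathcal{D}_{t}}=\{\bm{l}_{\tau}\text{ for }\tau\in d\mathcal{D}_{t}\}.$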

The considered probabilistic model is:

$p(N_{T},\bm{L}_{T})=p(N_{T})\cdot p(\bm{L}_{T}|N_{T})=\bigg[p_{0}(n_{1})\prod_{t=2}^{T}p(n_{t}|N_{t-1})\bigg]\cdot\bigg[\prod_{t=1}^{T}p(\bm{l}_{t}|n_{t})\bigg].$

The probability $p(N_{T})$ is that of hidden states (active experts). (Footnote 8: The form $p({N}_{T})=p_{0}(n_{1})\prod_{t=2}^{T}p(n_{t}|{N}_{t-1})$ is used only for convenience and association with the online scenario. It does not impose any restrictions on the type of probability distribution. In fact, $p(N_{T})$ may be any distribution on $\mathcal{N}^{T}$ .)

3.2 General Hedging Algorithm

The hedging algorithm (Algorithm 2) is shown below. We denote it by $\mathcal{G}=\mathcal{G}(p)$ ( $\mathcal{G}$ stands for General), where $p$ indicates the probability distribution $p({N}_{T})$ of active experts to which the algorithm is applied.

The idea of the algorithm $\mathcal{G}$ is simple: set the weight allocation $\bm{w}_{t}$ for the current step $t$ according to the posterior probability $p(n_{t}|\bm{L}_{\mathcal{D}_{t-1}})$ of the expert $n_{t}$ computed from the underlying probabilistic model. We illustrate this idea in Figure 2.

Consider a finite pool $\mathcal{N}=\{1,\dots,N\}$ and set $p_{0}(n_{1})\equiv\frac{1}{N}$ for all $n_{1}\in\mathcal{N}$ . Consider the non-delayed scenario ( $D_{t}\equiv 0$ for all $t$ ). If we use ${p(n_{1}=n_{2}=\dots=n_{T})\equiv 1}$ , the experts’ weights become $w_{t}^{n}\propto e^{-\eta L_{t-1}^{n}}$ . The resulting algorithm $\mathcal{G}(p)$ turns out to be the classical non-delayed Hedge (for a more detailed discussion see Subsection 4.1). Also, non-delayed Fixed Share is a special case of $\mathcal{G}$ for a specially chosen Markovian $p(\cdot)$ (see Subsection 4.2).

The time and memory complexities of the algorithm depend on the properties of the underlying distribution $p$ . For Markovian models (when hidden state $n_{t}$ depends only on the previous state $n_{t-1}$ for all $t$ ) it is possible to provide linear in $T+\sum_{t=1}^{T}D_{t}$ schemes to compute weights (see Subsection 4.2) which require $O(N\cdot\max_{t}D_{t})$ memory. For the arbitrary $p(\cdot)$ time and memory complexity may be even exponential.
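To make the exponential worst case concrete, the posterior weights can be computed by brute-force enumeration of all $N^{T}$ expert sequences. The sketch below is an illustration under the model (2), not the authors' implementation: the function name, its arguments and the 0-based expert indices are assumptions, and losses of steps that are not yet revealed are marginalized out (each such factor integrates to one).

```python
import itertools
import numpy as np


def general_hedging_weights(prior_over_sequences, losses, revealed, t, eta):
    """Weights w_t^n = p(n_t = n | revealed losses) by enumerating all sequences.

    prior_over_sequences: function mapping a tuple (n_1..n_T) of 0-based experts
        to its prior probability p(N_T).
    losses: (T, N) array; row tau - 1 holds the loss vector of step tau.
    revealed: 1-based steps whose losses are already known when acting at step t.
    """
    horizon, n_experts = losses.shape
    weights = np.zeros(n_experts)
    for seq in itertools.product(range(n_experts), repeat=horizon):
        # Loss of the hidden sequence restricted to the revealed steps.
        seq_loss = sum(losses[tau - 1, seq[tau - 1]] for tau in revealed)
        weights[seq[t - 1]] += prior_over_sequences(seq) * np.exp(-eta * seq_loss)
    return weights / weights.sum()


def constant_prior(seq):
    """Uniform prior over the 3 constant sequences (the active expert never changes)."""
    return 1.0 / 3 if len(set(seq)) == 1 else 0.0


losses = np.array([[0.2, 0.9, 0.5], [0.1, 0.8, 0.4], [0.7, 0.3, 0.6]])
w = general_hedging_weights(constant_prior, losses, revealed=[1, 2], t=3, eta=1.0)
hedge = np.exp(-1.0 * losses[:2].sum(axis=0))
hedge /= hedge.sum()
print(np.allclose(w, hedge))
```

With the constant-sequence prior the enumeration reproduces the Hedge weights $w_{t}^{n}\propto e^{-\eta L_{t-1}^{n}}$ , while the loop itself visits $N^{T}$ sequences, which is why structured (e.g. Markovian) priors are needed in practice.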

3.3 Guarantees of Performance

The algorithm has theoretical guarantees of performance. We state the following main theorem.

Theorem 1 (Adversarial loss bound for algorithm $\mathcal{G}$ ).

Let $\mathcal{N}$ be a countable (or continuous) set of experts. Let $p(\cdot)$ be a discrete (or continuous) distribution on $\mathcal{N}^{T}$ . Then for the hedging algorithm $\mathcal{G}$ applied to model $p$ with learning rate $\eta>0$ the following upper bound for the total loss over the entire game holds true:

$H_{T}\leq-\frac{1}{\eta}\ln\bigg[\mathbb{E}_{p(N_{T})}\big[e^{-\eta L_{T}^{N_{T}}}\big]\bigg]+\eta\frac{H^{2}}{8}T+\eta\bigg[\frac{H^{2}\cdot\sum_{t=1}^{T}D_{t}}{4}\bigg].$ (5)

The proof of this theorem is given in Section 5. Note that while the algorithm may seem to be designed for the stochastic setting, we apply it to the purely adversarial case and obtain the loss guarantees. At the same time, the adversarial loss bound (5) depends on the probability distribution $p(\cdot)$ for which the algorithm is designed. (Footnote 9: The only assumption is that the losses are bounded, i.e. $l_{t}^{n}\in[0,H]$ for all $t=1,\dots,T$ and $n\in\mathcal{N}$ .)

One may wonder how Theorem 1 is applied to estimate the regret, for example, the regret with respect to the best expert (1). If the set $\mathcal{N}$ of experts is countable, then the following simple corollary holds true.

Corollary 1 (Adversarial regret bound for algorithm $\mathcal{G}$ ).

If the set of experts $\mathcal{N}$ is countable, then under the conditions of Theorem 1, the regret with respect to any sequence ${{N}_{T}^{*}=(n_{1}^{*},n_{2}^{*},\dots,n_{T}^{*})\in\mathcal{N}^{T}}$ is

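$R_{T}(N_{T}^{*})=H_{T}-L_{T}^{N_{T}^{*}}\leq-\frac{1}{\eta}\ln p(N_{T}^{*})+\eta\frac{H^{2}}{8}T+\eta\bigg[\frac{H^{2}\cdot\sum_{t=1}^{T}D_{t}}{4}\bigg].$ (6)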

Proof.

The corollary results from the following inequality for the expectation in the right-hand side of (5):

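$-\frac{1}{\eta}\ln\bigg[\mathbb{E}_{p(N_{T})}\big[e^{-\eta L_{T}^{N_{T}}}\big]\bigg]\leq-\frac{1}{\eta}\ln\bigg[p(N_{T}^{*})e^{-\eta L_{T}^{N_{T}^{*}}}\bigg]=L_{T}^{N_{T}^{*}}-\frac{1}{\eta}\ln p(N_{T}^{*}),$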

which leads to the desired bound.∎

If $\mathcal{N}$ is continuous under the conditions of Theorem 1, then the first term in the upper bound (5) is represented by the integral (instead of a countable sum). It is not possible to extract a single summand as in the finite case. However, sometimes the expectation can be directly computed or estimated w.r.t. the loss of the best expert in the pool. For example, see approaches of [40, 41, 42] applied to Online Kernel Regression.

For a fixed learning rate, the regret bound (6) is a linear function of the game length $T$ . Nevertheless, if the game length $T$ is known in advance, one may achieve a sub-linear regret bound by choosing the learning rate $\eta$ depending on $T$ . Particular examples of learning rates $\eta=\eta(T)$ for specific underlying distributions $p(\cdot)$ are provided in Section 4.

4 Examples

In this Section we provide the examples of useful underlying probability models $p(\cdot)$ and use them to apply the algorithm $\mathcal{G}$ to construct online expert weight allocation algorithms. We consider a finite pool of experts ${\mathcal{N}=\{1,2,\dots,N\}}$ .

4.1 Basic Delayed Exponential Weights (Hedge)

Consider the following underlying probability $p$ . Let ${p(n_{1})=p_{0}(n_{1})}$ be some prior and ${p(n_{t}|n_{t-1})=\mathbb{I}_{[n_{t}=n_{t-1}]}}$ for ${t=2,\dots,T}$ . This means that the hidden active expert does not change during the game. We denote the corresponding algorithm applied to $p$ by ${\mathcal{G}_{\text{base}}=\mathcal{G}_{\text{base}}(p_{0})}$ . The corresponding graphical model is shown in Figure 3.

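It is easy to see that for all $t$ the weight allocation

$w_{t}^{n}\propto p_{0}(n)\cdot e^{-\eta L_{\mathcal{D}_{t}}^{n}}$

depends on the expert $n\in\mathcal{N}$ only through its prior probability and its cumulative observed loss $L_{\mathcal{D}_{t}}^{n}$ : the weight decays exponentially with the observed loss. If there are no delays ( $D_{t}\equiv 0$ for all $t$ ), then the algorithm becomes the classical Hedge of [2].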

4.1.1 Algorithm

The pseudo-code of Algorithm 3 ( $\mathcal{G}_{\text{base}}$ ) is shown below. In the code we assume that the operation Output( $\ldots$ ) sets the weight allocation ( $\bm{w}_{t}$ ) for the current step. Function GetRevealedLosses() obtains all the vectors of losses ${\bm{l}_{\tau}=(l_{\tau}^{1},\dots,l_{\tau}^{N})}$ of the steps ${\tau\in d\mathcal{D}_{t}}$ in the form of an iterable list of pairs ${(\tau,\bm{l}_{\tau})}$ .

The algorithm requires $O(N)$ memory and $O(NT)$ time complexity.
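A minimal Python sketch of such a learner is given below. It is an illustration rather than the paper's Algorithm 3: the class and method names are assumptions, `output` plays the role of Output( $\ldots$ ), and `update` consumes the pairs that GetRevealedLosses() would return. Only a running sum of revealed losses is stored, which matches the $O(N)$ memory claim.

```python
import numpy as np


class DelayedHedge:
    """Sketch of exponential weights over the losses revealed so far."""

    def __init__(self, prior: np.ndarray, eta: float):
        self.log_prior = np.log(prior)
        self.eta = eta
        self.revealed_loss = np.zeros_like(prior, dtype=float)  # sum over revealed steps

    def output(self) -> np.ndarray:
        """Weights proportional to p_0(n) * exp(-eta * revealed cumulative loss)."""
        scores = self.log_prior - self.eta * self.revealed_loss
        scores -= scores.max()          # shift for numerical stability
        weights = np.exp(scores)
        return weights / weights.sum()

    def update(self, revealed: list[tuple[int, np.ndarray]]) -> None:
        """Absorb the (tau, l_tau) pairs revealed at the end of the current step."""
        for _tau, loss_vector in revealed:
            self.revealed_loss += loss_vector


rng = np.random.default_rng(0)
T, N, H = 50, 3, 1.0
delays = [min(2, T - t) for t in range(1, T + 1)]  # hypothetical delays with t + D_t <= T
losses = rng.uniform(0.0, H, size=(T, N))
learner = DelayedHedge(prior=np.full(N, 1.0 / N), eta=0.1)
total_loss = 0.0
for t in range(1, T + 1):
    w = learner.output()                            # uses losses revealed up to step t - 1
    total_loss += float(w @ losses[t - 1])
    arrived = [(tau, losses[tau - 1]) for tau in range(1, t + 1)
               if tau + delays[tau - 1] == t]
    learner.update(arrived)
regret = total_loss - losses.sum(axis=0).min()
print(regret)
```

The order in which delayed losses arrive does not matter here, because the weights depend only on the sum of the revealed loss vectors.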

4.1.2 Regret bound

According to Corollary 1, the regret of the algorithm with respect to any fixed expert $n\in\mathcal{N}$ is bounded:

$R_{T}(n)=H_{T}-L_{T}^{n}\leq-\frac{1}{\eta}\ln p_{0}(n)+\eta\frac{H^{2}}{8}T+\eta\bigg[\frac{H^{2}\cdot\sum_{t=1}^{T}D_{t}}{4}\bigg].$

The typical prior is $p_{0}\equiv\frac{1}{N}$ . For this basic case in the non-delayed feedback setting ( $D_{t}\equiv 0$ ), $\eta$ is chosen in advance (with prior knowledge of $T$ ) to minimize the regret bound. The optimal choice is $\eta\propto\frac{1}{H\sqrt{T}}$ , which results in the classical $O(\sqrt{T})$ regret.

However, the choice of optimal $\eta$ in the delayed setting highly depends on how the sequence of delays is generated. If the learner knows $\sum_{t=1}^{T}D_{t}$ in advance or $D_{t}$ is sampled from some distribution with known expectation $\mathbb{E}D$ , the optimal choice is

[TABLE]

respectively. This choice results in $O(\sqrt{T+\sum_{t=1}^{T}D_{t}})$ and $O(\sqrt{T(1+\mathbb{E}D)})$ regret bounds respectively.

If the sequence of delays is chosen by an adversary, the classical choice $\eta\propto\frac{1}{H\sqrt{T}}$ results in $O[\sqrt{T}(1+\overline{D})]$ regret, where $\overline{D}=\frac{1}{T}\sum_{t=1}^{T}D_{t}$ .
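The orders quoted in this subsection can be traced with a short balancing argument. The sketch below assumes that the regret bound has the generic shape $A/\eta+\eta B$ , where $A$ collects the prior term (for example $\ln N$ ) and $B\propto H^{2}(T+\sum_{t=1}^{T}D_{t})$ collects the terms that grow with the game length and the delays. The symbols $A$ , $B$ , $c$ and this shape are assumptions made for illustration; the exact constants are those of Corollary 1.

```latex
\begin{align*}
\min_{\eta>0}\Big(\frac{A}{\eta}+\eta B\Big) &= 2\sqrt{AB}
  \quad\text{at}\quad \eta^{*}=\sqrt{A/B}\;\propto\;\frac{1}{H\sqrt{T+\sum_{t=1}^{T}D_{t}}},\\
\eta=\frac{c}{H\sqrt{T}}\;\Longrightarrow\;
\frac{A}{\eta}+\eta B &= \frac{AH\sqrt{T}}{c}+\frac{cB}{H\sqrt{T}}
  = O\Big(\sqrt{T}+\frac{T+\sum_{t=1}^{T}D_{t}}{\sqrt{T}}\Big)
  = O\big(\sqrt{T}\,(1+\overline{D})\big).
\end{align*}
```

The first line gives the $O(\sqrt{T+\sum_{t}D_{t}})$ rate when the total delay is known; the second shows why a delay-agnostic $\eta$ pays a factor $(1+\overline{D})$ instead of $\sqrt{1+\overline{D}}$ .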

4.2 Adaptive Delayed Exponential Weights (Fixed Share)

Consider the following underlying probability $p$ . Let ${p(n_{1})=p_{0}(n_{1})}$ be some prior and

$p(n_{t}|n_{t-1})=(1-\alpha_{t})\cdot\mathbb{I}_{[n_{t}=n_{t-1}]}+\alpha_{t}\cdot p_{0}(n_{t})\qquad(7)$

for $t=2,\dots,T$ and sequence $0\leq\alpha_{2},\dots,\alpha_{T}\leq 1$ . This means that the hidden active expert changes to random (according to prior $p_{0}$ ) between steps $t-1$ and $t$ with some small probability $\alpha_{t}$ .

We denote the corresponding algorithm applied to $p$ by $\mathcal{G}_{\text{fs}}$ . The graphical model is shown in Figure 4.

The sequence $\alpha_{t}$ can be arbitrary. However, the classical approach is to use $\alpha_{t}=\frac{1}{t}$ (see [28, 25, 29]), because in this special case the regret bound is better (than e.g. in the case $\alpha_{t}\equiv const$ ). In our case at the end of the subsection we will also use the sequence $\alpha_{t}=\frac{1}{t}$ when estimating the regret.

4.2.1 Equivalence to Fixed Share in the Non-Delayed Setting

To begin with, we examine the application of algorithm $\mathcal{G}$ to the described probabilistic model $p(\cdot)$ in the non-delayed case, i.e., $D_{t}\equiv 0$ for all ${t=1,2,\dots,T}$ . In the non-delayed case we have $\mathcal{D}_{t}=\{1,2,\dots,t\}$ for all $t$ . Thus, for all $t$ we get

[TABLE]

We set ${\bm{u}_{t}=(u_{t}^{1},\dots,u_{t}^{N})\in\Delta(\mathcal{N})}$ , $u_{t}^{n}=p(n_{t}=n|\bm{L}_{t})$ for all ${n\in\mathcal{N}}$ and ${t=1,2,\dots,T}$ . We get

[TABLE]

Combining (8) with (7) we see that

[TABLE]

On the other hand,

[TABLE]

Thus,

[TABLE]

Formulas (9) and (10) mean that the algorithm’s decision $\bm{w}_{t}$ can be iteratively updated step by step by using the additional weight $\bm{u}_{t}$ . The obtained weight updates (9) and (10) exactly match the updates of the Fixed Share algorithm by [28]. Thus, in the non-delayed case $\mathcal{G}(p)$ is equal to Fixed Share.
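For illustration, the two-stage update (loss update to $\bm{u}_{t}$ , then mixing with the prior to obtain $\bm{w}_{t+1}$ ) can be sketched in Python as follows. The function name `fixed_share` and the 0-based indexing of `alphas` are assumptions of this sketch; the mixing step assumes that a switch redraws the expert from $p_{0}$ with probability $\alpha_{t+1}$ , as described at the start of this subsection.

```python
import math

def fixed_share(losses, prior, eta, alphas):
    """Non-delayed Fixed Share sketch.

    losses[t] is the loss vector of step t+1 (0-based storage); alphas[t] is
    the switch probability used when entering step t+1 (alphas[0] is unused).
    Returns the list of weight vectors w_1, ..., w_T.
    """
    w = list(prior)                       # w_1 = p_0
    history = []
    for t, loss in enumerate(losses):
        history.append(w)
        # Loss update: u_t^n is proportional to w_t^n * exp(-eta * l_t^n)
        raw = [wn * math.exp(-eta * ln) for wn, ln in zip(w, loss)]
        z = sum(raw)
        u = [r / z for r in raw]
        # Mixing update: keep the expert w.p. 1 - alpha, redraw from p_0 w.p. alpha
        if t + 1 < len(losses):
            a = alphas[t + 1]
            w = [(1.0 - a) * un + a * pn for un, pn in zip(u, prior)]
    return history
```

With all switch probabilities set to zero the mixing step is the identity and the sketch reduces to plain Hedge.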

4.2.2 Algorithm for the Delayed Setting

Now we examine the algorithm $\mathcal{G}(p)$ under the setting of the delayed feedback, i.e., for all ${t=1,2,\dots,T}$ delay $D_{t}$ is some non-negative integer value.

For all $t=1,2,\dots,T$ and $\tau\leq t$ we use $\mathcal{D}_{t}^{\tau}$ to denote the set of all time steps $t^{\prime}\leq\tau$ such that the loss vector $\bm{l}_{t^{\prime}}$ is revealed not later than the step $t$ . Formally, we define

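$\mathcal{D}_{t}^{\tau}=\{t^{\prime}:\;t^{\prime}\leq\tau\text{ and }t^{\prime}+D_{t^{\prime}}\leq t\}=\mathcal{D}_{t}\cap\{1,2,\dots,\tau\}.$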
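As a small illustration of this notation, the sets can be computed directly from the delays. The sketch assumes 1-based steps, that the loss of step $t^{\prime}$ is revealed at step $t^{\prime}+D_{t^{\prime}}$ , and that $d\mathcal{D}_{t}=\mathcal{D}_{t}\setminus\mathcal{D}_{t-1}$ ; the helper names `revealed_set` and `newly_revealed` are hypothetical.

```python
def revealed_set(delays, t, tau=None):
    """Return D_t^tau = {t' <= tau : t' + D_{t'} <= t} with 1-based steps.

    delays[t' - 1] holds D_{t'}. With tau omitted this is D_t itself.
    """
    if tau is None:
        tau = t
    return {s for s in range(1, tau + 1) if s + delays[s - 1] <= t}


def newly_revealed(delays, t):
    """Return dD_t, the steps whose losses arrive exactly at step t."""
    return revealed_set(delays, t) - revealed_set(delays, t - 1)
```

For the delays $(0,2,0,0)$ the loss of step 2 arrives only at step 4, so `revealed_set(delays, 3)` is $\{1,3\}$ while `newly_revealed(delays, 4)` is $\{2,4\}$ .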

In the next few paragraphs we describe an efficient scheme to recompute the algorithm’s decision $\bm{w}_{t}$ at every step $t$ .

Suppose that at the beginning of the step $t$ we keep all the probabilities $p(n_{\tau}|\bm{L}_{\mathcal{D}_{t-1}^{\tau}})$ for all $\tau=1,\dots,t-1$ and $n\in\mathcal{N}$ . We denote the corresponding $N$ -dimensional probability vectors by $\bm{u}_{\tau}$ . We also denote $\bm{u}_{0}=\bm{p}_{0}$ . Calculations similar to (8) lead to a simple formula for $\bm{w}_{t}$ :

[TABLE]

After the decision on $\bm{w}_{t}$ is made, the algorithm obtains the losses of steps ${\tau\in d\mathcal{D}_{t}}$ . Thus, we need to calculate the new probability vector $\bm{u}_{t}$ with coordinates $p(n_{t}|\bm{L}_{\mathcal{D}_{t}^{t}})$ . Moreover, we have to update all vectors $\bm{u}_{\tau}$ for $\tau<t$ from $p(n_{\tau}|\bm{L}_{\mathcal{D}_{t-1}^{\tau}})$ to $p(n_{\tau}|\bm{L}_{\mathcal{D}_{t}^{\tau}})$ .

Let $\tau_{\min}=\min\{\tau:\tau\in d\mathcal{D}_{t}\}$ . Note that the vectors $\bm{u}_{\tau}$ for $\tau<\tau_{\min}$ need not be updated because $\mathcal{D}_{t}^{\tau}=\mathcal{D}_{t-1}^{\tau}$ . Next, for $\tau=\tau_{\min},\dots,t-1,t$ we recompute the vectors $\bm{u}_{\tau}$ iteratively.

We explain how to compute $\bm{u}_{\tau}$ below (assuming that previous $\bm{u}_{\tau-1}$ is already computed). For convenience, we introduce the temporary vector variable ${\bm{v}=(v^{1},\dots,v^{N})\in\Delta(\mathcal{N})}$ , where $v^{n}=p(n_{\tau}=n|\bm{L}_{\mathcal{D}_{t}^{\tau-1}})$ for all ${n\in\mathcal{N}}$ . First, we express $\bm{v}$ using $\bm{u}_{\tau-1}$ . Next, we express $\bm{u}_{\tau}$ using $\bm{v}$ .

We deduce the formula to compute $v^{n_{\tau}}$ by using $u_{\tau-1}^{n_{\tau}}$ :

[TABLE]

In line (11) we exploit the fact that the elements of $\mathcal{D}_{t}^{\tau-1}$ are strictly lower than $\tau$ . In line (12) we use the definition (7) of the transition probability. The vector form of (13) is

[TABLE]

To derive $\bm{u}_{\tau}$ using $\bm{v}$ we consider two cases: $\tau\notin\mathcal{D}_{t}$ and $\tau\in\mathcal{D}_{t}$ . In the first case ${\bm{L}_{\mathcal{D}_{t}^{\tau-1}}\equiv\bm{L}_{\mathcal{D}_{t}^{\tau}}}$ , which leads to $\bm{u}_{\tau}=\bm{v}$ . If $\tau\in\mathcal{D}_{t}$ , we have

[TABLE]

The vector form of expression (14) is

[TABLE]

The pseudo-code of Algorithm 4 ( $\mathcal{G}_{\text{fs}}$ ) is shown below. In addition to the notations of Algorithm 3 ( $\mathcal{G}_{\text{base}}$ ), we assume that an extra function GetSwitchProbability() provides the value of the current switch probability $0\leq\alpha_{t}\leq 1$ (which may be chosen online).

The List() class corresponds to a dynamic array. We assume that it is indexed by the integers $\{0,1,\dots,|List|-1\}$ and supports the append-to-right operation in $O(1)$ time. We also assume that getting or setting a list element by index requires $O(1)$ time. At the end of each step $t$ the list $\bm{u}$ keeps the posterior probabilities described above.
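The recomputation scheme of this subsection can be sketched in Python as follows. The sketch assumes 1-based steps, the mixing rule $\bm{v}=(1-\alpha_{\tau})\bm{u}_{\tau-1}+\alpha_{\tau}\bm{p}_{0}$ followed by an exponential loss update whenever the loss of step $\tau$ is already revealed, and Python lists in place of List(). The class name `DelayedFixedShare` and its methods are hypothetical, and the code is an illustration rather than the authors' Algorithm 4.

```python
import math

class DelayedFixedShare:
    """Sketch of G_fs; u[tau] stores the posterior of n_tau given the revealed losses up to tau."""

    def __init__(self, prior, eta):
        self.prior = list(prior)
        self.eta = eta
        self.u = [list(prior)]       # u[0] = p_0
        self.alpha = [0.0]           # alpha[tau]; index 0 is a placeholder
        self.loss = [None]           # loss[tau] stays None until it is revealed

    def _mix(self, u_prev, a):
        # v = (1 - alpha) * u_{tau-1} + alpha * p_0
        return [(1.0 - a) * x + a * p for x, p in zip(u_prev, self.prior)]

    def output(self, alpha_t):
        """Start step t: register alpha_t (GetSwitchProbability) and return w_t."""
        self.alpha.append(alpha_t)
        self.loss.append(None)
        w = self._mix(self.u[-1], alpha_t)
        self.u.append(w)             # provisional u_t: the loss of step t is not seen yet
        return w

    def update(self, revealed):
        """Finish step t: `revealed` is a list of pairs (tau, l_tau)."""
        if not revealed:
            return                   # cheap step: u_t = w_t is already stored
        for tau, l in revealed:
            self.loss[tau] = list(l)
        t = len(self.u) - 1
        tau_min = min(tau for tau, _ in revealed)
        for tau in range(tau_min, t + 1):
            v = self._mix(self.u[tau - 1], self.alpha[tau])
            if self.loss[tau] is not None:
                raw = [x * math.exp(-self.eta * l)
                       for x, l in zip(v, self.loss[tau])]
                z = sum(raw)
                v = [r / z for r in raw]
            self.u[tau] = v
```

A step with no newly revealed losses costs $O(N)$ ; otherwise the loop touches the vectors from $\tau_{\min}$ to $t$ , in line with the complexity discussion that follows. The sketch keeps every vector, so it does not include the memory reduction described later.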

The time complexity of the algorithm is bounded by

[TABLE]

Indeed, at the steps $t$ such that $|d\mathcal{D}_{t}|=0$ the algorithm performs $O(N)$ operations. At other steps the algorithm performs

[TABLE]

operations which are bounded by $O(N\cdot D_{\tau})$ for the minimal $\tau\in d\mathcal{D}_{t}$ .

The memory complexity of the algorithm is $O(NT)$ . However, it is possible to significantly reduce the memory complexity. Note that if for some $\tau\leq t$ we have $\mathcal{D}_{t}^{\tau}=\{1,2,\dots,\tau\}$ , i.e., all losses up to the step $\tau$ are already revealed, then the weights $\bm{u}_{0},\dots,\bm{u}_{\tau-1}$ will never be used or recomputed after the step $t$ . Thus, they become useless, and it suffices to keep only the elements $\bm{u}_{t^{\prime}}$ with $t^{\prime}\geq\tau$ (same for lists $\bm{l}$ and $\bm{\alpha}$ ). The reduction results in $O(N\cdot\max D_{t})$ memory complexity. We did not include this trick in the pseudo-code of Algorithm 4 in order to keep it simple.

4.2.3 Regret Bound

We use $\alpha_{t}=\frac{1}{t}$ . We combine Corollary 1 with Lemma 3 and obtain the regret bound for the algorithm with respect to any switching sequence ${{N}_{T}=(n_{1},n_{2},\dots,n_{T})}$ :

[TABLE]

where $K=|\{t:\,n_{t}\neq n_{t-1}\}|$ is the number of expert switches in ${N}_{T}$ .

Similar to the non-adaptive case, the algorithm requires choosing the learning rate $\eta$ so as to minimize the regret bound. The optimal $\eta$ should be chosen with respect to $T$ and $\sum_{t=1}^{T}D_{t}$ (it is also possible to minimize the bound w.r.t. a particular number of switches $K$ ). The following discussion is similar to the one at the end of Subsection 4.1.

If the learner knows $\sum_{t=1}^{T}D_{t}$ beforehand or $D_{t}$ is sampled from some distribution with known expectation $\mathbb{E}D$ , the choice of

[TABLE]

respectively results in

[TABLE]

and

[TABLE]

(expected) regret bound with respect to any sequence with no more than $K$ expert switches.

If the sequence of delays is chosen by an adversary and unknown to the learner, then the classical choice $\eta\propto\frac{\sqrt{\ln T}}{H\sqrt{T}}$ results in ${O[(K+2)\sqrt{T\ln T}(1+\overline{D})]}$ regret, where ${\overline{D}=\frac{1}{T}\sum_{t=1}^{T}D_{t}}$ .

5 Proof of Performance

In this section we prove Theorem 1. The proof is complicated, and we split it into two sequential parts. Firstly, we prove the bound (5) for the non-delayed case in Subsection 5.1, i.e., $\{D_{t}\}_{t=1}^{T}=(0,\dots,0)$ . Secondly, we obtain the bound (5) for arbitrary sequence of delays $\{D_{t}\}_{t=1}^{T}$ in Subsection 5.2.

5.1 Bound for Non-delayed Setting

We set $D_{t}\equiv 0$ for all $t$ and deal with the bound for algorithm $\mathcal{G}$ in this case. Note that $\mathcal{D}_{t}=\{1,2,\dots,t\}$ for all $t=1,2,\dots,T$ and $\bm{L}_{\mathcal{D}_{t}}=\bm{L}_{t}$ .

Proof.

Recall that $w_{t}^{n_{t}}=p(n_{t}|\bm{L}_{\mathcal{D}_{t-1}})=p(n_{t}|\bm{L}_{t-1})$ . Define the mixloss at the step $t$ :

[TABLE]

Define the cumulative mixloss $M_{T}$ over the entire game:

[TABLE]

For all $t=1,\dots,T$ we apply Hoeffding’s inequality (25) to a random variable

[TABLE]

where $n_{t}\sim p(n_{t}|\bm{L}_{t-1})=w_{t}^{n_{t}}$ :

[TABLE]

which is equal to

[TABLE]

We sum (17) for $t=1,2,\dots,T$ and obtain

[TABLE]

which finishes the proof.∎
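The two ingredients of this proof, the telescoping of the cumulative mixloss and the per-step gap between the learner's loss and the mixloss, can be checked numerically for the simplest model. The sketch assumes the standard mixloss $m_{t}=-\frac{1}{\eta}\ln\sum_{n}w_{t}^{n}e^{-\eta l_{t}^{n}}$ (any normalizing constant of the observation model is ignored), a hidden expert that never switches, and losses in $[0,H]$ ; the function names are hypothetical.

```python
import math

def mixloss(w, loss, eta):
    """m_t = -(1/eta) * ln sum_n w^n * exp(-eta * l^n)."""
    return -math.log(sum(wn * math.exp(-eta * ln) for wn, ln in zip(w, loss))) / eta

def hedge_loss(w, loss):
    """h_t = <w_t, l_t>."""
    return sum(wn * ln for wn, ln in zip(w, loss))

def hedge_run(losses, prior, eta):
    """Non-delayed Hedge; returns (H_T, M_T, final cumulative expert losses)."""
    cum = [0.0] * len(prior)
    H_T = M_T = 0.0
    for loss in losses:
        raw = [p * math.exp(-eta * c) for p, c in zip(prior, cum)]
        z = sum(raw)
        w = [r / z for r in raw]
        H_T += hedge_loss(w, loss)
        M_T += mixloss(w, loss, eta)
        cum = [c + l for c, l in zip(cum, loss)]
    return H_T, M_T, cum
```

Under these assumptions $M_{T}=-\frac{1}{\eta}\ln\sum_{n}p_{0}(n)e^{-\eta L_{T}^{n}}$ , and Hoeffding's inequality gives $m_{t}\leq h_{t}\leq m_{t}+\frac{\eta H^{2}}{8}$ at every step, so $H_{T}\leq M_{T}+\frac{\eta H^{2}T}{8}$ .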

5.2 Bound for Delayed Setting

In this section we consider the case of arbitrary sequence of delays $\{D_{t}\}_{t=1}^{T}$ .

Proof.

We use the superscript $(\ldots)^{\mathcal{D}}$ to denote the variables obtained by algorithm $\mathcal{G}$ (for example, weights $\bm{w}_{t}^{\mathcal{D}}$ , etc.) with the sequence of delays $\{D_{t}\}_{t=1}^{T}$ . Our main idea is to prove that the weights $\bm{w}_{t}^{\mathcal{D}}$ are approximately equal to the weights $\bm{w}_{t}^{0}$ obtained by the algorithm in the game with the same experts but with no delays, i.e., $\{D_{t}\}_{t=1}^{T}=(0,\dots,0)$ . Thus, the losses $h_{t}^{\mathcal{D}}$ and $h_{t}^{0}$ will be approximately equal.

We divide this part of the proof of the theorem into two steps:

Step 1. Proof for a simple probability distribution $p$

To begin with, we consider the case of a simple Hidden Markov Model $p(\cdot)$ . Let ${p(n_{1})=p_{0}(n_{1})}$ and ${p(n_{t+1}|n_{t})=\mathbb{I}_{[n_{t+1}=n_{t}]}}$ for all ${t=1,\dots,T-1}$ . The corresponding algorithm is $\mathcal{G}_{\text{base}}=\mathcal{G}(p)$ .

We compare the losses $H_{T}^{0}$ and $H_{T}^{\mathcal{D}}$ of algorithm $\mathcal{G}_{\text{base}}$ applied to the same data with no delays and with the given sequence of delays $\{D_{t}\}_{t=1}^{T}$ respectively.

[TABLE]

Note that ${(w_{t}^{n})^{0}\propto e^{-\eta L_{t-1}^{n}}}$ and ${(w_{t}^{n})^{\mathcal{D}}\propto e^{-\eta L_{\mathcal{D}_{t-1}}^{n}}}$ for all $t$ . This means that ${(w_{t}^{n})^{0}\propto[(w_{t}^{n})^{\mathcal{D}}\cdot a^{n}]}$ , where

[TABLE]

Thus, according to Lemma 1, we obtain the bound

[TABLE]

for all $t$ . Combining it with (19) and Lemma 2 (equality (30)) we obtain:

[TABLE]

The final step is to combine the current result with the loss bound (18) for the non-delayed case:

[TABLE]

and finish the proof of the bound for algorithm $\mathcal{G}_{\text{base}}$ .

Step 2. Proof for an arbitrary probability distribution $p$

Now we consider the case of an arbitrary probability distribution $p$ . From the given set of experts $\mathcal{N}$ we create a new super set $\mathcal{S}=\mathcal{N}^{T}$ of super experts $s\in\mathcal{S}$ ( $\mathcal{S}$ for Super). Each super expert $s$ corresponds to some sequence ${{N}_{T}=(n_{1},\dots,n_{T})\in\mathcal{N}^{T}}$ of basic experts $n\in\mathcal{N}$ of length $T$ . We denote the $t$ -th component of super expert $s$ by $n_{t}(s)$ . We denote the full sequence of experts corresponding to $s$ by ${N}_{T}(s)$ . We do not use subscript in order not to overburden the notation. The loss of super expert $s\in\mathcal{S}$ at the step $t$ is $l_{t}^{n_{t}(s)}$ , where $l_{t}^{n}$ (for $n\in\mathcal{N}$ ) are the losses of basic experts. We use $E(\bm{L}_{\mathcal{D}_{t}})$ and $E(\bm{l}_{t})$ to denote all the super experts’ losses at the steps $\mathcal{D}_{t}$ and $t$ respectively ( $E$ for Enhanced).

We define the probability model for hidden super experts. In order not to confuse the reader with notation, we use capital $P$ (instead of regular $p$ ) to denote all probabilities related to super experts. Let $P(s_{1})=P_{0}(s_{1})=p({N}_{T}(s_{1}))$ and

[TABLE]

The described probability distribution corresponds to algorithm $\mathcal{G}_{\text{base}}(P)$ for super experts $s\in\mathcal{S}$ and initial distribution $P_{0}$ . We have $s_{1}=s_{2}=\dots=s_{T}$ w.p. 1.

The main idea is to show that the losses of algorithm $\mathcal{G}_{\text{base}}(P)$ are equal to the losses of algorithm $\mathcal{G}(p)$ . In order to prove this, we show that for all $t$ the sum of the weights

[TABLE]

in algorithm $\mathcal{G}_{\text{base}}(P)$ is equal to $w_{t}^{n}$ in algorithm $\mathcal{G}(p)$ . This sum corresponds to the weight that is allocated to the base expert $n\in\mathcal{N}$ as a part of the super experts’ weight allocation for step $t$ . We perform several calculations:

[TABLE]

Now note that $P(s)=P_{0}(s)=p\big{(}{N}_{T}(s)\big{)},$ and

[TABLE]

Thus, we continue computations:

[TABLE]

Let us show that the value of $n$ -independent normalizing constant $\frac{p(\bm{L}_{\mathcal{D}_{t-1}})}{P\big{(}E(\bm{L}_{\mathcal{D}_{t-1}})\big{)}}$ is equal to $1$ . Indeed,

[TABLE]

We conclude that $\widehat{w}_{t}^{n}=w_{t}^{n}$ for all $t=1,\dots,T$ and $n\in\mathcal{N}$ . Thus, we proved that algorithms $\mathcal{G}_{\text{base}}(P)$ and $\mathcal{G}(p)$ have exactly the same losses. Let $H_{T}$ be the cumulative loss of these algorithms. Then, by using Step 1 of the proof we conclude:

[TABLE]

and finish the proof.∎
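The super-expert reduction of Step 2 can be checked by brute force on a tiny instance. The sketch below assumes no delays, the Fixed Share transition model (stay with probability $1-\alpha_{t}$ , redraw from $p_{0}$ with probability $\alpha_{t}$ ), and 0-based steps; all function names are hypothetical. It enumerates all $N^{T}$ super experts, so it is only meant for very small $N$ and $T$ .

```python
import itertools
import math

def fs_prior(seq, prior, alphas):
    """p(N_T) for the Fixed Share chain; alphas[t] governs the move into position t (t >= 1)."""
    p = prior[seq[0]]
    for t in range(1, len(seq)):
        stay = 1.0 if seq[t] == seq[t - 1] else 0.0
        p *= (1.0 - alphas[t]) * stay + alphas[t] * prior[seq[t]]
    return p

def super_expert_weights(losses, prior, eta, alphas, t):
    """Weight of each base expert at step t, marginalized over all super experts."""
    N, T = len(prior), len(losses)
    marg = [0.0] * N
    for seq in itertools.product(range(N), repeat=T):
        seen = sum(losses[tau][seq[tau]] for tau in range(t))   # loss of s observed so far
        marg[seq[t]] += fs_prior(seq, prior, alphas) * math.exp(-eta * seen)
    z = sum(marg)
    return [m / z for m in marg]

def recursive_weights(losses, prior, eta, alphas, t):
    """The same weight obtained from the forward recursion over base experts."""
    w = list(prior)
    for tau in range(t):
        raw = [x * math.exp(-eta * l) for x, l in zip(w, losses[tau])]
        z = sum(raw)
        a = alphas[tau + 1]
        w = [(1.0 - a) * r / z + a * p for r, p in zip(raw, prior)]
    return w
```

Both functions should return the same vector for every step, which mirrors the statement $\widehat{w}_{t}^{n}=w_{t}^{n}$ in the non-delayed special case.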

6 Experiments

We empirically compare the developed non-replicating Algorithm 3 ( $\mathcal{G}_{\text{base}}$ ) and Algorithm 4 ( $\mathcal{G}_{\text{fs}}$ ) with their replicated analogues obtained from non-delayed Hedge and Fixed Share by using the meta-algorithm BOLD [33].

To begin with, we recall the main idea of the replicating meta-algorithm BOLD. For the sequence of delays $\{D_{t}\}_{t=1}^{T}$ the meta-algorithm BOLD splits the timeline into disjoint subsequences. Each subsequence $\{t_{1}<\dots<t_{S}\}$ satisfies ${t_{s}+D_{t_{s}}<t_{s+1}}$ , so it is possible to run an independent copy of some non-delayed algorithm $\mathcal{A}$ on the subsequence. For simplicity we assume that all the delays $D_{t}$ are known to BOLD beforehand. Thus, the meta-algorithm can choose the optimal learning rate for each copy of $\mathcal{A}$ depending on the length of the corresponding subsequence. For more details about BOLD please refer to the original paper [33].
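One way to produce such a split is a greedy assignment: each step goes to the first copy that has already received all of its feedback, and a new copy is opened when none is free. The sketch below uses this greedy rule as an assumption; the actual selection rule of BOLD is specified in [33], and the name `split_timeline` is hypothetical.

```python
def split_timeline(delays):
    """Greedy split of steps 1..T into subsequences with t_s + D_{t_s} < t_{s+1}.

    delays[t - 1] holds D_t. Returns a list of subsequences (lists of steps).
    """
    copies = []          # copies[i] is the list of steps served by copy i
    free_at = []         # first step at which copy i has received all its feedback
    for t in range(1, len(delays) + 1):
        for i, ready in enumerate(free_at):
            if ready <= t:
                break                     # reuse the first free copy
        else:
            copies.append([])             # no free copy: open a new one
            free_at.append(0)
            i = len(copies) - 1
        copies[i].append(t)
        free_at[i] = t + delays[t - 1] + 1
    return copies
```

With a constant delay $D$ the rule opens exactly $D+1$ copies and serves them in round-robin order; for instance, `split_timeline([2] * 7)` yields the subsequences $(1,4,7)$ , $(2,5)$ and $(3,6)$ .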

We use BOLD $(\mathcal{H}_{\text{base}})$ and BOLD $(\mathcal{H}_{\text{fs}})$ to denote replicated Hedge and Fixed Share respectively.

We conduct the experiments on artificial data. Artificial data is widely used to illustrate the performance of Hedge-like algorithms (see [28, 11, 24, 14]).

To generate the data we use schemes similar to the ones from [24]. In all our experiments we set $N=4$ experts and use binary losses, i.e. $\{0,1\}$ . Thus, we set $H=1$ . The length of the game is $T=10000$ .

We sample $l_{t}^{n}\sim\text{Bernoulli}(q^{n})$ , i.i.d. random variables for all ${n=1,2,3,4}$ and ${t=1,2,\dots T}$ . We use two variants of $\bm{q}$ : the first one is

[TABLE]

when all the experts suffer approximately similar losses; the second one,

[TABLE]

when experts differ a lot.

The sequence of delays is random. Each $D_{t}$ is sampled from a Poisson distribution whose mean $\lambda$ is known to the learner, i.e. $D_{t}\sim\text{Poisson}(\lambda)$ .

Note that all the computational results are averaged over ${R=250}$ random realizations of data (losses, delays) for all considered parameters ( $\bm{q},\lambda$ ).
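A data generator of this kind can be sketched with the Python standard library alone. The function names, the seed handling and the use of Knuth's multiplication method for Poisson sampling are choices of this sketch, not details taken from the experiments; the vector `q` must be supplied by the caller.

```python
import math
import random

def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for moderate means (exp(-lam) must not underflow)."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def make_game(q, T, lam, seed=0):
    """Return (losses, delays): i.i.d. Bernoulli(q[n]) losses and Poisson(lam) delays."""
    rng = random.Random(seed)
    losses = [[1 if rng.random() < qn else 0 for qn in q] for _ in range(T)]
    delays = [sample_poisson(lam, rng) for _ in range(T)]
    return losses, delays
```

Averaging over $R$ realizations then amounts to calling `make_game` with $R$ different seeds and averaging the resulting regrets.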

6.1 Experiments with Hedge

In this subsection we compare non-replicating algorithm $\mathcal{G}_{\text{base}}$ and replicating algorithm BOLD $(\mathcal{H}_{\text{base}})$ .

For each copy of $\mathcal{H}_{\text{base}}$ started by BOLD on the subsequence of length $S$ we use its optimal learning rate

[TABLE]

Note that BOLD( $\mathcal{H}_{\text{base}}$ ) runs $\approx[1+\mathbb{E}D]$ copies of $\mathcal{H}_{\text{base}}$ , each of length $\approx\frac{T}{[\mathbb{E}D+1]}$ (in the case $D_{t}\equiv D=\mathbb{E}D$ for all $t$ , all approximations become equalities), with learning rate

[TABLE]

Thus, in order to equalize the learning speed of $\mathcal{G}_{\text{base}}$ and BOLD( $\mathcal{H}_{\text{base}}$ ), it is fair to assign $[1+\mathbb{E}D]$ times lower learning rate

[TABLE]

to algorithm $\mathcal{G}_{\text{base}}$ . The usage of such $\eta$ leads to $O(\sqrt{T(1+\mathbb{E}D)})$ regret bound (see Subsection 4.1).

For integer values of $\lambda=\mathbb{E}D\in[0,250]$ , we compare the total regret $R_{T}$ of $\mathcal{G}_{\text{base}}$ and BOLD( $\mathcal{H}_{\text{base}}$ ) with respect to the best expert. The resulting empirical dependence is shown in Figures 5(a), 5(b) for losses generated with the use of $\bm{q}_{1}$ and $\bm{q}_{2}$ respectively.

We discuss the results in Section 6.3 below.

6.2 Experiments with Fixed Share

In this subsection we compare non-replicating algorithm $\mathcal{G}_{\text{fs}}$ and replicating algorithm BOLD $(\mathcal{H}_{\text{fs}})$ .

We set $K=10$ switches and generate datasets which have $K$ switches of the best expert. To create such a dataset, we randomly select $K$ time steps ${t_{1}<t_{2}<\dots<t_{K}}$ . On each $k$ -th segment $[t_{k}+1,t_{k+1}]$ (for $k=0,1,\dots,K$ and $t_{0}=0,t_{K+1}=T$ ) we fix a random permutation $\sigma_{k}$ on the set of $N$ elements and sample the losses of expert $n=1,2,3,4$ from Bernoulli $(q^{\sigma_{k}(n)})$ . Thus, we obtain a sequence of losses which has up to $K$ switches of the best expert. We also assume that the learner does not know $K$ in advance.
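The construction above can be sketched as follows, assuming that the $K$ switch points are drawn uniformly without replacement from $\{1,\dots,T-1\}$ and that every permutation $\sigma_{k}$ is uniform. The function name `make_switching_losses` and the seed argument are hypothetical, and `q` is supplied by the caller.

```python
import random

def make_switching_losses(q, T, K, seed=0):
    """Bernoulli losses whose expert-to-rate assignment is reshuffled at K random steps."""
    rng = random.Random(seed)
    N = len(q)
    cuts = sorted(rng.sample(range(1, T), K))      # t_1 < ... < t_K
    bounds = [0] + cuts + [T]                      # t_0 = 0, t_{K+1} = T
    losses = []
    for k in range(K + 1):
        sigma = list(range(N))
        rng.shuffle(sigma)                         # permutation sigma_k
        for _ in range(bounds[k], bounds[k + 1]):  # steps t_k + 1, ..., t_{k+1}
            losses.append([1 if rng.random() < q[sigma[n]] else 0 for n in range(N)])
    return losses, cuts
```

Two consecutive permutations may keep the same best expert, which is why the data has up to $K$ switches rather than exactly $K$ .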

In order not to overburden the reader, we use the same learning rates as in the previous subsection. For every copy of $\mathcal{H}_{\text{fs}}$ generated by BOLD, the learning rate is defined by (20). For the $\mathcal{G}_{\text{fs}}$ the learning rate is given by (21).

We discuss the results in Section 6.3 below.

6.3 Discussion

In all Figures 5(a), 5(b), 6(a), 6(b) we see that the non-replicated algorithms outperform their replicated counterparts.

For the Hedge algorithm, from Figures 5(a), 5(b) we also conclude that as the expected delay $\mathbb{E}D$ increases, the gap between the performance of non-replicating Hedge ( $\mathcal{G}_{\text{base}}$ ) and replicating BOLD $(\mathcal{H}_{\text{base}})$ increases. Indeed, the bigger the expected delay is, the more infrequent the separate learning processes generated by BOLD become and the less data they see. Nevertheless, while each base copy of $\mathcal{H}_{\text{base}}$ runs on $\approx(1+\mathbb{E}D)$ times less data than the non-replicated $\mathcal{G}_{\text{base}}$ , it uses an $\approx(1+\mathbb{E}D)$ times higher learning rate, which should balance its learning speed with that of the non-replicated $\mathcal{G}_{\text{base}}$ .

Note that Hedge is equal to Online Mirror Descent (OMD) with Entropic Regularization (see e.g. [13]). OMD runs Online Gradient Descent (OGD)

[TABLE]

in the mirrored space $\mathbb{R}^{N}$ and after each gradient step transforms the mirrored weight $\bm{x}_{t}$ into primal weight

[TABLE]

so that $\bm{w}_{t+1}\in\Delta(\mathcal{N})$ is the decision of the algorithm on weight allocation.
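The equivalence can be illustrated numerically. The sketch assumes a uniform prior (mirrored start $\bm{x}_{1}=\bm{0}$ ), the gradient step $\bm{x}_{t+1}=\bm{x}_{t}-\eta\bm{l}_{t}$ and the softmax map back to the simplex; these concrete forms and the function names are assumptions of the sketch.

```python
import math

def softmax(x):
    m = max(x)                            # shift for numerical stability
    e = [math.exp(v - m) for v in x]
    z = sum(e)
    return [v / z for v in e]

def hedge_via_omd(losses, eta):
    """Gradient steps x_{t+1} = x_t - eta * l_t in the mirrored space, then softmax."""
    x = [0.0] * len(losses[0])            # a uniform prior corresponds to x_1 = 0
    for loss in losses:
        x = [xv - eta * l for xv, l in zip(x, loss)]
    return softmax(x)

def hedge_multiplicative(losses, eta):
    """Multiplicative update w^n <- w^n * exp(-eta * l^n), renormalized at each step."""
    n = len(losses[0])
    w = [1.0 / n] * n
    for loss in losses:
        raw = [wv * math.exp(-eta * l) for wv, l in zip(w, loss)]
        z = sum(raw)
        w = [r / z for r in raw]
    return w
```

Both functions return the same weight vector up to rounding, because renormalization commutes with the exponential of a sum.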

In the case of i.i.d. experts losses, the mirrored estimates of $\bm{x}_{t}$ for both $\mathcal{G}_{\text{base}}$ on $t$ observations and $\mathcal{H}_{\text{base}}$ on $\approx\frac{1}{1+\mathbb{E}D}t$ observations have the same expectation. Indeed,

[TABLE]

where we use $\text{SP}(t)$ to denote the set of all time steps $\tau\leq t$ included in the separate learning process (generated by BOLD) that is used at the step $t$ . In the transition between lines (22) and (23) we use definition (21) of the learning rates. In line (24) we note that the size of the set $\text{SP}(t)$ is $\approx\frac{t}{1+\mathbb{E}D}$ .

As in (22)-(24), we compare the covariance matrices of the mirrored estimates $\bm{x}_{t}$ obtained by $\mathcal{G}_{\text{base}}$ and BOLD. Again, using the i.i.d. assumption we derive

[TABLE]

Note that all the described covariance matrices are diagonal because we consider the case when the losses of different experts are independent.

We see that while the expectation of the estimates of the mirrored weight $\bm{x}_{t}$ is equal for both non-replicated $\mathcal{G}_{\text{base}}$ and replicated BOLD $(\mathcal{H}_{\text{base}})$ , the variance differs by a factor of $1+\mathbb{E}D$ . In particular, this means that the distribution of mirrored weights $\bm{x}_{t}$ for these two algorithms differs. The mirrored weight of $\mathcal{G}_{\text{base}}$ has the smaller variance and is therefore more robust than the corresponding weight of a copy of $\mathcal{H}_{\text{base}}$ . As we see from the experiments, this robustness of the mirrored weight $\bm{x}_{t}$ also leads to robustness of the primal weights $\bm{w}_{t+1}$ and results in better performance.
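The factor $1+\mathbb{E}D$ can be traced in two lines. The worked equations below assume $\bm{x}_{1}=\bm{0}$ , mirrored weights of the form $-\eta\sum_{\tau}\bm{l}_{\tau}$ over the steps each learner has seen, i.i.d. losses with mean vector $\bm{\mu}$ and per-expert variances $\sigma_{n}^{2}$ , learning rate $\eta$ for $\mathcal{G}_{\text{base}}$ and $(1+\mathbb{E}D)\eta$ for a copy of $\mathcal{H}_{\text{base}}$ that has seen $\approx\frac{t}{1+\mathbb{E}D}$ steps. The superscripts $\mathcal{G}$ , $\mathcal{H}$ and the symbols $\bm{\mu}$ , $\sigma_{n}$ are notation introduced only for this illustration.

```latex
\begin{align*}
\mathbb{E}\,\bm{x}_{t}^{\mathcal{G}} &= -\eta\,t\,\bm{\mu},
&\mathbb{E}\,\bm{x}_{t}^{\mathcal{H}} &\approx -(1+\mathbb{E}D)\,\eta\cdot\frac{t}{1+\mathbb{E}D}\,\bm{\mu} = -\eta\,t\,\bm{\mu},\\
\operatorname{Var}\big(x_{t}^{n,\mathcal{G}}\big) &= \eta^{2}\,t\,\sigma_{n}^{2},
&\operatorname{Var}\big(x_{t}^{n,\mathcal{H}}\big) &\approx (1+\mathbb{E}D)^{2}\,\eta^{2}\cdot\frac{t}{1+\mathbb{E}D}\,\sigma_{n}^{2} = (1+\mathbb{E}D)\,\eta^{2}\,t\,\sigma_{n}^{2}.
\end{align*}
```

The learning rate enters the mean linearly and the variance quadratically, while the number of observations enters both linearly, which is why the means coincide and the variances do not.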

If the data is not stochastic-like, e.g. is maximally adversarial, the above argument does not apply, and the replicated algorithms may outperform their non-replicated analogues.

We also note another important advantage of the non-replicating algorithms. They are more interpretable than their replicated analogues. The weights obtained by non-replicated algorithms are smooth (thus, more interpretable), whereas the weights of replicated algorithms are smooth only inside every domain of the independent learning subprocess.

To illustrate this, we plot the weight evolution of experts obtained by $\mathcal{G}_{\text{fs}}$ and BOLD $(\mathcal{H}_{\text{fs}})$ in a single experiment with $\mathbb{E}D=40$ and $K=10$ experts’ switches with experts’ losses generated using $\bm{q}_{2}$ . The weight evolution on time interval $(4200,4300)$ is shown in Figures 7(a) and 7(b). One may clearly see that the experts’ weights of replicated algorithm in Figure 7(b) look like uninterpretable noise (because the weights of separate learning processes significantly differ).

We also attach the plot of the full weight evolution of the non-replicated algorithm in Figure 8.

To conclude, it seems that the non-replicated algorithms outperform the replicated ones on stochastic-like data. It would be interesting to obtain a concrete empirical condition on adversarial data under which the non-replicated algorithms still perform better than their replicated analogues. This problem serves as a challenge for our further research.

7 Conclusion

In the article we developed the general hedging algorithm $\mathcal{G}$ (based on classical Hedge) for the delayed feedback experts’ weight allocation (see Section 3, Algorithm 2). The developed algorithm is applicable both to hedging countable and continuous sets of experts. Thanks to our main result (Theorem 1), we can bound its loss or regret with respect to the switching sequence of experts.

We described two examples of applications of algorithm $\mathcal{G}$ for delayed feedback setting. Algorithm 3 ( $\mathcal{G}_{\text{base}}$ , Subsection 4.1) is an extension of the classical Hedge for the delayed feedback. Algorithm 4 ( $\mathcal{G}_{\text{fs}}$ , Subsection 4.2) is the adaptation of classical Fixed Share. Both algorithms are non-replicated, which means that they use all the observed data to make the decision (in contrast to existing meta-approaches to delayed feedback setting).

It seems that the general probabilistic model which we described can be enhanced even more. First of all, it is reasonable to consider dynamic time-dependent learning rates $\eta_{t}$ for different time steps $t$ (the usual choice of dynamic learning rate in the non-delayed setting is $\eta_{t}\propto\frac{1}{\sqrt{t}}$ ). This may free the learner from choosing the learning rate beforehand. Secondly, it is possible to consider different observation probabilities (2) (or potential, see [6]). A different choice may yield generalized versions of the loss bound of Theorem 1 for many other algorithms based on multiplicative weights (e.g. MW2 [4]). These directions serve as a challenge for our further research.

Acknowledgements

The research was partially supported by the Russian Foundation for Basic Research grant 16-29-09649 ofi_m.

Appendix A Math Tools

In this appendix we describe the math tools that we use in our article. We start with the well-known Hoeffding’s and Pinsker’s inequalities and then state and prove the important Lemmas (used in the proof of our main Theorem 1).

Hoeffding’s inequality. Let $X\in[a,b]\subset\mathbb{R}$ be a random variable. Then,

$\ln\mathbb{E}e^{sX}\leq s\,\mathbb{E}X+\frac{s^{2}(b-a)^{2}}{8}\qquad(25)$

for all $s\in\mathbb{R}$ .

Pinsker’s inequality. Let $p(x)$ and $q(x)$ be probabilities (or densities) of $x\in X$ for two discrete (continuous) distributions over discrete (continuous) set $X\subset\mathbb{R}^{N}$ . Then

$\|p-q\|_{1}\leq\sqrt{2\,KL(p||q)},\qquad(26)$

where $KL(p||q)$ is Kullback–Leibler divergence between $p$ and $q$ .
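Pinsker's inequality in its standard form $\sum_{x}|p(x)-q(x)|\leq\sqrt{2\,KL(p||q)}$ (natural logarithm) can be checked numerically for discrete distributions. The sketch assumes that $q(x)>0$ wherever $p(x)>0$ ; the function names are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p(x) = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def l1_distance(p, q):
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

def pinsker_gap(p, q):
    """sqrt(2 * KL(p || q)) - ||p - q||_1, non-negative by Pinsker's inequality."""
    kl = max(0.0, kl_divergence(p, q))    # guard against tiny negative rounding errors
    return math.sqrt(2.0 * kl) - l1_distance(p, q)
```

A typical use in the setting of Lemma 1 is to take $q(x)\propto p(x)\cdot e^{-\eta l^{x}}$ and observe that the gap stays non-negative for any losses $l^{x}$ .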

The following technical Lemma plays an important role in the proof of Theorem 1 (Section 5).

Lemma 1.

Let $X\subset\mathbb{R}^{N}$ be a countable (or continuous) set. Let $p(x)$ and $q(x)$ denote probabilities (or densities) of two random variables with values in $X$ . Let $a:X\rightarrow\mathbb{R}$ be a measurable function such that for all $x\in X$ we have $-\frac{1}{\eta}\ln a(x)\in[0,C]$ . Then if $q(x)\propto p(x)\cdot a(x)$ , the following holds true (in the continuous case the sum should be replaced by the integral):

[TABLE]

Proof.

Apply Pinsker’s inequality (26) to $p(x)$ and $q(x)$ and obtain

[TABLE]

Note that $q(x)=\frac{p(x)a(x)}{\sum_{x^{\prime}\in X}p(x^{\prime})a(x^{\prime})}$ . We compute the divergence

[TABLE]

where in (29) we denote $l^{x}=-\frac{1}{\eta}\ln a(x)\in[0,C]$ (for $x\in X$ ) and use Hoeffding’s inequality (25) for the random variable which is equal to $l^{x}$ w.p. $p(x)$ . To finish, we obtain the bound (27) by combining (28) with the upper bound (29). ∎

Lemma 2.

Let $T>0$ be an integer and $\{D_{t}\}_{t=1}^{T}$ be the sequence of integer delays such that $t+D_{t}\leq T$ . Let $\mathcal{D}_{t}=\{\tau|\tau+D_{\tau}\leq t\}$ . Then

[TABLE]

Proof.

Note that all $\mathcal{D}_{\tau}$ for $\tau\geq t+D_{t}$ contain $t$ . Thus,

[TABLE]

Since $|\mathcal{D}_{T}|=T$ and $D_{T}=0$ , the obtained expression is equivalent to the desired equality (30). ∎
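The counting step of this proof can be verified numerically. Since every $\mathcal{D}_{\tau}$ with $\tau\geq t+D_{t}$ contains $t$ , the step $t$ is counted $T-t-D_{t}+1$ times in $\sum_{\tau=1}^{T}|\mathcal{D}_{\tau}|$ . The sketch below checks this identity and one consequence of it, namely that the total number of observations missing at decision time, $\sum_{t=1}^{T}\big(t-1-|\mathcal{D}_{t-1}|\big)$ , equals $\sum_{t=1}^{T}D_{t}$ . The function name is hypothetical, and the exact displayed form of (30) is not assumed.

```python
def count_identity(delays):
    """Return (sum_tau |D_tau|, sum_t (T - t - D_t + 1), missing observations).

    delays[t - 1] holds D_t; the lemma's condition t + D_t <= T is asserted.
    """
    T = len(delays)
    assert all(t + delays[t - 1] <= T for t in range(1, T + 1))
    lhs = sum(sum(1 for s in range(1, T + 1) if s + delays[s - 1] <= tau)
              for tau in range(1, T + 1))
    rhs = sum(T - t - delays[t - 1] + 1 for t in range(1, T + 1))
    # (t - 1) past steps minus |D_{t-1}| losses already revealed, summed over t
    missing = sum((t - 1) - sum(1 for s in range(1, t) if s + delays[s - 1] <= t - 1)
                  for t in range(1, T + 1))
    return lhs, rhs, missing
```

For the delays $(2,0,1,0)$ both sides of the counting identity equal 7 and the number of missing observations is 3, the sum of the delays.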

Lemma 3.

Let ${N}_{T}$ be the sequence of experts $(n_{1},n_{2},\dots,n_{T})\in\mathcal{N}^{T}$ , where $\mathcal{N}=\{1,2,\dots,N\}$ . Let $p(\cdot)$ be the probabilistic model used in Fixed Share with prior $p_{0}\equiv\frac{1}{N}$ and switch probabilities $\alpha_{t}=\frac{1}{t}$ for all $t=2,\ldots,T$ . Then

[TABLE]

where $|d{N}_{T}|=|\{t:\,n_{t}\neq n_{t-1}\}|$ is the number of expert switches in $N_{T}$ .

Proof.

Simple calculations

[TABLE]

prove the lemma. ∎
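The quantity bounded in Lemma 3 can be computed directly for any given sequence. The sketch assumes the Fixed Share transition model with uniform prior, i.e., $p(n_{t}|n_{t-1})=(1-\alpha_{t})\mathbb{I}_{[n_{t}=n_{t-1}]}+\frac{\alpha_{t}}{N}$ with $\alpha_{t}=\frac{1}{t}$ ; the function names are hypothetical.

```python
import math

def count_switches(seq):
    """K = |{t : n_t != n_{t-1}}|."""
    return sum(1 for a, b in zip(seq, seq[1:]) if a != b)

def neg_log_prior(seq, N):
    """-ln p(N_T) for the Fixed Share chain with uniform prior and alpha_t = 1/t."""
    total = math.log(N)                       # -ln p_0(n_1)
    for t in range(2, len(seq) + 1):          # 1-based step t
        a = 1.0 / t
        stay = 1.0 if seq[t - 1] == seq[t - 2] else 0.0
        total -= math.log((1.0 - a) * stay + a / N)
    return total
```

Each switch factor is at least $\frac{1}{TN}$ , and the factors of the remaining steps are each at least $1-\frac{1}{t}$ , so their product is at least $\prod_{t=2}^{T}(1-\frac{1}{t})=\frac{1}{T}$ . Hence under these assumptions $-\ln p({N}_{T})\leq(K+1)(\ln N+\ln T)$ , an elementary estimate of the same flavour as the lemma, whose exact statement is not reproduced here.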

Bibliography

[1] N. Littlestone, M. K. Warmuth, The Weighted Majority Algorithm, Inf. Comput. 108 (2) (1994) 212–261, ISSN 0890-5401, URL http://dx.doi.org/10.1006/inco.1994.1009
[2] Y. Freund, R. E. Schapire, A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting, Journal of Computer and System Sciences 55 (1) (1997) 119–139, ISSN 0022-0000, URL http://www.sciencedirect.com/science/article/pii/S002200009791504X
[3] L. Devroye, G. Lugosi, G. Neu, Prediction by random-walk perturbation, in: Conference on Learning Theory, 460–473, 2013.
[4] N. Cesa-Bianchi, Y. Mansour, G. Stoltz, Improved second-order bounds for prediction with expert advice, Machine Learning 66 (2-3) (2007) 321–352.
[5] W. Kotłowski, On minimaxity of follow the leader strategy in the stochastic setting, Theoretical Computer Science 742 (2018) 50–65.
[6] N. Cesa-Bianchi, G. Lugosi, Prediction, Learning, and Games, Cambridge University Press, New York, NY, USA, ISBN 0521841089, 2006.
[7] V. G. Vovk, Aggregating Strategies, in: Proceedings of the Third Annual Workshop on Computational Learning Theory, COLT ’90, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, ISBN 1-55860-146-5, 371–386, URL http://dl.acm.org/citation.cfm?id=92571.92672 , 1990.
[8] V. Vovk, A Game of Prediction with Expert Advice, J. Comput. Syst. Sci. 56 (2) (1998) 153–173, ISSN 0022-0000, URL http://dx.doi.org/10.1006/jcss.1997.1556
