Adaptive Hedging under Delayed Feedback
Alexander Korotin, Vladimir V'yugin, Evgeny Burnaev

TL;DR
This paper introduces a new adaptive hedging algorithm for online expert weight allocation that effectively handles delayed feedback, extending classical algorithms and providing theoretical regret bounds in adversarial settings.
Contribution
We develop the General Hedging algorithm based on exponential reweighing, extending classical Hedge and Fixed Share algorithms to delayed feedback scenarios.
Findings
The algorithm achieves adversarial loss bounds under delay.
It extends classical Hedge and Fixed Share algorithms to delayed feedback.
Provides regret bounds for both countable and continuous expert sets.
Abstract
The article is devoted to investigating the application of hedging strategies to online expert weight allocation under delayed feedback. As the main result, we develop the General Hedging algorithm based on the exponential reweighing of experts' losses. We build an artificial probabilistic framework and use it to prove the adversarial loss bounds for the algorithm in the delayed feedback setting. The designed algorithm can be applied to both countable and continuous sets of experts. We also show how the algorithm extends the classical Hedge (Multiplicative Weights) and adaptive Fixed Share algorithms to delayed feedback, and we derive their regret bounds for the delayed setting from our main result.
Skolkovo Institute of Science and Technology
keywords:
hedging, decision-theoretic online learning, experts problem, delayed feedback, adaptive algorithms, non-replicating algorithms, adversarial setting.
journal: Neurocomputing
1 Introduction
We consider the Decision-Theoretic Online Learning (DTOL) framework [1, 2, 3, 4, 5], which is closely related to the paradigm of prediction with expert advice [6, 7, 8, 9, 1, 10, 11, 12]. At every step of the game, a master algorithm has to choose the weight allocation for a given pool of expert strategies (experts). We call this problem the experts problem. We investigate the adversarial case, i.e., no assumptions are made about the nature of the data (stochastic, deterministic, etc.).
The performance of the master algorithm is measured by the regret over the entire game. The regret is the difference between the cumulative loss of the online algorithm and the loss of some given comparator. A typical comparator is the best fixed expert in the pool or the best fixed convex combination of experts. The goal of the algorithm is to minimize the regret.
In classical online learning, the algorithm suffers the loss of its decision at the end of the same step in which the decision is made. In contrast to the classical scenario, we consider learning with delayed feedback: at each step of the game the algorithm makes a decision, and the result of that decision is revealed only after some delay.
A wide range of algorithms exists for the non-delayed scenario. Almost all of them exploit the follow-the-best-expert idea: the better an expert performed in the past, the higher the relative weight assigned to it. The pure Follow the Leader (FTL) strategy is well known to perform well in the stochastic setting [5] (Footnote 1: the pure Follow the Leader strategy is known to be minimax in the simplest stochastic setting, where experts' losses are i.i.d. between experts and time steps), but it may be inefficient when the data is generated by an adversary (see the discussion in [13, 14]).
The Follow the Perturbed Leader (FTPL) algorithm [15] adds random noise to the expert evaluation process, which prevents overfitting in the adversarial setting. For example, exponential [15], random-walk [3], and dropout [16] noise have been shown to achieve low expected regret for the experts problem.
Follow the Regularized Leader (FTRL) is a powerful algorithm from the online convex optimization framework [18, 13] (Footnote 2: equivalently, Online Mirror Descent, see [17]). Using a linear loss function on a simplex makes it possible to address the experts problem. Quadratic regularization leads to the Online Gradient Descent (OGD) algorithm [13], while entropic regularization yields the Exponential Weights algorithm, also known as Hedge [2].
The idea of the multiplicative weight (MW) updates of the Hedge algorithm is used in many subsequent algorithms (MW2 [4], Variation-MW [19], Optimistic-MW [20], AEG-Path and AMEG-Path [21], and other algorithms [22, 23]). The main goal of such algorithms is to obtain a first- or second-order regret bound (e.g. in terms of the best expert's loss) or to achieve an improvement on easy data. Also, some Hedge-based algorithms (AdaHedge [24], Flip-Flop [14]) are designed to be parameter-free.
Almost all the described algorithms provide $O(\sqrt{T})$ adversarial regret guarantees w.r.t. the best expert in the pool, where $T$ is the game length. Note that this bound is minimax optimal up to some multiplicative factor because $\Omega(\sqrt{T})$ is known to be the lower bound [6] (Footnote 3: more precisely, the lower bound is $\Omega(\sqrt{T \ln N})$, where $N$ is the number of experts in the finite pool).
An important variant of the experts problem is to develop an adaptive master algorithm. Such an algorithm has to track the shifts (switches) of the best expert and achieve low tracking regret with respect to shifting sequences of experts. (Footnote 4: in online learning, the term adaptive sometimes means that the algorithm dynamically changes its learning rate during the game; this is not the meaning used here.) There are many meta-approaches, such as restarts [25, 26] or specialist experts [27], for creating adaptive algorithms from non-adaptive ones. However, the most recognizable approach is to use the Fixed Share extension of Hedge [28, 29, 25, 11].
When it comes to the delayed feedback setting, many of the above described non-delayed algorithms do not have theoretical guarantees of performance or do not even have a modification for the delayed feedback setting.
A number of meta-algorithms produce a delayed-feedback version of a basic non-delayed algorithm [30, 31, 32, 33]. The roots of the meta-approach lie in the work [30], whose authors studied the setting with a fixed, known feedback delay. They proved that the optimal (non-adaptive) algorithm is to run independent versions of the optimal non-delayed algorithm on disjoint time grids. Thus, the optimal worst-case adversarial regret is . The described meta-approach was extended to unknown and dynamic feedback delays in [33]. Their meta-algorithm BOLD (Black-box Online Learning with Delays) also runs independent copies of the basic algorithm on disjoint time lines.
We call algorithms obtained by meta-approaches (such as BOLD) replicated algorithms. Whereas replicating is simple and in some cases is theoretically optimal, it has several obvious practical drawbacks. Firstly, it uses only part of the observed data at every step of the game. Secondly, separate replicating learning processes generated by the meta-algorithm do not even interact.
Non-adaptive algorithms based on FTRL and FTPL have several non-replicated adaptations for delayed feedback setting. The most straightforward ones are Delayed OGD [34], Delayed FTPL and FTRL [35] and FTRL with Memory [36]. For the fixed and known feedback delay their best regret bound is , which is optimal.
In this work, we aim to create an adaptive non-replicated algorithm for the delayed feedback setting. We base our research on the Hedge algorithm (and its adaptive extension Fixed Share), which is the state-of-the-art basis for many existing algorithms. To achieve this goal, we develop a general probabilistic framework for Hedge-based algorithms. Using this framework, we propose the General Hedging algorithm and prove its loss bounds for both the delayed and non-delayed cases. As a corollary of the main result, we show how the classical non-delayed Hedge and Fixed Share algorithms (as special cases of the General Hedging algorithm) can be extended to the delayed feedback setting and what regret bounds they have.
The main contributions of this paper are:
1. Developing the General Hedging algorithm for the delayed feedback scenario, which is applicable to both non-delayed and delayed online settings, and proving the algorithm's loss bound (and regret bound, for the case of a countable set of experts) in a general form.
2. Developing (for a finite number of experts) non-replicated versions of the basic Hedge [2] and adaptive Fixed Share [28] algorithms (as special cases of the General Hedging algorithm) for the delayed feedback scenario, as well as deriving their regret bounds.
The General Hedging algorithm that we develop is motivated by the paper [10]. In that work, the authors considered the special case of prediction with experts' advice with the logarithmic loss function. For the traditional non-delayed scenario, they developed the Bayesian Merging Algorithm for mixing (averaging) experts' predictions. Their algorithm is based on the natural graphical model (similar to the one in Figure 1 of Section 3) implied by the probabilistic origin of the logarithmic loss function.
In contrast to [10], we consider the decision-theoretic online learning scenario (hedging), which is more general than prediction with experts' advice. (Footnote 5: the hedging scenario assumes that the learner has access only to the losses of the experts, while in prediction with experts' advice the learner knows the experts' predictions and observes the true outcomes, so the losses are computed using the known loss function. Prediction with experts' advice can be reduced to hedging by discarding the experts' predictions and using only their computed losses.) At the same time, we investigate both the non-delayed and delayed feedback settings. We build an artificial probabilistic framework for arbitrary bounded losses by using the entropification transform (loss exponentiation, see e.g. [37, 38]), state the General Hedging algorithm, and prove its loss bound.
The article is structured as follows:
In Section 2 we give preliminary notions, describe the notation and the setting of the game of the delayed feedback experts’ weights allocation.
In Section 3 we describe the developed probabilistic framework and the main algorithm, and formulate the main Theorem 1 about its loss bound. In Section 5 we prove the main theorem.
In Section 4 we provide examples of the application of the algorithm: Delayed Hedge in Subsection 4.1 and Delayed Fixed Share in Subsection 4.2.
In Section 6 we conduct extensive computational experiments and provide a detailed discussion of the results.
In Appendix A we provide the necessary mathematical background.
2 Preliminaries
We use bold font to denote vectors (e.g. for some integer ). In most cases, superscript is used for indexing elements of a vector (e.g. ). Subscript is always used to indicate time (e.g. ).
We consider the online game of delayed hedging of a (finite or infinite) pool of experts. We use to denote the pool and as an index of an expert. In this paper is either a discrete set (e.g. ) or a continuous subset of Euclidean space (e.g. ). By for a discrete (continuous) set we denote all discrete (continuous) probability distributions on .
For convenience, we do all calculations in the paper assuming that the pool of experts is a discrete countable set. All the results also hold for a continuous pool, but sums over experts should be replaced with the corresponding integrals.
At each integer time step of the game the master (hedging) algorithm has to assign the weights to all experts so that
[TABLE]
At the end of the step (for integer experts reveal their losses at the step . The loss of the algorithm’s decision of the step is
[TABLE]
i.e., the average experts’ loss w.r.t. .
The sequence is called the sequence of delays. For simplicity, we assume that for all . In particular, . We denote the set of all time indices of the losses revealed before the end of the step by . Also, we denote .
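The bookkeeping of revealed losses can be sketched in a few lines of Python. The sketch assumes 1-based steps and that the loss of step `s` is revealed at the end of step `s + delays[s - 1]`, with a delay of 0 meaning immediate feedback. This indexing convention and the function names are illustrative assumptions, not the paper's notation.

```python
from collections import defaultdict


def reveal_schedule(delays):
    """Map each step t to the steps whose losses are revealed at the end of t.

    Assumption: the loss of step s (1-based) arrives at the end of step
    s + delays[s - 1]; a delay of 0 means ordinary non-delayed feedback.
    """
    schedule = defaultdict(list)
    for s, d in enumerate(delays, start=1):
        if d < 0:
            raise ValueError("delays must be non-negative")
        schedule[s + d].append(s)
    return dict(schedule)


def observed_before(delays, t):
    """Steps whose losses are already known when the decision of step t is made."""
    return sorted(s for s, d in enumerate(delays, start=1) if s + d < t)
```

With all delays equal to zero, `observed_before(delays, t)` is simply every step before `t`, which recovers the classical protocol.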
There are many scenarios on how the sequence is chosen (randomly, adversarially) and whether it is known to the learner in advance or not (see e.g. [30, 31, 32, 39, 33]). Yet, we do not specify the particular scenario, and consider the game in the general form.
In this work we assume that all the losses are bounded: for all and . This is a common assumption in online learning (see [13, 18] or any other survey on online learning). The game setting is described by the following Protocol 1.
We use and to denote the cumulative (total) loss of the algorithm and expert .
The performance of the algorithm is measured by the (cumulative) regret. The regret is the difference between the cumulative loss of the algorithm and the cumulative loss of some given comparator. A typical approach is to compete with the best expert in the pool. The cumulative regret with respect to the best expert is
[TABLE]
The goal of the algorithm is to minimize the regret. In order to theoretically guarantee the algorithm's performance, an upper bound is usually proved for the cumulative regret.
In the basic setting (1), sub-linear upper bound for the regret leads to the asymptotic performance of the algorithm equal to the performance of the best expert. More precisely, we have .
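As a small illustration of these definitions, the following Python sketch computes the algorithm's loss at a step as the weighted average of the experts' losses and the cumulative regret with respect to the best fixed expert. The function names and the list-of-lists data layout are assumptions chosen for the example.

```python
def algorithm_loss(weights, losses):
    """Loss of one decision: the weighted average of the experts' losses."""
    return sum(w * l for w, l in zip(weights, losses))


def regret_vs_best_expert(weight_seq, loss_seq):
    """Cumulative loss of the algorithm minus that of the best fixed expert."""
    total = sum(algorithm_loss(w, l) for w, l in zip(weight_seq, loss_seq))
    n_experts = len(loss_seq[0])
    expert_totals = [sum(l[n] for l in loss_seq) for n in range(n_experts)]
    return total - min(expert_totals)
```

For example, uniform weights over two experts with losses `[1, 0]` at each of two steps give a cumulative loss of 1 against a best-expert loss of 0, so the regret is 1.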
3 Generalized Hedging Algorithm
In this section we describe the generalization of the classical hedging algorithm based on exponential reweighing of experts’ losses. The basic algorithm was introduced by [2].
We investigate the adversarial case, i.e., no assumptions (stochastic, functional, etc.) are made about the nature of data (experts’ losses). However, it turns out that in this case it is convenient to develop algorithms using some probabilistic framework.
3.1 Probabilistic Framework
Recall that is a dictionary of experts’ losses at the step . The framework that we build implies that data is generated by some probabilistic model with hidden states. The graphical model is shown in Figure 1.
We suppose that there is some hidden sequence of experts $n_t$ (for $t = 1, \dots, T$) that generates the experts' losses $\bm{l}_t$. In particular, the hidden expert $n_t$ at step $t$ is called the active expert. The conditional probability to observe the vector $\bm{l}_t$ of experts' losses at the step $t$ is
p(\bm{l}_{t}|n_{t})=\frac{e^{-\eta l_{t}^{n_{t}}}}{Z},
(2)
where $\eta$ is some fixed learning rate and $Z$ is the normalizing constant. The constant $Z$ is independent of both $t$ and $n_t$. The idea of conditional probability (2) is to assume that if at the step $t$ expert $n_t$ is active, then the loss vector $\bm{l}_t$ is not completely random, i.e. the loss $l_t^{n_t}$ is random, while all the other components are deterministic (e.g. given by nature). (Footnote 6: Another definition of conditional probability is also possible. All the elements $l_t^n$ can be considered as independent random variables. If expert $n_t$ is active, the probability of observing $l_t^{n_t}$ is equal to the current right-hand side of equation (2). All the other losses are i.i.d. uniform variables on $[0, H]$. For the case of a finite pool of $N$ experts the formula (2) is replaced by
p(\bm{l}_{t}|n_{t})=p(l_{t}^{n_{t}}|n_{t})\times\bigg[\prod_{n\neq n_{t}}p(l_{t}^{n}|n_{t})\bigg]=\frac{e^{-\eta l_{t}^{n_{t}}}}{Z}\times\frac{1}{H^{N-1}},
(3)
i.e. it has an additional factor of $H^{N-1}$ in the denominator. However, for an infinite number of experts this approach requires a more detailed specification of probabilities in terms of measures, because the denominator becomes infinite. In Section 5 we will see that the exact value of the normalization constant is important neither for the algorithm nor for its regret bound. Thus, for convenience it is reasonable to consider the model (2).)
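For losses bounded in $[0, H]$, the normalizing constant can be written out explicitly. The short derivation below is a sketch under the assumption that $Z$ normalizes the density of the active expert's loss over $[0, H]$, which is what makes both factors in (3) proper densities:

```latex
Z = \int_{0}^{H} e^{-\eta l}\, dl = \frac{1 - e^{-\eta H}}{\eta},
\qquad
\int_{0}^{H} \frac{e^{-\eta l}}{Z}\, dl = 1,
\qquad
\int_{0}^{H} \frac{1}{H}\, dl = 1 .
```

Under this reading, $Z$ depends only on $\eta$ and $H$, which is consistent with the remark that it does not depend on the step or on the active expert.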
For the first active expert some known prior distribution is given . The sequence of active experts is generated step by step. For each is sampled from some known distribution , where . (Footnote 7: In case , we obtain a traditional Hidden Markov Process: the hidden state at step depends only on the previous hidden state at step .) Thus, active expert depends on the previous experts .
For every sequence of experts we denote the cumulative loss of the sequence by
[TABLE]
For all we define the following lists of loss vectors:
[TABLE]
The considered probabilistic model is:
[TABLE]
The probability is that of hidden states (active experts). (Footnote 8: The form is used only for convenience and association with the online scenario. It does not impose any restrictions on the type of probability distribution. In fact, may be any distribution on of any form.)
3.2 General Hedging Algorithm
The hedging algorithm 2 is shown below. We denote it by ( stands for General), where indicates the probability distribution of active experts to which the algorithm is applied.
The idea of the algorithm is simple: set the weight allocation for the current step according to the posterior probability of the expert computed from the underlying probabilistic model. We illustrate this idea in Figure 2.
Consider a finite pool and set for all . Consider the non-delayed scenario ( for all ). If we use , the experts' weights become . The resulting algorithm turns out to be the classical non-delayed Hedge (for a more detailed discussion see Subsection 4.1). Also, non-delayed Fixed Share is the case of for a specially chosen Markovian (see Subsection 4.2).
The time and memory complexities of the algorithm depend on the properties of the underlying distribution . For Markovian models (when hidden state depends only on the previous state for all ) it is possible to provide linear in schemes to compute weights (see Subsection 4.2) which require memory. For the arbitrary time and memory complexity may be even exponential.
3.3 Guarantees of Performance
The algorithm has theoretical guarantees of performance. We state the following main theorem.
Theorem 1 (Adversarial loss bound for the General Hedging algorithm).
Let be a countable (or continuous) set of experts. Let be a discrete (or continuous) distribution on . Then for the hedging algorithm applied to model with learning rate the following upper bound for the total loss over the entire game holds true:
[TABLE]
The proof of this theorem is given in Section 5. Note that while the algorithm may seem to be designed for the stochastic setting, we apply it to the purely adversarial case (Footnote 9: the only assumption is that the losses are bounded) and obtain the loss guarantees. At the same time, the adversarial loss bound (5) depends on the probability distribution for which the algorithm is designed.
One may wonder how Theorem 1 is applied to estimate the regret, for example, the regret with respect to the best expert (1). If the set of experts is countable, then the following simple corollary holds true.
Corollary 1 (Adversarial regret bound for the General Hedging algorithm).
If the set of experts is countable, then under the conditions of Theorem 1, the regret with respect to any sequence is
[TABLE]
Proof.
The corollary results from the following inequality for the expectation in the right-hand side of (5):
[TABLE]
which leads to the desired bound.∎
If is continuous under the conditions of Theorem 1, then the first term in the upper bound (5) is represented by the integral (instead of a countable sum). It is not possible to extract a single summand as in the finite case. However, sometimes the expectation can be directly computed or estimated w.r.t. the loss of the best expert in the pool. For example, see approaches of [40, 41, 42] applied to Online Kernel Regression.
The regret bound (6) is a linear function of the game length $T$. Nevertheless, if the game length is known in advance, one may achieve a sub-linear regret bound by choosing the learning rate depending on $T$. Particular examples of learning rates for specific underlying distributions are provided in Section 4 below.
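To see why tuning the learning rate yields a sub-linear bound, consider a bound of the generic form $A/\eta + \eta B T$, where $A$ and $B$ are positive constants that do not depend on $\eta$ or on the game length $T$. This generic form is an assumption made purely for illustration; the exact constants of (6) are not reproduced here.

```latex
f(\eta) = \frac{A}{\eta} + \eta B T,
\qquad
f'(\eta) = -\frac{A}{\eta^{2}} + B T = 0
\;\Longrightarrow\;
\eta^{*} = \sqrt{\frac{A}{B T}},
\qquad
f(\eta^{*}) = 2\sqrt{A B T} = O(\sqrt{T}).
```

The first term rewards a large learning rate and the second penalizes it, so balancing the two gives a bound that grows as the square root of the game length.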
4 Examples
In this section we provide examples of useful underlying probability models and apply the algorithm to them to construct online expert weight allocation algorithms. We consider a finite pool of experts.
4.1 Basic Delayed Exponential Weights (Hedge)
Consider the following underlying probability . Let be some prior and for . This means that the hidden active expert does not change during the game. We denote the corresponding algorithm applied to by . The corresponding graphical model is shown in Figure 3.
It is easy to see that for all the weight allocation
[TABLE]
is proportional to the observed losses of the expert . If there are no delays ( for all ), then the algorithm becomes classical Hedge by [2].
4.1.1 Algorithm
The pseudo-code of Algorithm 3 () is shown below. In the code we assume that the operation Output() sets the weight allocation () for the current step. Function GetRevealedLosses() obtains all the vectors of losses of the steps in the form of an iterable list of pairs .
The algorithm requires memory and time complexity.
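A minimal Python sketch of the idea follows: the weights are proportional to the prior times the exponentiated negative sum of the losses revealed so far. The class name `DelayedHedge`, the method names, and the `(step, loss_vector)` pair format are assumptions made for the example and are not the paper's pseudo-code.

```python
import math


class DelayedHedge:
    """Exponential weights that use only the losses revealed so far."""

    def __init__(self, prior, eta):
        self.prior = list(prior)
        self.eta = eta
        self.cum_loss = [0.0] * len(self.prior)  # revealed losses only

    def weights(self):
        # Subtract the smallest cumulative loss for numerical stability;
        # the shift cancels after normalization.
        base = min(self.cum_loss)
        raw = [p * math.exp(-self.eta * (c - base))
               for p, c in zip(self.prior, self.cum_loss)]
        total = sum(raw)
        return [r / total for r in raw]

    def observe(self, revealed):
        """Absorb an iterable of (step, loss_vector) pairs, in any order."""
        for _step, loss_vector in revealed:
            for n, loss in enumerate(loss_vector):
                self.cum_loss[n] += loss
```

Because only one cumulative loss per expert is stored, the order in which delayed losses arrive does not matter, and the state grows with the number of experts rather than with the game length.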
4.1.2 Regret bound
According to Corollary 1, the regret of the algorithm with respect to any fixed expert is bounded:
[TABLE]
The typical prior is . For this basic case in the non-delayed feedback setting, the learning rate is chosen in advance (with prior knowledge of the game length) to minimize the regret. The optimal choice is , which results in classical regret.
However, the choice of optimal in the delayed setting highly depends on how the sequence of delays is generated. If the learner knows in advance or is sampled from some distribution with known expectation , the optimal choice is
[TABLE]
respectively. This choice results in and regret bounds respectively.
If the sequence of delays is chosen by an adversary, the classical choice results in regret, where .
4.2 Adaptive Delayed Exponential Weights (Fixed Share)
Consider the following underlying probability . Let be some prior and
[TABLE]
for and sequence . This means that the hidden active expert changes to random (according to prior ) between steps and with some small probability .
We denote the corresponding algorithm applied to by . The graphical model is shown in Figure 4.
The sequence can be arbitrary. However, the classical approach is to use (see [28, 25, 29]), because in this special case the regret bound is better (than e.g. in the case ). In our case at the end of the subsection we will also use the sequence when estimating the regret.
4.2.1 Equivalence to Fixed Share in the Non-Delayed Setting
To begin with, we examine the application of algorithm to the described probabilistic model in the non-delayed case, i.e., for all . In the non-delayed case we have for all . Thus, for all we get
[TABLE]
We set , for all and . We get
[TABLE]
Combining (8) with (7) we see that
[TABLE]
On the other hand,
[TABLE]
Thus,
[TABLE]
Formulas (9) and (10) mean that the algorithm’s decision can be iteratively updated step by step by using the additional weight . The obtained weight updates (9) and (10) exactly match the updates of the Fixed Share algorithm by [28]. Thus, in the non-delayed case is equal to Fixed Share.
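The two-stage structure of the non-delayed update can be sketched in Python as a loss update followed by a share update. The function name, the constant switch probability `alpha`, and the argument layout are assumptions for illustration; the sketch follows the description above, in which the active expert is redrawn from the prior with a small probability between steps.

```python
import math


def fixed_share_step(weights, losses, eta, alpha, prior):
    """One non-delayed Fixed Share update: loss update, then share update."""
    # Loss update: exponential reweighting, as in Hedge.
    raw = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
    total = sum(raw)
    posterior = [r / total for r in raw]
    # Share update: with probability alpha the active expert is redrawn
    # from the prior, otherwise it stays the same.
    return [(1.0 - alpha) * q + alpha * p for q, p in zip(posterior, prior)]
```

Setting `alpha` to 0 reduces the step to a plain Hedge update, while `alpha` equal to 1 forgets the past entirely and returns the prior.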
4.2.2 Algorithm for the Delayed Setting
Now we examine the algorithm under the setting of the delayed feedback, i.e., for all delay is some non-negative integer value.
For all and we use to denote the set of all time steps such that the loss vector is revealed not later than the step . Formally, we define
[TABLE]
In the next few paragraphs we describe an efficient scheme for recomputing the algorithm's decision at every step.
Suppose that at the beginning of the step we keep all the probabilities for all and . We denote the corresponding -dimensional probability vectors by . We also denote . Similar to (8) calculations lead to the simple formula that allows to obtain :
[TABLE]
After the decision on is made, the algorithm obtains losses of steps . Thus, we need to calculate new probability vector with coordinates . Moreover, we have to update all vectors for from to .
Let . Note that all for do not require being updated because . Next, for we recompute the vectors iteratively.
We explain how to compute below (assuming that previous is already computed). For convenience, we introduce the temporary vector variable , where for all . First, we express using . Next, we express using .
We deduce the formula to compute by using :
[TABLE]
In line (11) we exploit the fact that the elements of are strictly lower than . In line (12) we use the definition (7) of the transition probability. The vector form of (13) is
[TABLE]
To derive using we consider two cases: and . In the first case , which leads to . If , we have
[TABLE]
The vector form of expression (14) is
[TABLE]
The pseudo-code of algorithm 4 () is shown below. In addition to the notations of Algorithm 3 (), we assume that an extra function GetSwitchProbability() provides the value of the current switch probability (which may be chosen online).
The List() class corresponds to the dynamic array. We assume that it has integer index , supports the append-to-right operation in time. We also assume that all operations to get or set list element (by index) require time. At the end of each step list keeps the posterior probabilities described above.
The time complexity of the algorithm is bounded by
[TABLE]
Indeed, at the steps such that the algorithm performs operations. At other steps the algorithm performs
[TABLE]
operations which are bounded by for the minimal .
The memory complexity of the algorithm is . However, it is possible to significantly reduce the memory complexity. Note that if for some we have , the weights will never be used or recomputed after the step . Thus, they become useless, and it is meaningful to keep only elements with (same for lists and ). The reduction will result in memory complexity. We did not include the explained trick in the pseudo-code of Algorithm 4 in order to keep it simple.
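For checking an efficient implementation, a naive reference computation can be useful. The Python sketch below recomputes the posterior of the active expert from scratch with a forward pass over all earlier steps, skipping the steps whose losses have not yet been revealed. It is not the efficient scheme described above: it costs time proportional to the step index per decision, uses a constant switch probability `alpha`, and all names are assumptions made for the example.

```python
import math


def delayed_fixed_share_weights(t, observed, prior, eta, alpha):
    """Posterior over the active expert at step t given the revealed losses.

    `observed` maps a step s < t to its loss vector; steps whose losses are
    still delayed are absent and are skipped (marginalized out).
    """
    belief = list(prior)  # distribution of the active expert at step 1
    for s in range(1, t):
        if s in observed:
            # Condition on the revealed loss vector of step s.
            belief = [b * math.exp(-eta * l)
                      for b, l in zip(belief, observed[s])]
            total = sum(belief)
            belief = [b / total for b in belief]
        # Transition from step s to s + 1: switch with probability alpha.
        belief = [(1.0 - alpha) * b + alpha * p
                  for b, p in zip(belief, prior)]
    return belief
```

With `alpha` equal to 0 the result coincides with delayed exponential weights, and with every loss revealed immediately it coincides with iterating the non-delayed Fixed Share update.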
4.2.3 Regret Bound
We use . We combine Corollary 1 with Lemma 3 and obtain the regret bound for the algorithm with respect to any switching sequence :
[TABLE]
where is the number of expert’s switches in .
Similar to the non-adaptive case, the algorithm requires choosing the optimal learning rate in order to minimize the regret bound. The optimal learning rate should be chosen with respect to and . (Footnote 10: It is also possible to minimize the bound w.r.t. a particular number of switches .) The following discussion is similar to the one at the end of the previous Subsection 4.1.
If the learner knows beforehand or is sampled from some distribution with known expectation , the choice of
[TABLE]
respectively results in
[TABLE]
and
[TABLE]
(expected) regret bound with respect to any sequence with no more than expert switches.
If the sequence of delays is chosen by an adversary and unknown to the learner, then classical choice results in regret, where .
5 Proof of Performance
In this section we prove Theorem 1. The proof is involved, so we split it into two sequential parts. First, we prove the bound (5) for the non-delayed case in Subsection 5.1. Second, we obtain the bound (5) for an arbitrary sequence of delays in Subsection 5.2.
5.1 Bound for Non-delayed Setting
We set for all and deal with the bound for algorithm in this case. Note that for all and .
Proof.
Recall that . Define the mixloss at the step :
[TABLE]
Define the cumulative mixloss over the entire game:
[TABLE]
For all we apply Hoeffding’s inequality (25) to a random variable
[TABLE]
where :
[TABLE]
which is equal to
[TABLE]
We sum (17) for and obtain
[TABLE]
which finishes the proof.∎
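The key inequality of this argument can be checked numerically. The Python sketch below uses the standard definition of the mix loss for a weight vector and a loss vector, and Hoeffding's lemma for losses in `[0, loss_bound]`, which gives: weighted loss <= mix loss + eta * loss_bound^2 / 8. The function names are assumptions, and the paper's own mix loss is defined through the probabilistic model, so this is an illustration of the mechanism rather than a reproduction of the proof.

```python
import math


def mixloss(weights, losses, eta):
    """Mix loss: -(1/eta) * ln(sum_n w_n * exp(-eta * l_n))."""
    mixture = sum(w * math.exp(-eta * l) for w, l in zip(weights, losses))
    return -math.log(mixture) / eta


def hoeffding_gap(weights, losses, eta, loss_bound):
    """Slack in: weighted loss <= mix loss + eta * loss_bound**2 / 8."""
    weighted = sum(w * l for w, l in zip(weights, losses))
    return mixloss(weights, losses, eta) + eta * loss_bound ** 2 / 8 - weighted
```

A non-negative gap at every step, summed over the game, is what converts a bound on the cumulative mix loss into a bound on the cumulative loss of the algorithm.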
5.2 Bound for Delayed Setting
In this section we consider the case of arbitrary sequence of delays .
Proof.
We use the superscript to denote the variables obtained by algorithm (for example, weights , etc.) with the sequence of delays . Our main idea is to prove that the weights are approximately equal to the weights obtained by the algorithm in the game with the same experts but with no delays, i.e., . Thus, the losses and will be approximately equal.
We divide this part of the proof of the theorem into two steps:
Step 1. Proof for a simple probability distribution
To begin with, we consider the case of a simple Hidden Markov Model . Let and for all . The corresponding algorithm is .
We compare the losses and of algorithm applied to the same data with no delays and with the given sequence of delays respectively.
[TABLE]
Note that and for all . This means that , where
[TABLE]
Thus, according to Lemma 1, we obtain the bound
[TABLE]
for all . Combining it with (19) and Lemma 30 we obtain:
[TABLE]
The final step is to combine current result with the loss bound (18) for the non-delayed case:
[TABLE]
and finish the proof of the bound for algorithm .
Step 2. Proof for an arbitrary probability distribution
Now we consider the case of an arbitrary probability distribution . From the given set of experts we create a new super set of super experts ( for Super). Each super expert corresponds to some sequence of basic experts of length . We denote the -th component of super expert by . We denote the full sequence of experts corresponding to by . We do not use subscript in order not to overburden the notation. The loss of super expert at the step is , where (for ) are the losses of basic experts. We use and to denote all the super experts’ losses at the steps and respectively ( for Enhanced).
We define the probability model for hidden super experts. In order not to confuse the reader with notation, we use capital (instead of regular ) to denote all probabilities related to super experts. Let and
[TABLE]
The described probability distribution corresponds to algorithm for super experts and initial distribution . We have w.p. 1.
The main idea is to show that the losses of algorithm are equal to the losses of algorithm . In order to prove this, we show that for all the sum of the weights
[TABLE]
in algorithm is equal to in algorithm . This sum corresponds to the weight that is allocated to the base expert as a part of the super experts’ weight allocation for step . We perform several calculations:
[TABLE]
Now note that $P(s)=P_{0}(s)=p\big(N_{T}(s)\big)$, and
[TABLE]
Thus, we continue computations:
[TABLE]
Let us show that the value of -independent normalizing constant $\frac{p(\bm{L}_{\mathcal{D}_{t-1}})}{P\big(E(\bm{L}_{\mathcal{D}_{t-1}})\big)}$ is equal to . Indeed,
[TABLE]
We conclude that for all and . Thus, we proved that algorithms and have exactly the same losses. Let be the cumulative loss of these algorithms. Then, by using part 1 of the proof of the theorem we conclude:
[TABLE]
and finish the proof.∎
6 Experiments
We empirically compare the developed non-replicated Algorithm 3 (Delayed Hedge) and Algorithm 4 (Delayed Fixed Share) with their replicated analogues obtained from the non-delayed Hedge and Fixed Share by using the meta-algorithm BOLD [33].
To begin with, we recall the main idea of replicating meta-algorithm BOLD. For the sequence of the delays meta-algorithm BOLD splits the time line into disjoint subsequences. Each subsequence satisfies , so it is possible to run an independent copy of some non-delayed algorithm on the subsequence. For simplicity we assume that all the delays are known to the BOLD beforehand. Thus, the meta-algorithm can choose the optimal learning rate for each copy of depending on the length of the corresponding subsequence. For more details about algorithm BOLD please refer to the original paper [33].
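The splitting of the time line can be illustrated with a greedy Python sketch in the spirit of BOLD: each step is given to any copy whose previous feedback has already arrived, and a new copy is started only when none is free. This is an illustrative approximation and not the exact procedure of [33]; it assumes that the loss of step `s` is revealed at the end of step `s + delays[s - 1]`, and the function name is hypothetical.

```python
def split_into_copies(delays):
    """Greedily assign each step to a copy whose last feedback has arrived.

    Assumption: the loss of step s (1-based) is revealed at the end of step
    s + delays[s - 1]. Returns one list of steps per copy.
    """
    copies = []    # copies[k] = steps handled by copy k
    ready_at = []  # ready_at[k] = first step at which copy k is free again
    for t, d in enumerate(delays, start=1):
        for k, free_from in enumerate(ready_at):
            if free_from <= t:
                break
        else:
            # No copy is free: start a new independent copy.
            k = len(copies)
            copies.append([])
            ready_at.append(0)
        copies[k].append(t)
        ready_at[k] = t + d + 1  # feedback arrives at the end of step t + d
    return copies
```

With a constant delay of one step, the sketch alternates between two copies, which makes visible the drawback noted earlier: each copy learns from only part of the observed data.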
We use BOLD and BOLD to denote replicated Hedge and Fixed Share respectively.
We conduct the experiments on artificial data. Artificial data is widely used to illustrate the performance of Hedge-like algorithms (see [28, 11, 24, 14]).
To generate the data we use schemes similar to the ones from [24]. In all our experiments we set experts and use binary losses, i.e. . Thus, we set . The length of the game is .
We sample , i.i.d. random variables for all and . We use two variants of : the first one is
[TABLE]
when all the experts suffer approximately similar losses; the second one,
[TABLE]
when experts differ a lot.
The sequence of delays is random. Each is sampled from a Poisson distribution with mean , which is known to the learner, i.e. .
Note that all the computational results are averaged over random realizations of data (losses, delays) for all considered parameters ().
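A data generator of this kind might look like the sketch below. All concrete values are assumptions made for illustration: the number of experts, the horizon, the Bernoulli means (drawn around 0.5 with a hypothetical `spread` parameter) and the shift that keeps every delay at least 1 are not taken from the paper.

```python
import numpy as np


def make_game(n_experts=10, horizon=1000, mean_delay=5.0, spread=0.1, seed=0):
    """Return (losses, delays) for one random realization of the game."""
    rng = np.random.default_rng(seed)
    # Hypothetical Bernoulli means: a small spread gives similar experts,
    # a large spread gives experts that differ a lot.
    means = 0.5 + spread * rng.uniform(-1.0, 1.0, size=n_experts)
    # Binary losses, one row per round and one column per expert.
    losses = rng.binomial(1, means, size=(horizon, n_experts)).astype(float)
    # Assumed shift: 1 + Poisson(mean_delay - 1) has mean mean_delay and is >= 1.
    delays = 1 + rng.poisson(mean_delay - 1.0, size=horizon)
    return losses, delays
```

Averaging over realizations then amounts to calling `make_game` with different seeds.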
6.1 Experiments with Hedge
In this subsection we compare non-replicating algorithm and replicating algorithm BOLD.
For each copy of started by BOLD on the subsequence of length we use its optimal learning rate
[TABLE]
Note that BOLD() runs roughly copies of , each of length with learning rate [Footnote 11: In the case for all , all approximations become equalities.]
[TABLE]
Thus, in order to equalize the learning speed of and BOLD(), it is fair to assign times lower learning rate
[TABLE]
to algorithm . The usage of such leads to regret bound (see Subsection 4.1).
For integer values of , we compare the total regret of and BOLD() with respect to the best expert. The resulting empirical dependence is shown in Figures 5(a), 5(b) for losses generated with the use of and respectively.
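The regret of a non-replicating learner can be measured with a sketch like the following. It assumes that the learner reweights experts exponentially using every loss vector observed so far, and that the loss of round t becomes visible just before round t + d_t. The function name and the choice of `eta` are illustrative assumptions, not the paper's Algorithm 3 verbatim.

```python
import numpy as np


def delayed_hedge_regret(losses, delays, eta):
    """Regret w.r.t. the best expert of exponential weights under delay.

    Assumed convention: the loss vector of round t becomes visible just
    before round t + delays[t].
    """
    horizon, n_experts = losses.shape
    # arrivals[r] lists the rounds whose feedback is revealed before round r.
    arrivals = [[] for _ in range(horizon + 1)]
    for t, d in enumerate(delays):
        r = t + int(d)
        if r <= horizon:
            arrivals[r].append(t)
    observed = np.zeros(n_experts)    # cumulative observed loss per expert
    learner_loss = 0.0
    for t in range(horizon):
        for s in arrivals[t]:
            observed += losses[s]
        logits = -eta * observed
        weights = np.exp(logits - logits.max())    # shift for numerical stability
        weights /= weights.sum()
        learner_loss += float(weights @ losses[t])
    return learner_loss - losses.sum(axis=0).min()
```

With all delays equal to 1 this reduces to the classical non-delayed Hedge update.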
We discuss the results in Section 6.3 below.
6.2 Experiments with Fixed Share
In this subsection we compare non-replicating algorithm and replicating algorithm BOLD.
We set switches and generate datasets which have switches of the best expert. To create such a dataset, we randomly select time steps . On each -th segment (for and ) we fix random permutation on the set of elements and sample the losses of expert from Bernoulli. Thus, we obtain the sequence of losses which has up to switches of the best expert. We also assume that the learner does not know in advance.
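The construction of a switching dataset can be sketched as follows. The Bernoulli means, the number of experts, the horizon and the number of switches are hypothetical placeholders; only the overall scheme (random switch points, a fresh random permutation of experts on every segment) follows the description above.

```python
import numpy as np


def make_switching_losses(n_experts=10, horizon=1000, n_switches=3, seed=0):
    """Binary losses whose best expert may change at random switch points."""
    rng = np.random.default_rng(seed)
    # Hypothetical Bernoulli means; the paper's values are not reproduced here.
    base_means = np.linspace(0.2, 0.8, n_experts)
    # Distinct switch points inside the game, in increasing order.
    cuts = np.sort(rng.choice(np.arange(1, horizon), size=n_switches, replace=False))
    bounds = np.concatenate(([0], cuts, [horizon]))
    losses = np.empty((horizon, n_experts))
    for start, stop in zip(bounds[:-1], bounds[1:]):
        # A fresh permutation reshuffles which expert has the lowest mean.
        means = base_means[rng.permutation(n_experts)]
        losses[start:stop] = rng.binomial(1, means, size=(stop - start, n_experts))
    return losses, cuts
```

Because a new permutation may keep the same best expert, the sequence has up to `n_switches` switches rather than exactly that many.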
In order not to overburden the reader, we use the same learning rates as in the previous subsection. For every copy of generated by BOLD, the learning rate is defined by (20). For the the learning rate is given by (21).
We discuss the results in Section 6.3 below.
6.3 Discussion
In Figures 5(a), 5(b), 6(a) and 6(b) we see that the non-replicated algorithms outperform their replicated counterparts.
For the Hedge algorithm, Figures 5(a) and 5(b) also show that as the expected delay increases, the gap between the performance of non-replicating Hedge () and replicating BOLD increases. Indeed, the bigger the expected delay is, the more infrequent the separate learning processes generated by BOLD become and the less data they see. Nevertheless, while each base copy of runs on times less data than the non-replicated , it uses times higher learning rate, which should balance the learning speed with the non-replicated .
Note that Hedge is equivalent to Online Mirror Descent (OMD) with entropic regularization (see e.g. [13]). OMD runs Online Gradient Descent (OGD)
[TABLE]
in the mirrored space and after each gradient step transforms the mirrored weight into primal weight
[TABLE]
so that is the decision of the algorithm on weight allocation.
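This equivalence can be checked numerically with a small sketch. It assumes linear losses, a uniform starting point and a softmax map from the mirrored weight to the primal weight; the function names are illustrative.

```python
import numpy as np


def softmax(theta):
    z = np.exp(theta - theta.max())    # shift for numerical stability
    return z / z.sum()


def omd_entropic(losses, eta):
    """Gradient steps in the mirrored space, softmax back to the simplex."""
    theta = np.zeros(losses.shape[1])    # mirrored weight, uniform start
    out = []
    for loss in losses:
        out.append(softmax(theta))
        theta = theta - eta * loss       # the gradient of a linear loss is the loss vector
    return np.array(out)


def hedge_multiplicative(losses, eta):
    """Classical multiplicative-weights form of Hedge."""
    w = np.full(losses.shape[1], 1.0 / losses.shape[1])
    out = []
    for loss in losses:
        out.append(w.copy())
        w = w * np.exp(-eta * loss)
        w /= w.sum()
    return np.array(out)
```

Both functions return the same weight trajectory up to floating-point error, since the mirrored weight is just the negated, scaled cumulative loss.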
In the case of i.i.d. experts' losses, the mirrored estimates of for both on observations and on observations have the same expectation. Indeed,
[TABLE]
where we use to denote the set of all time steps included in the separate learning process (generated by BOLD) that is used at the step . In the transition between lines (22) and (23) we use definition (21) of the learning rates. In line (24) we note that the size of the set is .
As in (22)-(24), we compare the covariance matrices of the mirrored estimates of obtained by and BOLD. Again, using the i.i.d. assumption we derive
[TABLE]
Note that all the described covariance matrices are diagonal because we consider the case when the losses of different experts are independent.
We see that while the expectation of the estimates of the mirrored weight is equal for both non-replicated and replicated BOLD, the variance differs times. In particular, this means that the distribution of mirrored weights for these two algorithms differs. The mirrored weight of is more robust than the corresponding weight of a copy of . As we see from the experiments, this robustness of the mirrored weight also leads to robustness of the primal weights and results in better performance.
If the data does not behave stochastically, e.g. is maximally adversarial, the above argument obviously does not work, and the replicated algorithms may outperform their non-replicated analogues.
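The mean and variance comparison can be illustrated by simulation for a single expert. The sketch assumes i.i.d. Bernoulli losses, a learning rate `eta / delay` on all `n_obs` observations for the non-replicated learner, and a learning rate `eta` on `n_obs // delay` observations for one replicated copy. The names and default values are assumptions; the scaling factor is taken to be the expected delay.

```python
import numpy as np


def mirrored_moments(n_obs=1000, delay=5, eta=0.1, p=0.3, n_runs=20000, seed=0):
    """Empirical (mean, variance) of the mirrored weight of one expert."""
    rng = np.random.default_rng(seed)
    # A sum of n i.i.d. Bernoulli(p) losses is Binomial(n, p).
    full = -(eta / delay) * rng.binomial(n_obs, p, size=n_runs)
    copy = -eta * rng.binomial(n_obs // delay, p, size=n_runs)
    return (full.mean(), full.var()), (copy.mean(), copy.var())
```

Under these assumptions the two means agree, while the variance of the copy is larger by a factor close to `delay`.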
We also note another important advantage of the non-replicating algorithms. They are more interpretable than their replicated analogues. The weights obtained by non-replicated algorithms are smooth (thus, more interpretable), whereas the weights of replicated algorithms are smooth only inside every domain of the independent learning subprocess.
To illustrate this, we plot the weight evolution of experts obtained by and BOLD in a single experiment with and experts’ switches with experts’ losses generated using . The weight evolution on time interval is shown in Figures 7(a) and 7(b). One may clearly see that the experts’ weights of replicated algorithm in Figure 7(b) look like uninterpretable noise (because the weights of separate learning processes significantly differ).
We also attach the plot of the full weight evolution of the non-replicated algorithm in Figure 8.
To conclude, it seems that the non-replicated algorithms outperform replicated ones on the stochastic-like data. It would be interesting to obtain some concrete empirical condition on adversarial data under which the non-replicated algorithms perform better than their replicated analogues. This problem serves as a challenge for our further research.
7 Conclusion
In the article we developed the general hedging algorithm (based on classical Hedge) for the delayed feedback experts’ weight allocation (see Section 3, Algorithm 2). The developed algorithm is applicable both to hedging countable and continuous sets of experts. Thanks to our main result (Theorem 1), we can bound its loss or regret with respect to the switching sequence of experts.
We described two examples of applications of algorithm for delayed feedback setting. Algorithm 3 (, Subsection 4.1) is an extension of the classical Hedge for the delayed feedback. Algorithm 4 (, Subsection 4.2) is the adaptation of classical Fixed Share. Both algorithms are non-replicated, which means that they use all the observed data to make the decision (in contrast to existing meta-approaches to delayed feedback setting).
It seems that the general probabilistic model which we described can be enhanced even further. First of all, it is reasonable to consider dynamic time-dependent learning rates for different time steps . [Footnote 12: The usual choice of dynamic learning rate in the non-delayed setting is .] This may free the learner from choosing the learning rate beforehand. Secondly, it is possible to consider different observation probabilities (2) (or potential, see [6]). A different choice may make it possible to obtain generalized versions of Theorem 1 and of its loss bound for many other algorithms based on multiplicative weights (e.g. MW2 [4]). These directions serve as a challenge for our further research.
Acknowledgements
The research was partially supported by the Russian Foundation for Basic Research grant 16-29-09649 ofi_m.
Appendix A Math Tools
In this appendix we describe the math tools that we use in our article. We start with the well-known Hoeffding's and Pinsker's inequalities and then state and prove the important Lemmas (used in the proof of our main Theorem 1).
Hoeffding’s inequality. Let be a random variable. Then,
[TABLE]
for all .
Pinsker’s inequality. Let and be probabilities (or densities) of for two discrete (continuous) distributions over discrete (continuous) set . Then
[TABLE]
where is Kullback–Leibler divergence between and .
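For reference, the textbook forms of these two inequalities can be written as follows. The symbols $X$, $a$, $b$, $s$, $p$, $q$ and $\mathcal{S}$ are notation assumed here, and the paper's statements (25) and (26) may use a different normalization.

$$
\ln \mathbb{E}\, e^{sX} \le s\,\mathbb{E}X + \frac{s^{2}(b-a)^{2}}{8}, \qquad X \in [a,b],\ s \in \mathbb{R},
$$

$$
\sum_{x \in \mathcal{S}} |p(x) - q(x)| \le \sqrt{2\,\mathrm{KL}(p\,\|\,q)}, \qquad \mathrm{KL}(p\,\|\,q) = \sum_{x \in \mathcal{S}} p(x) \ln \frac{p(x)}{q(x)}.
$$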
The following technical Lemma plays an important role in the proof of Theorem 1 (Section 5).
Lemma 1.
Let be a countable (or continuous) set. Let and denote probabilities (or densities) of two random variables with values in . Let be a measurable function such that for all we have . Then if , the following holds true [Footnote 13: In the continuous case the sum should be replaced by the integral.]:
[TABLE]
Proof.
Apply Pinsker’s inequality (26) for and and obtain
[TABLE]
Note that . We compute the divergence
[TABLE]
where in (29) we denote (for ) and use Hoeffding’s inequality (25) for variable which is equal to w.p. . To finish, we obtain the bound (27) by combining (28) with the upper bound (29). ∎
Lemma 2.
Let be an integer and be the sequence of integer delays such that . Let . Then
[TABLE]
Proof.
Note that all for contain . Thus,
[TABLE]
Since and , the obtained expression is equivalent to desired equality (30). ∎
Lemma 3.
Let be the sequence of experts , where . Let be the probabilistic model used in Fixed Share with prior and switch probabilities for all . Then
[TABLE]
where is the number of expert switches in .
Proof.
Simple calculations
[TABLE]
prove the lemma. ∎
References
- [1] N. Littlestone, M. K. Warmuth, The Weighted Majority Algorithm, Inf. Comput. 108 (2) (1994) 212–261, ISSN 0890-5401, URL http://dx.doi.org/10.1006/inco.1994.1009.
- [2] Y. Freund, R. E. Schapire, A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting, Journal of Computer and System Sciences 55 (1) (1997) 119–139, ISSN 0022-0000, URL http://www.sciencedirect.com/science/article/pii/S002200009791504X.
- [3] L. Devroye, G. Lugosi, G. Neu, Prediction by random-walk perturbation, in: Conference on Learning Theory, 460–473, 2013.
- [4] N. Cesa-Bianchi, Y. Mansour, G. Stoltz, Improved second-order bounds for prediction with expert advice, Machine Learning 66 (2-3) (2007) 321–352.
- [5] W. Kotłowski, On minimaxity of follow the leader strategy in the stochastic setting, Theoretical Computer Science 742 (2018) 50–65.
- [6] N. Cesa-Bianchi, G. Lugosi, Prediction, Learning, and Games, Cambridge University Press, New York, NY, USA, ISBN 0521841089, 2006.
- [7] V. G. Vovk, Aggregating Strategies, in: Proceedings of the Third Annual Workshop on Computational Learning Theory, COLT ’90, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, ISBN 1-55860-146-5, 371–386, URL http://dl.acm.org/citation.cfm?id=92571.92672, 1990.
- [8] V. Vovk, A Game of Prediction with Expert Advice, J. Comput. Syst. Sci. 56 (2) (1998) 153–173, ISSN 0022-0000, URL http://dx.doi.org/10.1006/jcss.1997.1556.
