Finite-time optimality of Bayesian predictors
Daniil Ryabko

TL;DR
This paper proves that, in the most general setting of sequential probability forecasting, a Bayesian predictor with a suitable discrete prior matches the cumulative loss of any predictor up to an additive logarithmic term, with no assumptions on the model class.
Contribution
It establishes the first non-asymptotic bound showing Bayesian predictors' finite-time optimality in extremely general, assumption-free settings.
Findings
The Bayesian predictor's cumulative loss matches that of any predictor up to an additive term of order log n.
The bound applies uniformly over all models in the set C and all time steps.
A lower bound shows the unavoidable growth of loss difference over time.
Abstract
The problem of sequential probability forecasting is considered in the most general setting: a model set C is given, and it is required to predict as well as possible if any of the measures (environments) in C is chosen to generate the data. No assumptions whatsoever are made on the model class C, in particular, no independence or mixing assumptions; C may not be measurable; there may be no predictor whose loss is sublinear, etc. It is shown that the cumulative loss of any possible predictor can be matched by that of a Bayesian predictor whose prior is discrete and is concentrated on C, up to an additive term of order log n, where n is the time step. The bound holds for every n and every measure in C. This is the first non-asymptotic result of this kind. In addition, a non-matching lower bound is established: it goes to infinity with n but may do so arbitrarily slowly.
1 Introduction
Choosing a model is a hard problem. Its solutions are often driven by the ease of finding an algorithm rather than by the adequacy of the model to the task at hand. In the context of the prediction problem, at the very least it is typically assumed that a predictor whose loss is sublinear exists. Even under this assumption, there are no generic methods for constructing a predictor given only a model set $\mathcal C$. This applies not only to the prediction problem, but more generally. One generic method for constructing a learning algorithm is the Bayesian one: choose a prior over the model class and predict according to the posterior distribution given the data. However, there are classical results that show that some, or even, in some sense, most of the priors result in an inconsistent method (Freedman, 1963, 1965; Diaconis and Freedman, 1986). Therefore, the question arises: is it possible to show that the Bayesian predictor with at least some prior will be optimal? In the asymptotic sense, this question was answered in the positive by Ryabko (2010, 2017) (first under the assumption that the best achievable asymptotic average error is 0, and then without this assumption). Thus, the smallest asymptotic average error is achieved by a Bayesian predictor with some prior. However, this leaves open the question of what happens before infinity, allowing for the possibility that, for finite $n$, every Bayesian predictor is grossly suboptimal.
Here we resolve this doubt, and show that, for any model set $\mathcal C$, there is a prior over this set such that the Bayesian predictor with this prior has optimal cumulative error up to an additive term of order $\log n$, for every time step $n$ and for every measure in $\mathcal C$ (not just on average over the prior). This means that the regret of being Bayesian, with some prior, is at most $O(\log n / n)$ on time-average. This is generally considered rather small; in particular, the mere absence of a multiplicative factor may be remarkable, and is already new for the case of cumulative loss. We also establish a lower bound, though the gap with respect to the upper bound is relatively large: the lower bound on the cumulative regret of being Bayesian goes to infinity but may do so arbitrarily slowly.
Setup. A bit more formally, the problem is that of sequential probability forecasting in the following setting. Given a sequence $x_1,\dots,x_n$ of observations $x_i\in\mathcal X$, where $\mathcal X$ is a finite set, it is required to predict the probabilities of observing $x_{n+1}=x$ for each $x\in\mathcal X$, before $x_{n+1}$ is revealed, after which the process continues sequentially, $n\in\mathbb N$. The problem is considered in full generality; in particular, outcomes may exhibit arbitrary dependence. What is given is a set $\mathcal C$ of measures over the space of all one-way infinite sequences. It is assumed that one of the measures in $\mathcal C$, say $\mu$, is chosen to generate the data, and it is required to construct a predictor $\rho$ whose error is as small as possible for every $\mu\in\mathcal C$. The error is measured in terms of the expected (with respect to the unknown measure $\mu$ that generates the data) cumulative (over time steps) KL divergence (log loss) $L_n(\mu,\rho)$:
$$L_n(\mu,\rho) := \mathbb E_\mu \sum_{t=1}^{n} \sum_{a\in\mathcal X} \mu(x_t=a \mid x_1,\dots,x_{t-1}) \log\frac{\mu(x_t=a \mid x_1,\dots,x_{t-1})}{\rho(x_t=a \mid x_1,\dots,x_{t-1})}.$$
Motivation. This and related problems arise in a variety of applications, where the data may be financial, such as a sequence of stock prices; human-generated, such as a written text or a behavioural sequence; biological (DNA sequences); physical measurements and so on. In many of these applications very little, if anything, is known about the process that generates the data, and therefore it is hard to come up with reasonable assumptions. Moreover, achieving 0 asymptotic average error is often hopeless. For example, one can never hope to learn to predict accurately the probabilities of stock market prices, even on long-term average; nor the probability distribution of the next letter of a human-written or other natural text, a problem that is directly linked to compressing such texts. This prompts a consideration of very general classes of environments $\mathcal C$ that would allow for some learning yet would also encompass, as much as possible, all the natural environments one tries to model.
One way to come up with such sets is to consider changing environments. For instance, the data sequence may have a number of change points, such that between each two consecutive change points the sequence is generated by a relatively simple measure (e.g., i.i.d. or Markov), but the sequence of change points is essentially arbitrary, the only constraint being one on the frequency of changes. Another way may be to consider arbitrary additive trends: again, take a sequence generated by a measure from a relatively simple set and sum it up with another, which may be arbitrary except for a constraint on how fast it changes.
These are just some of the ways of constructing large sets of measures $\mathcal C$. These methods do not come close to fully addressing the challenges arising in the applications just mentioned. Here we do not concentrate on any particular example, but rather attempt to tackle the problem in its full generality.
The main result. Take any set of measures $\mathcal C$ and an arbitrary predictor $\rho$. We show that there exists a Bayesian predictor, $\nu$, such that its excess loss with respect to $\rho$ is at most logarithmic:
$$L_n(\mu,\nu) \le L_n(\mu,\rho) + O(\log n)$$
for every $\mu\in\mathcal C$ and every time step $n$. Moreover, the prior of the Bayesian predictor can always be chosen to be discrete, that is,
$$\nu = \sum_{k\in\mathbb N} w_k \mu_k,$$
where $\mu_k\in\mathcal C$, and $w_k$ are real weights. This in particular allows us to consider sets $\mathcal C$ which may not be measurable. The constants in the $O(\log n)$ term are small and are given explicitly; apart from absolute constants, there is also a linear dependence on the size of the alphabet $\mathcal X$.
This is a theoretical result. More steps remain to be made before real applications can be addressed. Perhaps the most important one is finding a general method for constructing the optimal prior whose existence is established in this work.
In addition, a lower bound is established, showing that there exist a set of measures $\mathcal C$ and a measure $\rho$ such that, for every Bayesian predictor $\nu$ whose prior is concentrated on $\mathcal C$, there exist a function $\theta(n)$, infinitely many time steps $n_i$, and measures $\mu_i\in\mathcal C$ such that
$$L_{n_i}(\mu_i,\nu) - L_{n_i}(\mu_i,\rho) \ge \theta(n_i)$$
for all $i\in\mathbb N$.
Thus, there is an order-$\log n$ gap between the upper and the lower bounds.
Related work. Apart from the previously mentioned results on which this work builds, one should mention an alternative general approach to prediction, namely, prediction with expert advice (Cesa-Bianchi and Lugosi, 2006). Here it is assumed that the sequence one tries to predict is completely arbitrary, but, instead, one is given a set of predictors (or experts) to compete with. The relations between this setting and the one considered here have been analysed by Ryabko (2011). What is important to note is that, for this problem in its full generality, there is so far no generic method for constructing predictors to compete with an arbitrary set of experts. In particular, Ryabko (2016) shows that there are sets of experts such that every Bayesian predictor has suboptimal asymptotic average regret. Note that such sets must necessarily be large (in particular, uncountable), while most of the work on expert advice concentrates on finite sets of experts or else on sets of experts satisfying some very specific properties. It remains open to find which properties of sets of experts are necessary and sufficient for any general algorithm (Bayesian or not) to be optimal.
2 Setup
Let be a finite set (the alphabet), and let
[TABLE]
The notation $x_{1..n}$ is used for $x_1,\dots,x_n$. The symbol $\mathbb E_\mu$ denotes the expectation with respect to a measure $\mu$. We consider (probability) measures on $(\mathcal X^\infty,\mathcal F)$, where $\mathcal F$ is the usual Borel sigma-field.
In general, a Bayesian predictor over a set $\mathcal C$ is a measure $\int_{\mathcal C}\mu\,dW(\mu)$, where $W$ is a measure over the set of all measures on $(\mathcal X^\infty,\mathcal F)$, the latter being assumed endowed with the structure of a probability space (Gray, 1988). However, in this paper we shall only be dealing with Bayesian predictors with discrete priors, that is, with predictors of the form $\sum_{k\in\mathbb N} w_k\mu_k$, where $w_k$ are reals (that play the role of the distribution $W$ above), and $\mu_k\in\mathcal C$, $k\in\mathbb N$. This allows us to avoid any measurability issues, in particular allowing $\mathcal C$ to be non-measurable.
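A discrete-prior Bayesian predictor of this kind can be sketched in a few lines. The following is only an illustration under stated assumptions: the binary alphabet, the three Bernoulli i.i.d. components, the weights, and all function names are hypothetical choices made here, not taken from the paper. It shows that predicting with the mixture is the same as averaging the components' next-symbol predictions under the posterior weights.

```python
# Illustrative sketch (assumed names and components, not the paper's construction).
# A measure is represented by a function: past tuple -> list of next-symbol probabilities.

def bernoulli(theta):
    """I.i.d. binary measure with P(next = 1) = theta."""
    return lambda past: [1.0 - theta, theta]

def seq_prob(cond, xs):
    """Probability of the finite sequence xs under the measure `cond` (chain rule)."""
    p = 1.0
    for t in range(len(xs)):
        p *= cond(xs[:t])[xs[t]]
    return p

def mixture_prob(weights, comps, xs):
    """Probability of xs under the mixture sum_k w_k * mu_k."""
    return sum(w * seq_prob(c, xs) for w, c in zip(weights, comps))

def posterior(weights, comps, xs):
    """Posterior weights of the components after observing xs."""
    joint = [w * seq_prob(c, xs) for w, c in zip(weights, comps)]
    z = sum(joint)
    return [j / z for j in joint]

def predictive(weights, comps, xs):
    """Next-symbol distribution of the mixture: posterior-weighted component predictions."""
    post = posterior(weights, comps, xs)
    return [sum(p * c(xs)[a] for p, c in zip(post, comps)) for a in (0, 1)]

comps = [bernoulli(t) for t in (0.2, 0.5, 0.8)]   # hypothetical model set
weights = [0.25, 0.5, 0.25]                        # hypothetical discrete prior
```

Because the prior is a plain weighted sum, no measurability structure on the model set is needed, which is the point made in the paragraph above.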
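For two measures $\mu$ and $\rho$ introduce the expected cumulative Kullback-Leibler divergence (KL divergence) as
$$L_n(\mu,\rho) := \mathbb E_\mu \sum_{t=1}^{n} \sum_{a\in\mathcal X} \mu(x_t=a \mid x_{1..t-1}) \log\frac{\mu(x_t=a \mid x_{1..t-1})}{\rho(x_t=a \mid x_{1..t-1})} = \sum_{x_{1..n}\in\mathcal X^n} \mu(x_{1..n}) \log\frac{\mu(x_{1..n})}{\rho(x_{1..n})}.$$
In words, we take the $\mu$-expected (over data) cumulative (over time) KL divergence between $\mu$- and $\rho$-conditional (on the past data) probability distributions of the next outcome; and this gives simply the $\mu$-expected log-ratio of the likelihoods. Here $\mu$ will be interpreted as the distribution generating the data.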
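The two readings of the loss described above (a sum of expected per-step KL divergences, and an expected log-likelihood ratio) can be checked numerically on a toy case. This is a minimal sketch, assuming a binary alphabet, base-2 logarithms, a hypothetical two-state Markov measure as the data source, and a uniform i.i.d. predictor; none of these choices or function names come from the paper.

```python
import itertools
import math

# A measure is a function: past tuple -> list of next-symbol probabilities.

def seq_prob(cond, xs):
    """Probability of the finite sequence xs under the measure `cond`."""
    p = 1.0
    for t in range(len(xs)):
        p *= cond(xs[:t])[xs[t]]
    return p

def loss_likelihood_ratio(mu, rho, n, alphabet=(0, 1)):
    """Expected log-ratio of likelihoods over all sequences of length n."""
    total = 0.0
    for xs in itertools.product(alphabet, repeat=n):
        pm = seq_prob(mu, xs)
        if pm > 0:
            total += pm * math.log2(pm / seq_prob(rho, xs))
    return total

def loss_stepwise(mu, rho, n, alphabet=(0, 1)):
    """Expected sum over time steps of the KL divergence between conditionals."""
    total = 0.0
    for t in range(n):
        for past in itertools.product(alphabet, repeat=t):
            pp = seq_prob(mu, past)
            if pp == 0:
                continue
            m, r = mu(past), rho(past)
            kl = sum(m[a] * math.log2(m[a] / r[a]) for a in alphabet if m[a] > 0)
            total += pp * kl
    return total

def markov_mu(past):
    """Hypothetical data-generating measure: a sticky two-state Markov chain."""
    if not past:
        return [0.5, 0.5]
    return [0.9, 0.1] if past[-1] == 0 else [0.2, 0.8]

def uniform_rho(past):
    """Hypothetical predictor: uniform i.i.d. on {0, 1}."""
    return [0.5, 0.5]
```

Both functions enumerate all sequences of length n, so they are only usable for very small n; they are meant to make the identity concrete, not to be efficient.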
3 Main result
The main result shows that the performance of any predictor can be matched by that of a Bayesian predictor with some prior, up to an $O(\log n)$ additive term.
Theorem 1.
Let $\mathcal C$ be any set of probability measures on $(\mathcal X^\infty,\mathcal F)$, and let $\rho$ be another probability measure on this space, considered as a predictor. Then there is a discrete Bayesian predictor $\nu$, that is, a predictor of the form $\nu=\sum_{k\in\mathbb N} w_k\mu_k$ where $\mu_k\in\mathcal C$ and $w_k\in[0,1]$, such that for every $\mu\in\mathcal C$ we have
$$L_n(\mu,\nu) \le L_n(\mu,\rho) + O(\log n),$$
where the constants in $O(\log n)$ are small and are given in (24) using the notation defined in (1), (4), (18) and (25). The dependence on the alphabet size is linear, and the rest of the constants are universal.
The proof (which follows below) uses the construction from Ryabko (2017), with a refined and extended analysis that allows rates to be extracted. The main ideas of the proof are as follows. First of all, a separate predictor is constructed to work on time steps $1$ to $n$ for each $n$; these predictors are later summed up with weights to obtain the final predictor. Before going any further, note that constructing a predictor for each $n$ must be done without forgetting the rest of the time indices: in fact, taking a predictor that is minimax optimal for each $n$ and summing these predictors up (with weights) over all $n$ may result in the worst possible predictor overall, and in particular one much worse than the given predictor $\rho$. An example of this behaviour is given in the proof of Theorem 2 (the lower bound). Thus, the measure $\rho$ is used in an essential way when constructing a predictor for each of the time steps $n$. For each $n$, we consider a covering of the set $\mathcal X^n$ with subsets, each of which is associated with a measure from $\mathcal C$. These latter measures are then those the prior is concentrated on (that is, they are summed up with weights). The covering is constructed as follows. The log-ratio function $\log\frac{\mu(x_{1..n})}{\rho(x_{1..n})}$, where $\rho$ is the predictor whose performance we are trying to match, is approximated with a step function, for each size of the step. The cells of the resulting partition are then ordered with respect to their probability. The main part of the proof is then to show that not too many cells are needed to cover the set this way up to a small probability. Quantifying the “not too many” and “small” parts results in the final bound.
Proof of Theorem 1.
Define the weights as follows: , and, for
[TABLE]
where is the normalizer such that . Replacing with if necessary, where is the i.i.d. probability measure with equal probabilities of outcomes, i.e. for all , we shall assume, without loss of generality,
[TABLE]
The replacement is without loss of generality as it adds at most to the final bound (to be accounted for). Thus, in particular,
[TABLE]
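The replacement step can be unpacked with a short calculation. The notation here is assumed for illustration and may differ from the paper's: $\rho$ is the predictor to be matched, $\rho_U$ denotes the uniform i.i.d. measure, $\rho' := \frac12(\rho+\rho_U)$ is the replaced predictor, and logarithms are taken to base 2 (with natural logarithms the constant $1$ below becomes $\ln 2$).

```latex
% Cost of the replacement: \rho' dominates \rho up to a factor 1/2.
\rho'(x_{1..n}) \ge \tfrac12\,\rho(x_{1..n})
\;\Longrightarrow\;
L_n(\mu,\rho') = \mathbb E_\mu \log\frac{\mu(x_{1..n})}{\rho'(x_{1..n})}
\le \mathbb E_\mu \log\frac{\mu(x_{1..n})}{\rho(x_{1..n})} + 1
= L_n(\mu,\rho) + 1 .

% Benefit of the replacement: \rho' is bounded away from zero on every sequence.
\rho'(x_{1..n}) \ge \tfrac12\,\rho_U(x_{1..n}) = \tfrac12\,|\mathcal X|^{-n}
\;\Longrightarrow\;
-\log \rho'(x_{1..n}) \le n\log|\mathcal X| + 1 .
```

So, under these assumptions, mixing with the uniform measure costs at most one extra bit of cumulative loss while guaranteeing that the log-loss of the comparison predictor is at most linear in $n$ on every sequence.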
The first part of the proof is the following covering construction.
For each , define the sets
[TABLE]
From Markov's inequality, we obtain
[TABLE]
For each let be the partition of into intervals defined as follows. , where
[TABLE]
Thus, is a partition of into equal intervals but for some padding that we added to the leftmost and the rightmost intervals: on the left we added and on the right .
For each , , define the sets
[TABLE]
Observe that, for every , these sets constitute a partition of into disjoint sets: indeed, on the left we have by definition (7) of , and on the right we have from (5). In particular, from this definition, for all we have
[TABLE]
For every and consider the following construction. Define
[TABLE]
(since are finite all suprema are reached). Find any such that and let . For , let
[TABLE]
If , let be any such that , and let ; otherwise let and . Note that, for each there is such that and thus from (10) we get
[TABLE]
Finally, define
[TABLE]
(Notice that for every there is only a finite number of positive , since the set is finite; thus the sum in the last definition is effectively finite.) Finally, define the predictor as
[TABLE]
where is a regularizer defined so as to have for each and
[TABLE]
this and the stronger statement (5) for can be obtained analogously to the latter inequality in the case the i.i.d. measure is in ; otherwise (since we need to define as a combination of probability measures from only), can be defined the same way as is done in (Ryabko, 2010, Step r of the proof of Theorem 5); for the sake of completeness, this argument is given in the end of this proof.
Next, let us show that the measure is the predictor whose existence is claimed in the statement.
Introduce the notation
[TABLE]
with this notation, for any set we have
[TABLE]
First we want to show that, for each , for each fixed , the sets are covered by sufficiently few sets , where “sufficiently few” is, in fact, exponentially many with the right exponent. By definition, for each the sets are disjoint (for different ) and have non-increasing (with ) -probability. Therefore, for all . Hence, from the definition of , we must also have for all . From the latter inequality and (11) we obtain
[TABLE]
Take to obtain
[TABLE]
Moreover, for every , for each , there is such that and thus the following chain holds
[TABLE]
where the first inequality is from (14), the second from (13) with , the third is by definition of , the fourth uses for the exponential term, as well as for , which will be justified by the choice of in the following (25), the fifth inequality uses (12), and the final equality introduces defined as
[TABLE]
We have
[TABLE]
For the first term, from (17) we obtain
[TABLE]
For the second term in (19), we recall that , is a partition of , and decompose
[TABLE]
Next, using (15) and an upper-bound for the -probability of each of the two sets in (21), namely, (16) and (8), as well as , we obtain
[TABLE]
Returning to (20), from Jensen’s inequality one can show (see, e.g., (Ryabko, 2010, equation 11)) that, for any set ,
[TABLE]
Therefore, using (6), similarly to (22) we obtain
[TABLE]
Combining (19) with (20), (22) and (23) we derive
[TABLE]
setting
[TABLE]
we obtain the statement of the theorem.
It remains to come back to (15) and define the regularizer as a combination of measures from for this inequality to hold. For each , denote
[TABLE]
and let, for each , the probability measure be any probability measure from such that . Define
[TABLE]
for each , and let . For every we have
[TABLE]
for every and every , establishing (15). ∎
4 Lower bound
In this section we establish a lower bound on the regret of being Bayesian with the best prior. The bound leaves a significant gap with respect to the upper bound, but it shows that the regret of using the Bayesian predictor with the best prior for the given problem cannot be upper-bounded by a constant.
Theorem 2.
There exist a set of measures $\mathcal C$ and a measure $\rho$ such that, for every Bayesian predictor $\nu$ whose prior is concentrated on $\mathcal C$, there exist a function $\theta(n)$ which is non-decreasing and goes to infinity with $n$, infinitely many time steps $n_i$, and measures $\mu_i\in\mathcal C$ such that $L_{n_i}(\mu_i,\nu)-L_{n_i}(\mu_i,\rho)\ge\theta(n_i)$ for all $i\in\mathbb N$.
Thus, the lower bound goes to infinity with $n$ but may do so arbitrarily slowly. This leaves a gap with respect to the upper bound of Theorem 1.
Note also that this formulation is good enough to be the opposite of Theorem 1, because the formulation of the latter is strong: Theorem 1 says that for every $\mu$ and for every $n$ the regret is upper bounded, so, in order to counter that, it is enough to say that there exists $n$ and there exists $\mu$ such that the regret is lower bounded; Theorem 2 is, in fact, a bit stronger, since it establishes that there are infinitely many such $n$. However, it does not preclude that for every $\mu$ in $\mathcal C$ the loss of the Bayesian predictor is upper-bounded by a constant independent of $n$, while the loss of $\rho$ is linear in $n$. This is, in fact, the case in the proof.
Proof.
Let . Let be the set of Dirac delta measures, that is, the measures each of which is concentrated on a single deterministic sequence, where the sequences are all sequences that are 0 from some on. In particular, introduce , . Let be the set of all measures such that for some and let .
Observe that the set $\mathcal C$ is countable. It is therefore very easy to construct a (Bayesian) predictor for this set: enumerate it in any way, say $(\mu_k)_{k\in\mathbb N}$ spans all of $\mathcal C$, fix a sequence of positive weights $w_k$ that sum to 1, and let
$$\nu := \sum_{k\in\mathbb N} w_k \mu_k. \tag{26}$$
Then $L_n(\mu_k,\nu)\le -\log w_k$ for all $k\in\mathbb N$. That is, for every $\mu\in\mathcal C$ the loss of $\nu$ is upper-bounded by a constant: it depends on $\mu$ but not on the time index $n$. So, it is good for every $\mu$ for large $n$, but may be bad for some $\mu$ for (relatively) small $n$, which is what we shall exploit.
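The behaviour exploited here can be seen on a small truncated instance. The sketch below assumes a binary alphabet, base-2 logarithms, a finite truncation of the countable family of deterministic eventually-zero sequences, and geometric weights; the truncation, the weights, and all names are illustrative choices made here, not the paper's. It computes the loss of the mixture on a given deterministic sequence and compares it with the loss $n$ of the fair-coin predictor.

```python
import itertools
import math

def eventually_zero_sequences(max_len):
    """Finite truncation of the family: each tuple is followed by zeros forever.
    Tuples end in 1 (or are empty) so that no infinite sequence is listed twice."""
    seqs = [()]
    for m in range(1, max_len + 1):
        for head in itertools.product((0, 1), repeat=m - 1):
            seqs.append(head + (1,))
    return seqs

def prefix(seq, n):
    """First n symbols of the infinite sequence `seq` followed by zeros."""
    return tuple(seq[:n]) + (0,) * max(0, n - len(seq))

def mixture_loss(seqs, weights, k, n):
    """Loss at time n of the mixture of Dirac measures when sequence k is the truth:
    -log2 of the total weight of sequences agreeing with it on the first n symbols."""
    target = prefix(seqs[k], n)
    nu = sum(w for s, w in zip(seqs, weights) if prefix(s, n) == target)
    return -math.log2(nu)

def bernoulli_loss(n):
    """Loss at time n of the fair-coin predictor on any deterministic binary sequence."""
    return float(n)

seqs = eventually_zero_sequences(4)                  # 16 sequences
raw = [2.0 ** -(k + 1) for k in range(len(seqs))]    # hypothetical geometric prior
weights = [r / sum(raw) for r in raw]
```

For the last sequence in this enumeration, the mixture already pays about 8 bits at n = 3, more than the fair coin's 3 bits, yet its loss never exceeds roughly 16 bits, while the fair coin's loss keeps growing linearly.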
Observe that, since $\mathcal C$ is countable, every Bayesian predictor with its prior over $\mathcal C$ must have, by definition, the form (26) for some weights $w_k$ and some measures $\mu_k\in\mathcal C$. Thus, we fix any Bayesian predictor $\nu$ of this form.
Define $\rho$ to be the Bernoulli i.i.d. measure with the parameter 1/2. Note that
$$L_n(\mu,\rho) = n$$
for every $\mu\in\mathcal C$ and every $n\in\mathbb N$. This is quite a useless predictor; its asymptotic average error is the worst possible, 1. However, it is minimax optimal for every single time step $n$:
[TABLE]
where the $\inf$ is over all possible measures. This is why $\rho$ is hard to compete with, and, incidentally, why being minimax optimal for each $n$ separately may be useless.
For each , let be the weight that spends on the measures in the sets with , and let be the set of these measures:
[TABLE]
and
[TABLE]
By construction,
[TABLE]
Next, for each , let (these are all the sequences in with on the th position). Note that for each , while . From the latter equality, there exists and such that
[TABLE]
This, (28) and (27) imply the statement of the theorem. ∎
5 Conclusion and future work
The main result, Theorem 1, is the first one to show finite-time optimality of the Bayesian method for the prediction problem in full generality; or, perhaps, at this generality, for any learning problem. A number of important questions remain, both directly extending the result of this work and more general.
Lower bounds, necessity of the $\log n$ term. The first question is how sharp the result is. So far, the lower bound only shows that, for every prior, the Bayesian predictor may suffer more than constant regret. The question whether the $\log n$ term is necessary remains open. If it is, one can ask what is the best constant in front. In the proof of the main result, the constant comes, first of all, from all the weights used in constructing the predictor, that is, from the prior. Each of the sums in (14) contributes one or two $\log n$. The outermost is perhaps (partially) removable with some version of the doubling trick: that is, instead of summing over all time steps one would only sum over some time steps, and reuse the predictors at the remaining time steps. Yet, as the proof of Theorem 2 shows, some regret from summing up over different time steps is unavoidable. For the rest, it is less clear how to optimize them. Finally, one additional $\log n$ comes from the definition of the sets in (8), via the top line of (9). This one would be harder to remove, because the term is necessary in (22).
One could also ask how important it is to optimize this constant. First of all, of course, it is only important if the $\log n$ term is at all necessary. But if it is necessary, then the constant is important, because the optimal loss is of order $\log n$ in some commonly studied special cases of $\mathcal C$, such as i.i.d. or Markov measures. (It is worth mentioning that the known optimal predictors in these cases (Krichevsky, 1993) are, in fact, Bayesian.)
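The logarithmic loss in the i.i.d. case can be illustrated with the Krichevsky-Trofimov estimator for a binary alphabet, which is the Bayesian mixture of Bernoulli measures under the Dirichlet(1/2, 1/2) prior. The sketch below is an illustration chosen here, not a construction from the paper: it assumes base-2 logarithms and a Bernoulli source with parameter strictly between 0 and 1, and the function names are hypothetical. It computes the expected cumulative log loss exactly by summing over the number of ones.

```python
import math

def kt_log2_prob(k, n):
    """log2 of the KT probability of any fixed binary string with k ones among n symbols:
    Gamma(k + 1/2) * Gamma(n - k + 1/2) / (Gamma(1/2)^2 * Gamma(n + 1))."""
    log_p = (math.lgamma(k + 0.5) + math.lgamma(n - k + 0.5)
             - 2.0 * math.lgamma(0.5) - math.lgamma(n + 1))
    return log_p / math.log(2)

def expected_kt_loss(p, n):
    """Expected cumulative log loss (bits) of the KT predictor on Bernoulli(p) data, 0 < p < 1.
    The loss of a string depends only on its number of ones k, so we sum over k."""
    total = 0.0
    for k in range(n + 1):
        log_string = k * math.log(p) + (n - k) * math.log(1.0 - p)   # ln mu(string)
        log_binom = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        pmf = math.exp(log_binom + log_string)                       # P(k ones)
        total += pmf * (log_string / math.log(2) - kt_log2_prob(k, n))
    return total
```

Under these assumptions the loss grows like half a logarithm: quadrupling n adds about one bit, and the loss stays below $\frac12\log_2 n + 1$. This is why the constant in front of $\log n$ matters when comparing against such special cases.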
Moreover, it may be worth trying to improve the bounds specifically for the case , since in the opposite case it is not important.
Further generalizations. Some further natural and interesting generalizations are to different (or general) loss functions, as well as to infinite (countable or continuous) alphabets $\mathcal X$.
However, the most important direction for further research appears to be finding a general method of constructing a prior that results in an optimal predictor for an arbitrary class of measures $\mathcal C$. Another interesting question, mentioned in the introduction, is finding out under what conditions the Bayesian procedure, or indeed any other general method, is optimal for the non-realizable version of the problem; as discussed, some conditions are necessary, as shown in Ryabko (2016).
Finally, it is also interesting to find out to what extent the obtained result can be generalized to interactive learning problems, such as bandits, or, more generally, reinforcement learning.
References
- Cesa-Bianchi, N.; Lugosi, G. (2006). Prediction, Learning, and Games. Cambridge University Press. ISBN 0521841089.
- Diaconis, P.; Freedman, D. (1986). On the Consistency of Bayes Estimates. Annals of Statistics 14(1), pp. 1–26.
- Freedman, David A. (1963). On the asymptotic behavior of Bayes estimates in the discrete case I. The Annals of Mathematical Statistics, pp. 1386–1403.
- Freedman, David A. (1965). On the asymptotic behavior of Bayes estimates in the discrete case II. The Annals of Mathematical Statistics 36(2), pp. 454–456.
- Gray, Robert M. (1988). Probability, Random Processes, and Ergodic Properties. Springer Verlag.
- Krichevsky, R. (1993). Universal Compression and Retrieval. Kluwer Academic Publishers.
- Ryabko, D. (2017). Universality of Bayesian mixture predictors. In: Proceedings of the 28th International Conference on Algorithmic Learning Theory (ALT'17), vol. 76. Kyoto, Japan: JMLR, pp. 57–71.
- Ryabko, Daniil (2010). On Finding Predictors for Arbitrary Families of Processes. Journal of Machine Learning Research 11, pp. 581–602.
