Asymptotic nonparametric statistical analysis of stationary time series
Daniil Ryabko

TL;DR
This paper reviews asymptotic nonparametric statistical methods for stationary time series, highlighting what can and cannot be achieved with stationarity assumptions alone, including clustering, change point detection, and hypothesis testing.
Contribution
It summarizes recent results on the asymptotic consistency of algorithms for stationary time series, clarifying the limits and possibilities of statistical inference under minimal assumptions.
Findings
Certain problems like homogeneity are impossible to solve under stationarity alone.
Algorithms for clustering and change point detection can be asymptotically consistent.
A topological criterion for the existence of consistent tests is proposed.
Abstract
Stationarity is a very general, qualitative assumption that can be assessed on the basis of application specifics. It is thus a rather attractive assumption to base statistical analysis on, especially for problems for which less general qualitative assumptions, such as independence or finite memory, clearly fail. However, it has long been considered too general to allow for statistical inference to be made. One of the reasons for this is that rates of convergence, even of frequencies to the mean, are not available under this assumption alone. Recently, it has been shown that, while some natural and simple problems, such as homogeneity, are indeed provably impossible to solve if one only assumes that the data is stationary (or stationary ergodic), many others can be solved using rather simple and intuitive algorithms. The latter problems include clustering and change point estimation. In…
| | I.i.d. | Markov | Stationary ergodic |
|---|---|---|---|
| Asymmetric consistency | Test exists | No test | No test |
| Asymptotic consistency | Test exists | Test exists | No test (Theorem 4.2) |

| | I.i.d. | Markov | Stationary ergodic |
|---|---|---|---|
| Asymmetric consistency | Test exists | Test exists | No test (Proposition 6.3) |
| Asymptotic consistency | Test exists | Test exists | Open question |
This book is about making statistical inference from stationary discrete-time processes. The assumption of stationarity alone is often considered too weak to make any meaningful inference. Here this view is challenged by showing that, while some rather basic problems indeed can be proven not to admit any solution in this setting, surprisingly many are solvable without any further assumptions. These include such complex problems as clustering and change-point analysis. Some general results characterizing those problems that admit a solution are also presented.
The material in this volume is presented in a way that presumes familiarity with basic concepts of probability and statistics, up to and including probability distributions over spaces of infinite sequences. All the required background material can be found in the excellent monograph Gray:88 , which also contains a much deeper exposition of some of the key concepts used here, such as the distributional distance. Familiarity with ergodic theory is not required for understanding the material exposed in the present volume. Indeed, with two exceptions, the proofs do not rely on any facts deeper than the convergence of frequencies. One exception is Chapter 4, which deals with hypothesis testing and provides a characterisation of hypotheses for which consistent tests exist; the required background material for this chapter can be found in Chapter 1. The other exception is Section 4, which establishes impossibility of discrimination between process distributions; this section is self-contained. The reader who is familiar with ergodic theory and feels the exposition in this volume is somewhat unorthodox, can find all the necessary links to the more familiar framework in Shields:96 ; the latter book is also recommended to anyone seeking a deeper understanding of such results as the slow convergence of frequencies and entropy estimates, the classic ergodic theorem and much more.
This book is organized as follows. Chapter 0 is introductory: besides providing some motivation for studying the problems addressed, it also introduces in an informal manner the main concepts used and the main results presented. Chapter 1 introduces the notation and definitions used in the subsequent chapters, as well as some necessary background material. Chapter 2 considers the most basic problems of statistical inference, on which the rest of the volume builds: estimating a distance between processes (the distributional distance) and the problem of homogeneity testing or process discrimination, which, crucially for the subsequent problems addressed, is shown to be impossible to solve in the general setting of this book. Chapter 3 is devoted to clustering and change-point problems, which can be solved, or, in some cases, can be shown to admit no solution, based on the result of the preceding chapter. Chapter 4 addresses the problem of hypothesis testing in the general form: studying which pairs of hypotheses admit a consistent test. Finally, Chapter 5 discusses various generalizations of the presented results, as well as some directions for future research.
Acknowledgements
Thanks to Léon Bottou for giving me the idea to write a book on this subject and for encouraging me to do it. Thanks to Boris Ryabko and Azadeh Khaleghi, in collaboration with whom some of the results presented here were obtained.
Santa Cruz de la Sierra Daniil Ryabko
Contents
2 What is possible and what is not possible to infer from stationary processes
1 Estimating the distance between processes and reconstructing a process
3 Extensions: unknown $k$, online clustering and clustering with respect to independence
Known number of distributions, unknown number of change points
2 Characterizing hypotheses for which consistent tests exist
Chapter 0 Introduction
This book is about making statistical inference from discrete-time processes under what is perhaps the weakest of statistical assumptions: stationarity. Before embarking on this journey, it is worth asking the question of why it is interesting to study statistical problems under this assumption alone, or under similar related assumptions. To answer this question, one should first consider what it means to have a good set of assumptions, or a good model, for a statistical problem at hand.
Choosing the right assumptions presents the following trade-off. On the one hand, making strong assumptions makes the inference task easier and allows one to obtain stronger performance guarantees for the algorithms developed. For example, by assuming that the data are independent and identically distributed (i.i.d.), one gets at one’s disposal an extremely versatile statistical toolkit that is a result of centuries of research on this model. With this, it is possible to obtain sharp bounds on error probabilities of the resulting methods. Even stronger results can be obtained if one further makes parametric assumptions. On the other hand, all such results are useless if the assumptions made do not hold for the data at hand. Of course, one can try to apply a statistical test to the data in order to verify the validity of one or another model. This, however, only pushes back the problem, because to use a test one needs to make another set of assumptions, called the alternative. Indeed, it is not possible to test, based on data, that the assumption holds versus that it does not hold. For example, it is not possible to test that the data are Gaussian i.i.d. versus the distribution of the data is anything else except Gaussian i.i.d. This is because the alternative “anything else” is too general and includes, for example, such distributions as the one that is concentrated precisely on the data available. It is, however, possible to design a test for the hypothesis “the data are Gaussian i.i.d.” versus “the data are i.i.d. but not Gaussian” or “the data are i.i.d.” versus “the distribution of the data is stationary.” In other words, it may be possible to test a set of assumptions $H_0$ versus an alternative set of assumptions $H_1$. The latter is typically much more general; in fact, one is interested in making it as general as possible. Nonetheless, the alternative hypothesis is still a set of assumptions.
And so we are back to the question of how one can select a model or a set of assumptions for the data one has. Here we need to admit that this question brings us outside of the realm of mathematics. The answer is simply that one should make assumptions that one can reasonably expect to hold based on the specifics of the target application. Thus, the assumptions should be qualitative, natural and simple — utterly unmathematical terms, but such is the problem. Otherwise, there is little hope to be able to say whether the model is adequate for any given application. A good example are assumptions based on independence. Indeed, this must be one of the reasons why independent and identically distributed data are so widely studied: it is often possible to tell whether the application produces data that are independent or that are not independent. Other models that are based on independence are Markov chains and, more generally, Bayesian networks.
Unfortunately, there are not many alternatives to independence-based models. Thus, if the data are utterly and completely dependent, as perhaps are most of the data in the world, a statistician is a bit short of options. A common generalisation to resort to in such cases is one of the various mixing assumptions. These allow one to extend the tools and methods developed for i.i.d. data to the cases of carefully constrained dependence. However, mixing assumptions are neither verifiable against a general alternative (such as stationarity) nor, to say the least, easy to assess informally from the data.
Stationarity is perhaps the only general non-parametric model that is not based on independence, and which is also qualitative, natural and simple to assess from data. Next we take a brief and informal look at stationarity and associated concepts.
1 Stationarity, ergodicity, AMS
Very informally, assuming that the data are stationary means assuming that the time index itself bears no information. Thus, it does not matter whether the data we see are $X_1,\dots,X_n$ or they are in fact $X_{t+1},\dots,X_{t+n}$ for some shift $t$. I.i.d. data obviously satisfy this assumption, as do, with some minor tweaks to be discussed below, most other models in wide use, such as Markov chains. Thus, stationarity may be used as an alternative hypothesis for testing other models. It is also suited for the cases when one knows next to nothing about the data, and thus wishes to make as few assumptions as possible. In fact, the assumption is so general that one wonders whether any inference is possible under stationarity alone. Indeed, if any inference is possible at all, it is due to the associated property of ergodicity.
A process is ergodic if the frequency of every finite-time event almost surely converges to a constant. Thus, for binary-valued processes, the frequency of any word, such as 0, 01, or 011010, converges to some constant. We cannot say anything about the speed of this convergence, but the asymptotic property is already enough to make inference. The ergodic decomposition theorem establishes that every stationary process is a mixture of processes that are stationary and ergodic. Thus, a stationary process can be thought of as, first, before we start observing the data, drawing a stationary ergodic process (according to some prior distribution over such processes) and then using this stationary ergodic process to generate the data. To put it simpler: whenever we observe a stationary process, we observe, in fact, a stationary ergodic process. Thus, for most practical as well as many theoretical considerations, a stationary process is a stationary ergodic process.
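The two-stage picture of a stationary process described above can be illustrated with a small simulation. This is only a sketch under assumptions that are not in the text: the ergodic components are two i.i.d. coins with biases 0.2 and 0.8, the prior picks each with probability 1/2, and the function name `sample_stationary_mixture` is hypothetical.

```python
import random

def sample_stationary_mixture(n, rng):
    """Draw one realization of a (non-ergodic) stationary mixture of two coins."""
    # Stage 1: draw the ergodic component, here the bias of an i.i.d. coin.
    p = rng.choice([0.2, 0.8])
    # Stage 2: generate the whole sample from that single component.
    return p, [1 if rng.random() < p else 0 for _ in range(n)]

rng = random.Random(0)
results = []
for _ in range(5):
    p, x = sample_stationary_mixture(20000, rng)
    results.append((p, sum(x) / len(x)))

for p, freq in results:
    print(f"component p={p}, observed frequency of 1: {freq:.3f}")
```

In each run the frequency of the symbol 1 settles near the bias of the component that was drawn, not near the mixture average 0.5, which is the sense in which an observed stationary process behaves like a stationary ergodic one.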
Note that an ergodic process does not have to be stationary. A good example of an ergodic non-stationary process is a finite-state connected Markov chain with an initial distribution on the states that is different from the stationary distribution. Asymptotically, this process is equivalent to the Markov chain with the initial distribution taken to be the stationary distribution. One can take mixtures of ergodic processes, obtaining processes that are called asymptotically mean stationary or AMS. An AMS process is such that the frequencies of all finite-time events converge almost surely (but not necessarily to a constant). Since the definition of an ergodic process only involves its asymptotic properties, all the inference one can make about such processes concerns their asymptotic behaviour. In this (asymptotic) sense, similar to stationary processes, an AMS process can be thought of, very roughly, as first drawing an ergodic process (according to some prior distribution over such processes) and then using this ergodic process to generate the data. Again, for most purposes AMS processes are ergodic processes. In turn, ergodic processes are a certain generalisation of stationary ergodic processes: as in the Markov-chain example, they are asymptotically equivalent. Another example one can think of is taking a realization of a stationary ergodic process and adding some arbitrary prefix to it, or doing to it anything else that does not affect asymptotic frequencies.
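The Markov-chain example can be checked numerically. The sketch below assumes a two-state chain with hypothetical transition probabilities 0.1 (from 0 to 1) and 0.3 (from 1 to 0), started deterministically in state 1, so that the initial distribution differs from the stationary one; none of these numbers come from the text.

```python
import random

def simulate_chain(n, start, rng, p01=0.1, p10=0.3):
    """Simulate n steps of a two-state Markov chain started in `start`."""
    state, path = start, []
    for _ in range(n):
        path.append(state)
        # Probability of leaving the current state.
        leave = p01 if state == 0 else p10
        if rng.random() < leave:
            state = 1 - state
    return path

rng = random.Random(1)
path = simulate_chain(200000, start=1, rng=rng)

freq_one = sum(path) / len(path)      # empirical frequency of state 1
stationary_one = 0.1 / (0.1 + 0.3)    # stationary probability of state 1
print(freq_one, stationary_one)
```

Although the chain is not stationary (it always starts in state 1), the frequency of state 1 approaches the stationary probability 0.25, as it would for the stationary version of the same chain.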
It is worth emphasizing that, with the exception of stationarity itself, all the definitions we are using only tell us something about asymptotic properties of a process; moreover, for the purposes of statistical inference that we shall be exploring, stationarity can only be used in conjunction with or via ergodicity (which, fortunately, can always be presumed via the ergodic decomposition theorem mentioned). Therefore, any results we should expect shall also be about asymptotic properties of the algorithms that we shall construct.
Thus, there will be little difference for us in the course of this volume between ergodic processes and stationary ergodic processes, and between stationary processes and AMS processes. In fact, most of the results of this book do not require any other assumption than AMS or ergodicity. Thus, they can be thought of as answering the question:
Stationarity-based statistical inference: What statistical inference can one make under the only assumption that frequencies converge, without any guarantees on the speed of this convergence?
One exception is Chapter 4, where we do need our processes to be stationary, and use some deeper results of ergodic theory. The other exception is the impossibility result concerning process discrimination (along with its implications), which applies to an even smaller class of processes; since it is an impossibility result, this makes it stronger.
The main difference between the problems of statistical inference addressed in this volume and those studied in the vast majority of statistical literature is the lack of any guarantees on the speed of convergence that one can use.
In contrast, independence-based methods rely heavily on concentration-of-measure results that are used to bound the speed of convergence and, consequently, open the possibility to obtain finite-time bounds on the error of the resulting algorithms. In fact, the (conditional) independence assumptions are typically not used directly but rather through concentration of measure results. Mixing assumptions provide a generalisation that allows one to forego independence but still use the corresponding speed of convergence guarantees. Thus, one can think of independence-based models and their generalizations as studying the following general question:
Independence-based statistical inference: What statistical inference can one make under the assumption that frequencies converge and the speed of this convergence can be bounded?
We shall see in this book that the difference between these two general questions is smaller than one might think, but sometimes the contrast between what is possible and what is not possible to do without any speeds of convergence is rather striking and even counter-intuitive.
2 What is possible and what is not possible to infer from stationary processes
It appears that, with the exception of the problem of probability forecasting to be mentioned below in this section, the prevailing view in the literature is that assuming only that a process is stationary and ergodic is not enough to make statistical inference. This view may stem in part from the rather influential 1990 paper Ornstein:90 by Ornstein and Weiss. This paper is full of deep and insightful results about $B$-processes, which form a set of processes smaller than that of stationary ergodic processes, but is rather dismissive of the general case. In particular, it makes statements such as “In general, one cannot hope to guess the long-term behaviour from finite information” (referring to the non-$B$ case); “If a totally ergodic process is not $B$, then it cannot be approximated arbitrarily well by $k$-step Markov processes.” The work Ornstein:90 goes further in this direction when it considers the problem of discrimination between two processes. This problem, also known as homogeneity testing, consists in telling, given two finite samples whose length, in this setting, is allowed to grow to infinity, whether they were generated by the same or different process distributions. It is stated in Ornstein:90 that, outside of the class of $B$-processes, even this simple “yes-no” question of “same-different” cannot be answered in an effective way. However, the example used to demonstrate this statement only shows that it is not possible to estimate a certain distance, called the $\bar d$ distance, between stationary ergodic processes that are not $B$. This is a rather different statement, and one made about a different problem: indeed, in order to answer the “same-different” question, one might try to estimate any other distance or, more generally, use any algorithm whatsoever. Thus, the statement made in Ornstein:90 about the problem of discrimination can at most be considered a conjecture.
The $\bar d$ distance is also crucial to understanding the previous statements made: it is not possible to approximate a stationary ergodic process with $k$-step Markov processes in $\bar d$-distance, or to construct any other estimate of such a process that would be asymptotically consistent in terms of this distance.
The picture changes dramatically if we change the distance between processes that we are trying to estimate. As we are going to see in this volume, using a different distance, it is possible to construct asymptotically consistent estimates of the distribution of an arbitrary stationary ergodic process, as well as to solve a variety of other interesting statistical problems. The distance we are going to use is well known, but had somehow remained largely unused. Gray (Gray:88) calls it distributional distance, and this is the name we shall use here, despite its apparent ambiguity: indeed, it may seem to refer to any distance between distributions. As for the problem of discrimination between process distributions, it turns out that indeed, as conjectured by Ornstein and Weiss (Ornstein:90), it does not admit a solution if we only assume that the distributions are stationary ergodic. Interestingly, the same impossibility result holds for the smaller class of $B$-processes as well, for which it is possible to estimate the $\bar d$ distance, as shown in the same work Ornstein:90. Thus, no amount of data may be sufficient to answer the simple “same-different” question about two process distributions. This result is formally demonstrated in Section 4.
Since these two problems, distance estimation and discrimination between processes, are crucial for the development of the material presented here, let us look at them in some more detail.
Recall that one distance (or a metric; all distances considered in this volume are metrics unless stated otherwise) is weaker, in the topological sense, than another, if every sequence that converges in the latter converges in the former, but the opposite does not necessarily hold. (Here we are only concerned with separable metric spaces.) Thus, it is “easier” for a sequence to converge in a weaker distance, which makes it easier to construct a sequence of estimates of a process that converges to this process. Likewise, given two data sequences, a weaker distance between the process distributions that generate these sequences is easier to estimate. The distributional distance is weaker, in the topological sense, than the $\bar d$-distance. (To make complete sense of this sentence, we would need to define the distances formally first, which is done in the next chapter. We shall see that the definition of the distributional distance is ambiguous: it depends on a set of parameters, changing which may change the resulting topology. However, it is possible to make this statement formally correct.) It is thus reasonable to expect that the former can be estimated for a larger class of processes than the latter. Indeed, as is shown in this volume, the distributional distance can be estimated for stationary ergodic processes, while, as is shown in Ornstein:90, the $\bar d$-distance can be estimated for the smaller set of $B$-processes but not for stationary ergodic processes. The strongest possible distance is the discrete 0-1 distance, which takes the value 0 if and only if two distributions are the same and 1 otherwise. It is this distance that we are trying to estimate when answering the “same-different” question of process discrimination. It thus should be of no surprise that it is not possible to estimate it even for $B$-processes, even though it is possible to estimate it for smaller classes, such as, for example, i.i.d. processes.
For many different problems, however, it is enough to have consistent estimates of at least some distance between process distributions, and thus it makes sense to prefer weaker distances, since this allows one to consider wider sets of processes. We shall review shortly which problems of inference can be solved using consistent estimates of distributional distance (or, indeed, of any distance between process distributions).
Taking a different look at the problem of process discrimination, one can see that it is linked to another fundamental impossibility result: the impossibility to establish the speed of convergence, say, of frequencies. The way we have defined ergodic processes, as all processes for which frequencies converge a.s. to a constant, makes it evident that this convergence may be arbitrarily slow, so there is no guarantee on the speed. It is not so evident that such a guarantee does not exist if we consider the set of all stationary ergodic processes (that is, adding the requirement of stationarity). The proof of the fact that indeed the convergence of frequencies can be arbitrarily slow for stationary ergodic processes can be found, for example, in the excellent monograph Shields:96, which also demonstrates the equivalence of the (unorthodox) definition that we adopt here to the more common one formulated in terms of shift-invariant sets. Imagine now an algorithm that tries to solve the discrimination problem based on (consistent) estimates of some distance. It makes these estimates based on samples of growing size $n$. Suppose that these estimates keep approaching 0, let us say, exponentially fast in $n$. At some point one should reasonably expect the algorithm to say that the samples were generated by the same distribution. Suppose the estimated distance at this point is $\varepsilon$. From this point on, imagine that, as the sample size continues to grow, the estimate does not decrease at all but just stays at $\varepsilon$. Then, at some point, we should expect the algorithm to change its mind and to say that the samples were generated by different distributions, at which point the estimates start decreasing again. Since there is no guarantee on the speed of convergence (of anything), there is no way to ensure that the behaviour outlined cannot happen.
In fact, the proof of the impossibility result is based on constructing, for any algorithm that presumably solves the problem of discrimination (and that may or may not be based on distance estimates), a process that tricks it into changing its mind ad infinitum in this fashion.
More generally, from the discussion above on the absence of speed of convergence guarantees, it should already be clear that:
Every algorithm that we may construct shall only have asymptotic performance guarantees in the considered setting. No finite-time bounds on the probability of error are possible.
From the practical point of view this is not in itself a hindrance: what the fact that a result is asymptotic means, in practice, is that it holds when the data samples are large enough.
The only exception, where we do obtain results about what happens at every time step, is hypothesis testing. Here one may wish to invert the question, by asking for which process distributions we can have a certain level of error at a certain finite time. These questions are considered in Chapter 4.
Having outlined the general framework and the main impossibility results, let us now briefly review the highlights of what is possible to achieve for stationary or stationary and ergodic processes.
Perhaps the one important problem concerning stationary processes that has not been deemed too difficult to solve and thus gained a fair bit of attention in the literature is the problem of prediction or probability forecasting. It consists in forecasting the probability of the next outcome $X_{n+1}$ conditional on the past observations $X_1,\dots,X_n$, where the sequence $X_1,\dots,X_n,\dots$ is generated by an unknown stationary (ergodic) process distribution. This problem is of great practical importance, not least because it is intimately connected to the problem of data compression. Ample literature on this problem and its variations exists, which is why we do not cover it in this volume. This literature goes as far back as Ornstein:78 for the prediction with the growing past problem, and includes BRyabko:88, which solves the forward-prediction problem for finite-alphabet processes, Algoet:92 for real-valued processes, as well as BRyabko:09; Morvai:96; Morvai:97; BRyabko:16 and others.
The problems covered in this volume are outlined in the next section.
3 Overview of the inference problems covered
The first group of problems considered are those that are based directly on estimating a distance between process distributions. Since we have an asymptotically consistent estimator of the distributional distance, we can answer questions of the form: given three samples $X$, $Y$, $Z$, say whether the distribution of the process that generates $X$ is closer to the distribution of $Y$ or to the one of $Z$. The answer will be correct as long as the samples are long enough (that is, asymptotically correct). Some forms of this problem are known as process classification or the three-sample problem, and this is an example of a problem that we can solve. It generalizes to the problem of clustering: given $N$ samples generated by $k$ different, unknown, stationary ergodic distributions, cluster them into $k$ groups according to the distribution that generates them. Note that this problem can only be solved if $k$ is known. Indeed, the problem of discrimination corresponds to clustering just two samples, but with unknown $k$ (either 1 or 2), and already this case, as we have seen, has no solution.
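The three-sample question can be sketched in a few lines of code. The distributional distance is only defined formally later in the volume, so the estimate below is an assumed stand-in: a truncated sum of differences of word frequencies with weights $2^{-k}$ for words of length $k$. The function names, the weights and the truncation `max_len` are all hypothetical choices made for illustration, and the test data are i.i.d. coins (a special case of stationary ergodic processes).

```python
import random
from collections import Counter

def word_frequencies(x, k):
    """Empirical frequencies of all length-k words occurring in x."""
    windows = len(x) - k + 1
    if windows <= 0:
        return {}
    counts = Counter(tuple(x[i:i + k]) for i in range(windows))
    return {word: c / windows for word, c in counts.items()}

def empirical_distance(x, y, max_len=3):
    """Truncated, weighted sum of word-frequency differences (assumed form)."""
    total = 0.0
    for k in range(1, max_len + 1):
        fx, fy = word_frequencies(x, k), word_frequencies(y, k)
        for word in set(fx) | set(fy):
            total += 2.0 ** (-k) * abs(fx.get(word, 0.0) - fy.get(word, 0.0))
    return total

def closer_sample(x, y, z):
    """Return 'Y' if x looks closer to y than to z, and 'Z' otherwise."""
    return "Y" if empirical_distance(x, y) <= empirical_distance(x, z) else "Z"

rng = random.Random(0)

def coin(p, n):
    """An i.i.d. binary sample with probability p of the symbol 1."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

x, y, z = coin(0.2, 5000), coin(0.2, 5000), coin(0.7, 5000)
answer = closer_sample(x, y, z)
print(answer)
```

Note that the procedure only ever compares two estimated distances; it never has to decide whether a distance is exactly zero, which is what makes this problem solvable while discrimination is not.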
The next problem to consider is change-point estimation. A sample
$$X_1,\dots,X_\tau,X_{\tau+1},\dots,X_n$$
is the concatenation of two samples $X_1,\dots,X_\tau$ and $X_{\tau+1},\dots,X_n$ generated by different stationary ergodic distributions. It is required to find or to approximate the change point $\tau$. This is possible to do with an algorithm that essentially outputs the point that maximizes the estimated distance between what is before and after it in the sample. On the other hand, the related problem of change-point detection, which consists in saying whether the sample is generated by the same distribution or there is a change of distribution somewhere, admits no solution. A generalisation of these problems to the case of multiple change points presents a delicate interplay between what is possible and what is not. We only briefly review the corresponding results in this volume (Section 2), referring the interested reader to the papers that present the full proofs Khaleghi:14; Khaleghi:15chp; Khaleghi:12mchp.
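The idea of picking the point that maximizes the estimated distance between the two sides can be sketched as follows. As before, the distance estimate is an assumed stand-in for the distributional distance (truncated word-frequency differences with weights $2^{-k}$), and the `margin` and `step` parameters, which keep candidates away from the ends of the sample and thin out the search grid, are hypothetical conveniences rather than part of any algorithm stated in the text.

```python
import random
from collections import Counter

def word_frequencies(x, k):
    """Empirical frequencies of all length-k words occurring in x."""
    windows = len(x) - k + 1
    if windows <= 0:
        return {}
    counts = Counter(tuple(x[i:i + k]) for i in range(windows))
    return {word: c / windows for word, c in counts.items()}

def empirical_distance(x, y, max_len=2):
    """Truncated, weighted sum of word-frequency differences (assumed form)."""
    total = 0.0
    for k in range(1, max_len + 1):
        fx, fy = word_frequencies(x, k), word_frequencies(y, k)
        for word in set(fx) | set(fy):
            total += 2.0 ** (-k) * abs(fx.get(word, 0.0) - fy.get(word, 0.0))
    return total

def estimate_change_point(x, margin=0.1, step=100):
    """Candidate split that maximizes the distance between the two sides."""
    n = len(x)
    lo, hi = int(margin * n), int((1 - margin) * n)
    candidates = range(lo, hi + 1, step)
    return max(candidates, key=lambda t: empirical_distance(x[:t], x[t:]))

rng = random.Random(0)
before = [1 if rng.random() < 0.1 else 0 for _ in range(3000)]
after = [1 if rng.random() < 0.9 else 0 for _ in range(3000)]
estimate = estimate_change_point(before + after)
print(estimate)
```

When a candidate split is misplaced, one of the two sides mixes data from both distributions, which pulls its word frequencies toward those of the other side and lowers the estimated distance; this is why the maximum tends to sit near the true change point.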
As discussed above, one of the main reasons to study such general models as stationarity is to be able to use them as an alternative hypothesis in order to verify the validity of a smaller model. Thus, one may wish to test a hypothesis $H_0$, which is a subset of the set of all stationary ergodic process distributions, against its complement to this set, or against a different subset of it. An example is testing “the process is i.i.d.” versus “the process is stationary ergodic and not i.i.d.” As we have seen above, some rather simple hypotheses, such as process discrimination (known in the context of hypothesis testing as the hypothesis of homogeneity: a hypothesis about a pair of processes that states that they have the same distribution) do not admit a consistent test, even in a very weak asymptotic sense. Yet, as we shall see, some other hypotheses of practical significance, such as that the process is i.i.d. or that it is Markov, do admit a consistent test against their complement in the set of all stationary ergodic processes. Thus, it appears interesting to study the general question of which hypotheses do and which do not admit a consistent test. This is what we do in Chapter 4. The main result is a topological “if and only if” criterion for the existence of a consistent test of an arbitrary subset of the set of all stationary ergodic processes against its complement. At the same time, a number of important and interesting questions remain open. In particular, this is the only chapter where we restrict the consideration to finite-alphabet processes, leaving the general case open for further research. Some of the interesting open problems related to hypothesis testing are presented at the end of Chapter 4, while some more general ones are deferred to Chapter 5, which is devoted to generalizations.
Chapter 1 Preliminaries
To simplify the exposition, we are considering (stationary ergodic) processes with the alphabet $A=\mathbb{R}$ or, in some cases, a finite set $A$. The generalization from $A=\mathbb{R}$ to $A=\mathbb{R}^d$ is straightforward; moreover, the results can be extended to the case when $A$ is a Polish (complete separable metric) space. The symbol $A^*$ is used for $\bigcup_{i=1}^{\infty}A^i$. Elements of $A^*$ are called words or sequences.
Let $\mathcal{B}^{(n)}$ be the Borel sigma-algebra of $A^n$, and $\mathcal{B}^{(\infty)}$ the Borel sigma-algebra of $A^\infty$. Let also .
Time-series distributions, process distributions or simply processes are probability measures on $(A^\infty,\mathcal{B}^{(\infty)})$.
We will be speaking about samples, typically denoted $X$ or $Y$, taking values in $A^*$. This is a short-hand notation for expressions like $X_{1..n}=(X_1,\dots,X_n)$ or $Y_{1..m}=(Y_1,\dots,Y_m)$, where $n$ and $m$ are the lengths of the samples. The samples that we shall be considering are to be generated by process distributions, usually stationary or stationary ergodic, typically denoted $\rho_X$ or $\rho_Y$ (or other Greek letters) to make clear which sample they generate. This means that, say, $\rho_X$ is a (stationary ergodic) probability distribution over $A^\infty$, and thus we are speaking about an $A^\infty$-valued random variable $X_1,X_2,\dots$ of which $X$ is the initial segment of length $n$.
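Definition 0.1.
For a sequence $X=X_{1..n}$ taking values in $A^n$ and a measurable $B\subset A^k$ with $k\le n$, denote by $\nu(X,B)$ the frequency with which the sequence $X$ falls in the set $B$:
$$\nu(X,B):=\frac{1}{n-k+1}\sum_{i=1}^{n-k+1}\mathbb I\{(X_i,\dots,X_{i+k-1})\in B\}.$$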
For example,
[TABLE]
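The frequency of Definition 0.1 can be computed with a sliding window. The following Python sketch is only an illustration: the function name `frequency`, the representation of the set as a membership predicate `in_B`, and the normalisation by the number $n-k+1$ of windows are assumptions of this sketch rather than notation fixed by the text.

```python
from typing import Callable, Sequence

def frequency(x: Sequence[float], k: int,
              in_B: Callable[[Sequence[float]], bool]) -> float:
    """Fraction of the n-k+1 length-k windows of x that fall in the set B."""
    n = len(x)
    if n < k:
        return 0.0  # assumed convention: no window fits, so the frequency is 0
    hits = sum(1 for i in range(n - k + 1) if in_B(x[i:i + k]))
    return hits / (n - k + 1)

def in_square(window: Sequence[float]) -> bool:
    """Membership in B = [1.0, 2.0] x [1.0, 2.0] (a hypothetical test set)."""
    return all(1.0 <= v <= 2.0 for v in window)

x = [0.5, 1.5, 1.2, 1.4, 2.1]
print(frequency(x, 2, in_square))  # 0.5
```

Of the four consecutive pairs in this sample, exactly two, (1.5, 1.2) and (1.2, 1.4), lie in the square, which gives the value 1/2.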
1 Stationarity, ergodicity
A process $\rho$ is stationary if for any $i,j\in\mathbb N$ and any measurable $B\subset A^j$, we have
$$\rho(X_{1..j}\in B)=\rho(X_{i..i+j-1}\in B).$$
A process $\rho$ is called ergodic if for every measurable $B\subset A^k$, $k\in\mathbb N$, there exists a constant $c_B$ such that with probability 1 we have
$$\lim_{n\to\infty}\nu(X_{1..n},B)=c_B.$$
A process is called stationary ergodic if it is stationary and ergodic. The following statement follows from the ergodic theorem.
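Theorem 1.1 (ergodic theorem).
For every stationary ergodic process $\rho$, we have
$$\lim_{n\to\infty}\nu(X_{1..n},B)=\rho(B)\quad\text{a.s. for every measurable }B\subset A^k,\ k\in\mathbb N.$$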
The proof of the ergodic theorem can be found, for example, in Gray:88 ; Shields:96 . The latter monograph also provides the connection to the more traditional way of defining ergodicity (in terms of shift-invariant sets); in particular, it demonstrates that the two approaches are equivalent.
The symbol $\mathcal S$ is used for the set of all stationary processes on $A^\infty$, and the symbol $\mathcal E$ for the set of all stationary ergodic processes.
The set of all process distributions over $A^\infty$ can be endowed with the structure of a probability space, whose sigma-algebra can be taken to be the Borel sigma-algebra with respect to the distributional distance defined in Section 2 below.
The link between stationary and stationary ergodic processes is provided by the so-called ergodic decomposition theorem, which states that every stationary process is a mixture of stationary ergodic processes.
Theorem 1.2 (Ergodic decomposition).
For any $\rho\in\mathcal S$ there is a measure $W_\rho$ on $\mathcal S$ (with the Borel sigma-algebra just described), such that $W_\rho(\mathcal E)=1$ and
$$\rho(B)=\int dW_\rho(\mu)\,\mu(B)$$
for every measurable $B\subset A^\infty$.
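A simple way to see the decomposition at work is a mixture of two coins: a coin is chosen once at random and then flipped forever. The resulting process is stationary but not ergodic, and its ergodic components are the two i.i.d. coin processes. The Python sketch below is an illustration under our own choice of names and parameters (`sample_mixture`, the biases 0.2 and 0.8); it shows that the frequency of 1s converges to the bias of the selected component rather than to the mixture probability 0.5.

```python
import random

def sample_mixture(n, rng, biases=(0.2, 0.8)):
    """Pick one coin at random, then emit n i.i.d. flips of that coin only."""
    p = rng.choice(biases)
    return p, [1 if rng.random() < p else 0 for _ in range(n)]

rng = random.Random(0)
for _ in range(4):
    p, xs = sample_mixture(20000, rng)
    # The empirical frequency tracks the selected component p, not 0.5.
    print(p, sum(xs) / len(xs))
```

Each run therefore looks like a sample from a single ergodic component, which is why results stated for stationary ergodic processes often carry over to ergodic components of stationary processes.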
Furthermore, a process $\rho$ is called asymptotically mean stationary, or AMS for short, if, for every measurable $B\subset A^k$, $k\in\mathbb N$, the frequency of $B$ converges with probability 1. These limiting frequencies define the stationary measure $\bar\rho$, which, according to the preceding theorem, admits an ergodic decomposition. Asymptotically, $\rho$ and $\bar\rho$ are equivalent, and thus there will be little distinction between the two for us in this volume. For a detailed exposition of these results the reader is referred to Gray:88 , in particular to (Gray:88, Theorem 7.4.1), which establishes ergodic decomposition for AMS processes.
2 Distributional distance
The general definition of the distributional distance is as follows.
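Definition 2.1 (distributional distance).
Let $B_k$, $k\in\mathbb N$, be a set of finite-time events, each $B_k$ a measurable subset of $A^m$ for some $m\in\mathbb N$, that generates the Borel sigma-algebra of $A^\infty$, and let $w_k$ be a sequence of positive reals such that $\sum_{k=1}^\infty w_k=1$. For a pair of processes $\rho_1,\rho_2$ the distributional distance is defined as
$$d(\rho_1,\rho_2):=\sum_{k=1}^\infty w_k\,|\rho_1(B_k)-\rho_2(B_k)|.\tag{2}$$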
Note that there are two sets of parameters in this definition, $w_k$ and $B_k$, which we shall now specify. Let us first fix
$$w_k:=\frac{1}{k(k+1)}.\tag{3}$$
The choice of the sets is more significant. Different choices may result in different topologies. In particular, some choices of make the set of all process distributions compact with the topology of the distributional distance . This is the case if the set is a standard basis of . While there is a standard basis for in the case of , unfortunately, as Gray Gray:88 notes, there is no easy construction for such a basis even for the space of reals . In this volume, we shall not make much use of the notion of standard basis, but it will be important for us to have empirical estimates of the distributional distance. Therefore, we shall fix a specific choice of the sets for the case of discrete alphabets and for the case (which is easily generalisable to , ); we shall also make the definition of the distributional distance more specific reflecting these choices.
Definition 2.2 (Distributional distance for finitely-valued processes).
Let the alphabet $A$ be finite. Define
$$d(\rho_1,\rho_2):=\sum_{l=1}^\infty w_l\sum_{B\in A^l}|\rho_1(B)-\rho_2(B)|.\tag{4}$$
While equivalent to the general one, this more specific formulation is better suited for constructing practical algorithms: we take the differences in probabilities of each word of length $l$, and then take a weighted sum over all $l$.
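To make the finite-alphabet formulation concrete, the distance can be evaluated (up to truncation of the outer sum) whenever word probabilities are known in closed form. The Python sketch below does this for two i.i.d. binary processes. The weights $w_l=1/(l(l+1))$, the truncation level `max_len`, and the function names are assumptions of this sketch.

```python
from itertools import product

def word_prob(word, p):
    """Probability of a binary word under an i.i.d. process with P(1) = p."""
    ones = sum(word)
    return p ** ones * (1 - p) ** (len(word) - ones)

def dist_iid(p, q, max_len=10):
    """Distributional distance between two i.i.d. binary processes,
    with the sum over word lengths truncated at max_len."""
    total = 0.0
    for l in range(1, max_len + 1):
        w = 1.0 / (l * (l + 1))  # assumed weight w_l
        total += w * sum(abs(word_prob(u, p) - word_prob(u, q))
                         for u in product((0, 1), repeat=l))
    return total

print(dist_iid(0.3, 0.3))  # identical processes: 0.0
print(dist_iid(0.3, 0.7))  # distinct processes: strictly positive
```

The inner sum over all $2^l$ words grows exponentially in $l$, which is one reason the empirical versions considered later only sum over words that actually occur in the samples.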
For real-valued processes, we shall fix the usual set of cylinders to put in the distributional distance. Consider the sets $B^{m,l}$, $m,l\in\mathbb N$, which are obtained via the partitioning of $\mathbb R^m$ into cubes of dimension $m$ and volume $2^{-ml}$, starting at the origin, and enumerated clockwise in each direction.
Definition 2.3 (Distributional distance for real-valued processes).
Let $A=\mathbb R$. Define
$$d(\rho_1,\rho_2):=\sum_{m,l=1}^\infty w_mw_l\sum_{B\in B^{m,l}}|\rho_1(B)-\rho_2(B)|.\tag{5}$$
The general formulation (2) is more compact and thus more convenient for the theoretical analysis; we shall therefore use it in the proofs, while still assuming the concrete choice of the parameters $w_k$ and $B_k$ whenever necessary. The more specific formulations (4) and (5) are more convenient for constructing algorithms and empirical estimates.
Note, however, that the definition (5) is not exactly equivalent to the general definition (2). Indeed, each of the sets $B^{m,l}$ is infinite, and all the individual sets inside of $B^{m,l}$ are assigned the same weight $w_mw_l$. This is not a problem, since the total $\rho_1$- as well as $\rho_2$-probability of all the sets in $B^{m,l}$ is 1. Indeed, it is a simple exercise to check that the proofs in the subsequent chapters go through for either of the definitions. We therefore take the liberty to use the definition (2) in the proofs, but refer to the more specific definition (5) when speaking about the algorithms. The unconvinced reader may note that the sets $B^{m,l}$ can be made finite but growing with $m,l$, i.e., defined so as to cover growing parts of the space $\mathbb R^m$ with finer partitions, leaving all the rest of the space as a single element of $B^{m,l}$. This way, the partitions become finite and the triple sum in (5) can be converted back to the single sum in (2) with a different choice of the weights $w_k$.
It is easy to see that is a metric (with any choice of the parameters). When talking about closed and open subsets of we assume the topology of . With this topology, the space of process distributions is separable. The set of stationary distributions is its closed subset. In addition, for the case of finite-valued alphabets, the sets and are complete and compact. (The general result (Gray:88, , Lemmas 8.2.1, 8.2.2) says that is complete and compact in case the generating set is standard; this is the case in our definition (4) but not in (5).) Proofs of these facts can be found in Gray:88 .
Chapter 2 Basic inference
In this chapter we consider some basic problems of statistical inference that underlie the rest of the problems addressed in this volume. Namely, we shall see that the distributional distance can be estimated empirically, and consider some immediate implications of this fact. On the other hand, it is shown that there is no asymptotically consistent solution to the problem of discrimination (homogeneity testing) for stationary ergodic processes.
The main results of the chapter can be summarized as follows.
- The distributional distance between stationary ergodic processes can be estimated consistently.
- There is no consistent discrimination procedure for stationary ergodic processes: no matter how long the sequences are, it is not possible to say whether they were generated by the same or different distributions.
- Based on the estimates of the distributional distance, one can solve the three-sample problem: say which two of the given three samples were generated by the same distribution.
1 Estimating the distance between processes and reconstructing a process
The main building block of the approach presented in this book is the rather simple fact that the distributional distance can be estimated empirically, simply replacing unknown probabilities with frequencies. The resulting estimate is asymptotically consistent for arbitrary stationary ergodic processes.
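Definition 1.1 (empirical distributional distance).
For samples $X,Y\in A^*$, define the empirical distributional distance as
$$\hat d(X,Y):=\sum_{k=1}^\infty w_k\,|\nu(X,B_k)-\nu(Y,B_k)|.\tag{1}$$
Similarly, we can define the empirical distance when only one of the process measures is unknown:
$$\hat d(X,\rho):=\sum_{k=1}^\infty w_k\,|\nu(X,B_k)-\rho(B_k)|,\tag{2}$$
where $\rho$ is a process distribution and $X\in A^*$ is a sample.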
The following lemma establishes consistency of these estimates.
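Lemma 1.2.
Let two samples $X=X_{1..n}$ and $Y=Y_{1..m}$ be generated by stationary ergodic processes $\rho_X$ and $\rho_Y$ respectively. Then
- (i) $\lim_{n,m\to\infty}\hat d(X,Y)=d(\rho_X,\rho_Y)$ a.s.;
- (ii) $\lim_{n\to\infty}\hat d(X,\rho_Y)=d(\rho_X,\rho_Y)$ a.s.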
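Proof 1.3.
For any $\varepsilon>0$ we can find an index $J$ such that $\sum_{i=J}^\infty w_i<\varepsilon/3$. Moreover, by the ergodic theorem, for each $j$ we have $\nu(X,B_j)\to\rho_X(B_j)$ a.s., so that, with probability 1,
$$|\nu(X_{1..n},B_j)-\rho_X(B_j)|<\frac{\varepsilon}{3Jw_j}$$
from some step $n$ on; define $N_j$ to be this step. Let $N:=\max_{j<J}N_j$ ($N$ depends on the realization $X_1,X_2,\dots$). Define $M$ analogously for the sequence $Y$. Thus, for $n>N$ and $m>M$ we have
$$\begin{aligned}|\hat d(X,Y)-d(\rho_X,\rho_Y)|&\le\sum_{i=1}^\infty w_i\Big||\nu(X,B_i)-\nu(Y,B_i)|-|\rho_X(B_i)-\rho_Y(B_i)|\Big|\\&\le\sum_{i<J}w_i\big(|\nu(X,B_i)-\rho_X(B_i)|+|\nu(Y,B_i)-\rho_Y(B_i)|\big)+\frac{\varepsilon}{3}\\&\le\sum_{i<J}w_i\Big(\frac{\varepsilon}{3Jw_i}+\frac{\varepsilon}{3Jw_i}\Big)+\frac{\varepsilon}{3}\le\varepsilon,\end{aligned}$$
where the tail $i\ge J$ is bounded by $\varepsilon/3$ because each of its terms is at most $w_i$. This proves the first statement. The second statement can be proven analogously.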
Note that the second statement of the lemma implies that a stationary ergodic process (or an ergodic component of a stationary process) can be asymptotically reconstructed from growing segments of a sequence it generates.
While we shall not make use of this fact, it is also instructive to note that memory-$k$ approximations of a stationary process $\rho$ converge to $\rho$ in distributional distance as $k$ grows. This fact is rather easy to see from the definitions.
2 Calculating $\hat d$
The expressions (1), (2) may seem impossible to calculate, since they involve infinite sums. However, as we shall see in this section, they are easy to calculate exactly and, furthermore, can be approximated using only quasilinear computational resources.
First of all, note that, for a finite sample, for finite alphabets there are only finitely many non-zero summands in (1) and (2). For real-valued alphabets, there are infinitely many non-zero summands, but most of these can be collapsed, as they have the same value.
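We proceed with the more specific versions of the empirical distributional distance, which are empirical estimates of (4) and (5). Given two samples $X$ and $Y$, let $n$ be the size of the longer sample and define
$$\hat d(X,Y):=\sum_{l=1}^{m_n}w_l\sum_{B\in A^l}|\nu(X,B)-\nu(Y,B)|\tag{3}$$
and for real-valued processes
$$\hat d(X,Y):=\sum_{m=1}^{m_n}\sum_{l=1}^{l_n}w_mw_l\sum_{B\in B^{m,l}}|\nu(X,B)-\nu(Y,B)|,\tag{4}$$
where $m_n,l_n$ are integer-valued parameters that grow to infinity with $n$.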
First of all, note that any values of that monotonically increase to infinity still give consistent estimates of the distributional distance (e.g., one can check that the argument of the proof of Lemma 1.2 is unaffected). On the other hand, if we set in (3), then the inner sum in (3) still has at most non-zero terms for and is 0 for . This makes the precise calculation of (3) at most quadratic.
Moreover, there is no reason to calculate the summands corresponding to since they are clearly not good estimates of the corresponding probabilities. In fact, it is reasonable to set of order , since longer subsamples are expected to be met at most once (see, for example, Kontoyiannis:94 ).
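The discussion above translates into a short routine for finite alphabets. The following Python sketch is an illustration, not the reference implementation: the names `word_freqs` and `empirical_distance`, the weights $1/(l(l+1))$, and the concrete truncation at $\lfloor\ln n\rfloor$ word lengths are assumptions. It exploits the observation that only words occurring in at least one of the samples contribute non-zero terms.

```python
from collections import Counter
from math import log

def word_freqs(x, l):
    """Empirical frequencies of the length-l words that occur in x."""
    n = len(x)
    if n < l:
        return {}
    counts = Counter(tuple(x[i:i + l]) for i in range(n - l + 1))
    return {w: c / (n - l + 1) for w, c in counts.items()}

def empirical_distance(x, y, max_len=None):
    """Truncated empirical distributional distance of two non-empty samples."""
    n = max(len(x), len(y))              # size of the longer sample
    if max_len is None:
        max_len = max(1, int(log(n)))    # word lengths up to order log n
    total = 0.0
    for l in range(1, max_len + 1):
        fx, fy = word_freqs(x, l), word_freqs(y, l)
        # Words absent from both samples have frequency 0 in each: skip them.
        inner = sum(abs(fx.get(w, 0.0) - fy.get(w, 0.0))
                    for w in fx.keys() | fy.keys())
        total += inner / (l * (l + 1))   # assumed weight w_l = 1/(l(l+1))
    return total

print(empirical_distance([0, 1, 0, 1], [0, 0, 0, 0], max_len=1))  # 0.5
```

This dictionary-based version counts each word length separately; suffix trees or suffix arrays, as discussed in this section, would share work across lengths.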
Similarly, for (4), let us begin by showing that calculating is fully tractable with . Observe that for fixed and , the sum
[TABLE]
has not more than nonzero terms (assuming ; the other case is obvious). Indeed, there are tuples of size in the sequence namely, and likewise for the sequence . Therefore, can be obtained by a finite number of calculations.
Furthermore, let
[TABLE]
and observe that for all and for each , for all the term is constant. That is, for each fixed we have
[TABLE]
so that we simply double the weight of the last nonzero term. (Note also that is bounded above by the length of the binary precision in representing the random variables .) Thus, even with one can calculate precisely. Moreover, for a fixed and for every sequence the frequencies may be calculated using suffix trees or suffix arrays, with worst case construction and search complexity (see, e.g., Ukkonen:95 ). Searching all occurrences of subsequences of length results in complexity. This brings the overall computational complexity of (4) to ; this can potentially be improved using specialized structures, e.g., Grossi:05 .
The parameters play the same role as in the discrete case, and so can be set to be of order for the same reason. Finally, to choose one can either fix some constant based on the bound on the precision in real computations, or choose it in such a way that each cell contains no more than points for all largest values of . Thus, we arrive at the following conclusion.
Empirical distributional distance (3), (4) is efficiently computable, and can be approximated using only quasilinear computational resources.
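For real-valued samples the same counting idea applies after quantisation. The Python sketch below bins every coordinate into cells of width $2^{-l}$, so that a window of length $m$ falls into a cube of volume $2^{-ml}$, and then counts cube indices exactly as words were counted in the discrete case. The function names, the weights $1/(j(j+1))$ and the explicit truncation parameters `max_m` and `max_l` are assumptions of this sketch.

```python
from collections import Counter
from math import floor

def cell_freqs(x, m, l):
    """Frequencies of length-m windows of x after binning each coordinate
    into cells of width 2**-l anchored at the origin."""
    n = len(x)
    if n < m:
        return {}
    scale = 2 ** l
    cells = [floor(v * scale) for v in x]
    counts = Counter(tuple(cells[i:i + m]) for i in range(n - m + 1))
    return {c: k / (n - m + 1) for c, k in counts.items()}

def weight(j):
    """Assumed weight sequence w_j = 1/(j(j+1))."""
    return 1.0 / (j * (j + 1))

def empirical_distance_real(x, y, max_m, max_l):
    """Truncated empirical distributional distance for real-valued samples."""
    total = 0.0
    for m in range(1, max_m + 1):
        for l in range(1, max_l + 1):
            fx, fy = cell_freqs(x, m, l), cell_freqs(y, m, l)
            inner = sum(abs(fx.get(c, 0.0) - fy.get(c, 0.0))
                        for c in fx.keys() | fy.keys())
            total += weight(m) * weight(l) * inner
    return total

print(empirical_distance_real([0.1, 0.1], [0.9, 0.9], 1, 1))  # 0.5
```

Only occupied cubes are stored, so the infinitely many empty cubes of each partition never enter the computation.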
3 The three-sample problem
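Let there be given three samples $X=(X_1,\dots,X_k)$, $Y=(Y_1,\dots,Y_m)$ and $Z=(Z_1,\dots,Z_n)$. Each sample is generated by a stationary ergodic process $\rho_X$, $\rho_Y$ and $\rho_Z$ respectively. Moreover, it is known that either $\rho_Z=\rho_X$ or $\rho_Z=\rho_Y$, but $\rho_X\ne\rho_Y$. We wish to construct a test that, based on the finite samples $X$, $Y$ and $Z$, will tell whether $\rho_Z=\rho_X$ or $\rho_Z=\rho_Y$.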
This problem is known under the names of three-sample problem and (process) classification. Its i.i.d. version, i.e., the case when each of the samples consists of i.i.d. random variables, is one of the classical problems of mathematical statistics (e.g., Lehmann:86 ). The case of dependent time series was considered in Gutman:89 , where a solution is presented under the finite-memory assumption. The material presented here is based on Ryabko:103s .
Essentially, the problem is to answer the question “which distribution is closer to which other distribution” based on the three samples given. The test we shall consider is doing this based on the estimates of the distributional distance.
Thus, let us consider a test that chooses the sample $X$ or $Y$ according to whichever is closer to $Z$ in $\hat d$. That is, we define the test $L(X,Y,Z)$ as follows. If $\hat d(X,Z)\le\hat d(Y,Z)$ then the test says that the sample $Z$ is generated by the same process as the sample $X$, otherwise it says that the sample $Z$ is generated by the same process as the sample $Y$.
Definition 3.1 (Process classifier).
Define the classifier $L$ as follows
$$L(X,Y,Z):=\begin{cases}x&\text{if }\hat d(X,Z)\le\hat d(Y,Z),\\ y&\text{otherwise,}\end{cases}$$
for $X,Y,Z\in A^*$.
Theorem 3.2.
The test $L(X,Y,Z)$ makes only a finite number of errors when $|X|$, $|Y|$ and $|Z|$ go to infinity, with probability 1: if $\rho_X=\rho_Z$ then
$$L(X,Y,Z)=x$$
from some $|X|,|Y|,|Z|$ on with probability 1; otherwise
$$L(X,Y,Z)=y$$
from some $|X|,|Y|,|Z|$ on with probability 1.
Proof 3.3.
From the fact that $d$ is a metric and from Lemma 1.2 we conclude that $\hat d(X,Z)\to0$ (with probability 1) if and only if $\rho_X=\rho_Z$. So, if $\rho_X=\rho_Z$ then by assumption $\rho_Y\ne\rho_Z$ and $\hat d(X,Z)\to0$ a.s. while
$$\hat d(Y,Z)\to d(\rho_Y,\rho_Z)\ne0.$$
Thus in this case $\hat d(Y,Z)>\hat d(X,Z)$ from some $|X|,|Y|,|Z|$ on with probability 1, from which moment we have $L(X,Y,Z)=x$. The opposite case is analogous.
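The classifier amounts to two distance evaluations and one comparison. The following self-contained Python sketch illustrates it on binary samples. The helper names, the weights $1/(l(l+1))$, the fixed truncation `max_len=3`, the tie-breaking in favour of the first sample, and the i.i.d. test data are all assumptions made for the illustration.

```python
import random
from collections import Counter

def word_freqs(x, l):
    """Empirical frequencies of the length-l words that occur in x."""
    n = len(x)
    if n < l:
        return {}
    counts = Counter(tuple(x[i:i + l]) for i in range(n - l + 1))
    return {w: c / (n - l + 1) for w, c in counts.items()}

def emp_dist(x, y, max_len=3):
    """Truncated empirical distributional distance (assumed weights)."""
    total = 0.0
    for l in range(1, max_len + 1):
        fx, fy = word_freqs(x, l), word_freqs(y, l)
        inner = sum(abs(fx.get(w, 0.0) - fy.get(w, 0.0))
                    for w in fx.keys() | fy.keys())
        total += inner / (l * (l + 1))
    return total

def classify(x, y, z, max_len=3):
    """Return 'x' if z is judged to share its distribution with x, else 'y'."""
    return "x" if emp_dist(x, z, max_len) <= emp_dist(y, z, max_len) else "y"

def bernoulli(p, n, rng):
    """n i.i.d. binary draws with P(1) = p (the simplest stationary ergodic case)."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

rng = random.Random(0)
x = bernoulli(0.2, 2000, rng)
y = bernoulli(0.8, 2000, rng)
z = bernoulli(0.8, 2000, rng)   # generated like y
print(classify(x, y, z))
```

The consistency statement is asymptotic: for any fixed sample size the classifier can err, and no bound on the required sample size holds uniformly over stationary ergodic processes.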
4 Impossibility of discrimination
The following problem is variously known as (process) discrimination, homogeneity testing or two-sample testing. For the asymptotic version we consider here the name process discrimination is more suited, and so this is the name we adopt in this section, reserving the name homogeneity testing for other versions.
Two series of observations $X_1,X_2,\dots$ and $Y_1,Y_2,\dots$ are presented sequentially. On each time step $n$ we would like to say whether the distributions generating the samples $X_{1..n}$ and $Y_{1..n}$ are the same or different. In this section we are after an impossibility result, so we restrict the consideration to the case of binary-valued processes.
Here we shall see that there is no asymptotically consistent discrimination procedure for the stationary ergodic processes with binary alphabet. The notion of consistency is perhaps the weakest one can think of: it is shown that for any discrimination procedure its expected answer does not converge to the correct one at least for some processes. In fact, a stronger result is established, showing that there is no asymptotically consistent discrimination procedure for a smaller set of processes, namely, that of B-processes. The class of B-processes (formally defined below) is sufficiently wide to include, for example, $k$-order Markov processes and functions thereof, but, on the other hand, it is a strict subset of the set of stationary ergodic processes.
The material of this section is after Ryabko:10discr . The additional definitions introduced (B-processes, $\bar d$-distance) as well as the proof of the main theorem are not necessary for understanding the material of the subsequent chapters.
1 Setup and definitions
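Let the alphabet be binary, $A:=\{0,1\}$. A discrimination procedure (or a homogeneity test) $D$ is a family of mappings $D_n:A^n\times A^n\to\{0,1\}$, $n\in\mathbb N$, that maps a pair of samples $X_{1..n}$, $Y_{1..n}$ into a binary (“yes” or “no”) answer: the samples are generated by different distributions, or they are generated by the same distribution.
A discrimination procedure $D$ is asymptotically consistent for a set $\mathcal C$ of process distributions if for any two distributions $\rho_X,\rho_Y\in\mathcal C$ independently generating the sequences $X_1,X_2,\dots$ and $Y_1,Y_2,\dots$ correspondingly the expected output converges to the correct answer: the following limit exists and the equality holds
$$\lim_{n\to\infty}\mathbf E\,D_n(X_{1..n},Y_{1..n})=\begin{cases}0&\text{if }\rho_X=\rho_Y,\\1&\text{otherwise.}\end{cases}\tag{7}$$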
This is perhaps the weakest notion of correctness one can consider.
Clearly, asymptotically consistent discriminating procedures exist for many classes of processes, for example for the class of all i.i.d. processes (e.g. Lehmann:86 ) and various parametric families. Indeed, for i.i.d. samples one usually requires stronger forms of consistency than the asymptotic notion considered here.
To be able to define the set of B-processes, we need to introduce another distance between process distributions, the $\bar d$ distance.
For two finite-valued stationary processes $\rho_X$ and $\rho_Y$ the $\bar d$-distance $\bar d(\rho_X,\rho_Y)$ is said to be less than $\varepsilon$ if there exists a single stationary process $\nu_{XY}$ on pairs $(X_n,Y_n)$, $n\in\mathbb N$, such that $X_n$, $n\in\mathbb N$, are distributed according to $\rho_X$ and $Y_n$ are distributed according to $\rho_Y$ while
$$\nu_{XY}(X_1\ne Y_1)\le\varepsilon.\tag{8}$$
The infimum of the $\varepsilon$’s for which a coupling can be found such that (8) is satisfied is taken to be the $\bar d$-distance between $\rho_X$ and $\rho_Y$.
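For intuition about the coupling in this definition, consider two i.i.d. binary processes with probabilities $p$ and $q$ of the symbol 1. Driving both with the same uniform random numbers yields a stationary process on pairs whose coordinates have the right marginal distributions and which disagree with probability $|p-q|$ at each step, so the $\bar d$-distance is at most $|p-q|$ in this case. The Python sketch below simulates this coupling; the function name and parameter values are ours.

```python
import random

def coupled_bernoulli(p, q, n, rng):
    """Generate n steps of a joint process whose coordinates are i.i.d.
    with P(1) = p and P(1) = q, sharing one uniform draw per step."""
    xs, ys = [], []
    for _ in range(n):
        u = rng.random()
        xs.append(1 if u < p else 0)
        ys.append(1 if u < q else 0)
    return xs, ys

rng = random.Random(0)
xs, ys = coupled_bernoulli(0.3, 0.5, 100000, rng)
mismatch = sum(a != b for a, b in zip(xs, ys)) / len(xs)
print(mismatch)  # close to |0.3 - 0.5| = 0.2
```

Generating the two coordinates independently would instead give a mismatch probability of $p(1-q)+q(1-p)$, which is larger whenever $p\ne q$ or both lie strictly between 0 and 1; the infimum in the definition rewards couplings that keep the two sequences aligned.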
Definition 4.1.
A process is called a B-process (or a Bernoulli process) if it is in the $\bar d$-closure of the set of all aperiodic stationary ergodic $k$-step Markov processes, where $k\in\mathbb N$.
For more information on $\bar d$-distance and B-processes see Ornstein:74 .
2 The main result
Theorem 4.2.
There is no asymptotically consistent discrimination procedure for the set of all B-processes.
Before presenting the proof, it is worth putting this result in the context of other results on B-processes. As mentioned in the introduction, Ornstein and Weiss Ornstein:90 construct an estimator $s_n$ such that
$$\lim_{n\to\infty}s_n(X_{1..n},Y_{1..n})=\bar d(\rho_X,\rho_Y)\quad\text{a.s.}\tag{9}$$
if both processes $\rho_X$ and $\rho_Y$ generating the samples $X$ and $Y$ respectively are B-processes. In the same work it is shown that there is no estimator for which (9) holds for every pair of stationary ergodic processes.
Comparing these results to those on distributional distance presented in the previous section (namely, Lemma 1.2), we can say that the stronger the distance the harder it is to estimate: the distributional distance can be consistently estimated for stationary ergodic processes, the $\bar d$ distance can be consistently estimated for B-processes but not for stationary ergodic processes, while the strongest possible distance, the one that gives the discrete topology, cannot be consistently estimated for B-processes, as shown in this section.
It is also worth noting that the proof given below yields a slightly stronger result, namely, the impossibility of discrimination between finite-dimensional (including single-dimensional) marginals of the processes. Specifically, correctness of the discrimination procedure (7) can be replaced with the following
[TABLE]
with the same proof carrying over.
The proof, presented below, is by contradiction. It is assumed that a consistent discrimination procedure exists, and a process is exhibited that will trick such a procedure to give divergent results. The construction on which the proof is based uses the ideas of the “random walk over the diagonal” construction used in BRyabko:88 to demonstrate that consistent prediction for stationary ergodic processes is impossible (see also its exposition in Gyorfi:98 ).
Proof 4.3.
We will assume that an asymptotically consistent discrimination procedure $D$ for the class of all B-processes exists, and will construct a B-process $\rho$ such that if both sequences $X$ and $Y$ are generated by $\rho$ then the expected output of $D$ diverges; this contradiction will prove the theorem.
The scheme of the proof is as follows. On Step 1 we construct a sequence of processes , , and , where . On Step 2 we construct a process , which is shown to be the limit of the sequence , , in -distance. On Step 3 we show that two independent runs of the process have a property that (with high probability) they first behave like two runs of a single process , then like two runs of two different processes and , then like two runs of a single process , and so on, thereby showing that the test diverges and obtaining the desired contradiction.
Assume that there exists an asymptotically consistent discriminating procedure . Fix some and , to be defined on Step 3.
Step 1. We will construct the sequence of processes , , and , where .
Step 1.0. Construct the process as follows. A Markov chain is defined on the set of states. From each state the chain passes to the state 0 with probability and to the state with probability . With transition probabilities so defined, the chain possesses a unique stationary distribution on the set , which can be calculated explicitly using e.g. (Shiryaev:96, Theorem VIII.4.1), and is as follows: , , for all . Take this distribution as the initial distribution over the states.
The function maps the states to the output alphabet as follows: for every . Let be the state of the chain at time . The process is defined as for . As a result of this definition, the process simply outputs with probability on every time step (however, by using different functions we will have less trivial processes in the sequel). Clearly, the constructed process is stationary ergodic and a B-process. So, we have defined the chain (and the process ) up to a parameter .
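The stationary distribution used in Step 1.0 can be checked directly from the balance equations. Writing $\delta$ for the probability of returning to the state 0 (our notation for this sketch), the state 0 receives a fraction $\delta$ of the mass of every state, and the state $k+1$ receives a fraction $1-\delta$ of the mass of the state $k$, which gives the geometric law with probabilities $\delta(1-\delta)^k$. The following Python sketch verifies invariance numerically on a truncated state space; the function names and the truncation level are assumptions.

```python
def geometric_distribution(delta, n_states):
    """Candidate stationary probabilities delta * (1 - delta)**k, k = 0..n_states-1."""
    return [delta * (1 - delta) ** k for k in range(n_states)]

def step(dist, delta):
    """One transition of the chain applied to a distribution on states 0..K-1.
    Mass that would move above state K-1 is dropped (truncation)."""
    nxt = [0.0] * len(dist)
    nxt[0] = delta * sum(dist)                # every state returns to 0 w.p. delta
    for k in range(len(dist) - 1):
        nxt[k + 1] = (1 - delta) * dist[k]    # otherwise move from k to k+1
    return nxt

delta = 0.1
m = geometric_distribution(delta, 400)
gap = max(abs(a - b) for a, b in zip(m, step(m, delta)))
print(gap)  # numerically zero: the geometric law is invariant
```

With 400 states the truncated tail mass is of order $0.9^{400}$, far below floating-point resolution, so the check is not affected by the truncation.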
Step 1.1.* We begin with the process and the chain of the previous step. Since the test D is asymptotically consistent we will have*
[TABLE]
from some on, where both samples and are generated by (that is, both samples consist of 1s only). Let be such an index that the chain starting from the state 0 with probability does not reach the state by time (we can take ).
Construct two processes and as follows. They are also based on the Markov chain , but the functions are different. The function is defined as follows: for and for . The function is identically (, ). The processes and are defined as and for . Thus the process will again produce only 1s, but the process will occasionally produce 0s.
Step 1.2.* Being run on two samples generated by the processes and which both start from the state 0, the test on the first steps produces many 0s, since on these first states all the functions , and coincide. However, since the processes are different and the test is asymptotically consistent (by assumption), the test starts producing 1s, until by a certain time step almost all answers are 1s. Next we will construct the process by “gluing” together and and continuing them in such a way that, being run on two samples produced by the test first produces 0s (as if the samples were drawn from ), then, with probability close to 1/2 it will produce many 1s (as if the samples were from and ) and then again 0s.*
The process is the pivotal point of the construction, so we give it in some detail. On step 1.2a we present the construction of the process, and on step 1.2b we show that this process is a -process by demonstrating that it is equivalent to a (deterministic) function of a Markov chain.
Step 1.2a.* Let be such a time index that*
[TABLE]
where the samples and are generated by and correspondingly (the samples are generated independently; that is, the process are based on two independent copies of the Markov chain ). Let be such an index that the chain starting from the state 0 with probability does not reach the state by time .
Construct the process as follows (see fig. 1).
It is based on a chain on which Markov assumption is violated. The transition probabilities on states are the same as for the Markov chain (from each state return to 0 with probability or go to the next state with probability ).
There are two “special” states: the “switch” and the “reset” . From the state the chain passes with probability to the “switch” state . The switch can itself have two values: and . If has the value then from the chain passes to the state with probability 1, while if the chain goes to , with probability 1. If the chain reaches the state then the value of is set to with probability 1/2 and with probability 1/2 it is set to . In other words, the first transition from is random (either to or to with equal probabilities) and then this decision is remembered until the “reset” state is visited, whereupon the switch again assumes the values and with equal probabilities.
The rest of the transitions are as follows. From each state , the chain passes to the state 0 with probability and to the next state with probability . From the state the process goes with probability to 0 and with probability to the “reset” state . The same with states : for the process returns to 0 with probability or goes to the next state with probability , where the next state for is the “reset” state . From the process goes with probability 1 to the state where from the chain continues ad infinitum: to the state 0 with probability or to the next state etc. with probability .
The initial distribution on the states is defined as follows. The probabilities of the states are the same as in the Markov chain , that is, , for . For the states and , define their initial probabilities to be 1/2 of the probability of the corresponding state in the chain , that is . Furthermore, if the chain starts in a state , , then the value of the switch is , and if it starts in the state then the value of the switch is , whereas if the chain starts in any other state then the probability distribution on the values of the switch is 1/2 for either or .
The function is defined as follows: for and (before the switch and after the reset); for all , and for all , . The function is undefined on and , therefore there is no output on these states (we also assume that passing through and does not increment time). As before, the process is defined as where is the state of at time , omitting the states and . The resulting process is illustrated in fig. 1.
Step 1.2b. To show that the process is stationary ergodic and a B-process, we will show that it is equivalent to a function of a stationary ergodic Markov chain; all such processes are known to be B-processes (e.g. Shields:96 ). The construction is as follows (see fig. 2). This chain has states and also and .
From the states , the chain passes with probability to the next state , where the next state for is and with probability returns to the state (and not to the state 0). Transitions for the state are defined analogously. Thus the states correspond to the state of the switch and the states — to the state of the switch. Transitions for the states are defined as follows: with probability to the state , with probability to the state , and with probability to the next state. Thus, transitions to 0 from the states with indices greater than corresponds to the reset . Clearly, the chain as defined possesses a unique stationary distribution over the set of states and for every state . Moreover, this distribution is the same as the initial distribution on the states of the chain , except for the states and , for which we have , for . We take this distribution as its initial distribution on the states of . The resulting process is stationary ergodic, and a -process, since it is a function of a Markov chain Shields:96 . It is easy to see that if we define the function on the states of as 1 on all states except , then the resulting process is exactly the process . Therefore, is stationary ergodic and a -process.
Step 1..* As before, we can continue the construction of the processes and , that start with a segment of . Let be a time index such that*
[TABLE]
where both samples are generated by . Let be such an index that when starting from the state 0 the process with probability 1 does not reach by time (equivalently: the process does not reach when starting from either or ). The processes and are based on the same process as . The functions and coincide with on all states up to the state (including the states and , ). After the function outputs 0s while outputs 1s: , for .
Furthermore, we find a time by which we have where the samples are generated by and , which is possible since is consistent. Next, find an index such that the process does not reach with probability if the processes and are used to produce two independent sequences and both start from the state 0. We then construct the process based on a (non-Markovian) process by “gluing” together and after the step with a switch and a reset exactly as was done when constructing the process . The process is illustrated in fig. 3a. The process can be shown to be equivalent to a Markov chain , which is constructed analogously to the chain (see fig. 3b). Thus, the process can be shown to be a B-process.
Proceeding this way we can construct the processes , and , choosing the time steps so that the expected output of the test approaches 0 by the time being run on two samples produced by for even , and approaches 1 by the time being run on samples produced by and for odd :
[TABLE]
and
[TABLE]
For each the number is selected in such a way that the state is not reached (with probability 1) by the time when starting from the state 0. Each of the processes , and , can be shown to be stationary ergodic and a B-process by demonstrating equivalence to a Markov chain, analogously to the Step 1.2. The initial state distribution of each of the processes is and for those for which the corresponding states are defined.
Step 2.* Having defined , we can define the process . The construction is given on Step 2a, while on Step 2b we show that is stationary ergodic and a -process, by showing that it is the limit of the sequence , .*
Step 2a.* The process can be constructed as follows (see fig. 4).*
The construction is based on the (non-Markovian) process that has states , , and for , along with switch states and reset states . Each switch diverts the process to the state if the switch has value and to if it has the value . The reset sets to with probability 1/2 and to also with probability 1/2. From each state that is neither a reset nor a switch, the process goes to the next state with probability and returns to the state 0 with probability (cf. Step 1).
The initial distribution on the states of is defined as follows. For every state such that and , , define the initial probability of the state as (the same as in the chain ), and for the sets and (for those for which these sets are defined) let (that is, 1/2 of the probability of the corresponding state of ).
The function is defined as 1 everywhere except for the states (for all for which is defined) on which takes the value 0. The process is defined at time as , where is the state of at time .
Step 2b.* To show that is a -process, let us first show that it is stationary. Recall the definition 2 of the distributional distance between (arbitrary) process distributions. The set of all stochastic processes, equipped with this distance, is complete, and the set of all stationary processes is its closed subset Gray:88 . Thus, to show that the process is stationary it suffices to show that , since the processes , , are stationary. To do this, it is enough to demonstrate that*
[TABLE]
for each . Since the processes and coincide on all states up to , we have
[TABLE]
for every and . Moreover, for any tuple we obtain
[TABLE]
where the convergence follows from . We conclude that (13) holds true, so that and is stationary.
To show that is a B-process, we will demonstrate that it is the limit of the sequence , in the $\bar d$ distance (which was only defined for stationary processes). Since the set of all B-processes is a closed subset of all stationary processes, it will follow that itself is a B-process. (Observe that this way we get ergodicity of “for free”, since the set of all ergodic processes is closed in $\bar d$ distance, and all the processes are ergodic.) In order to show that we have to find for each a process on pairs , such that are distributed according to and are distributed according to , and such that . Construct such a coupling as follows. Consider the chains and , which start in the same state (with initial distribution being ) and always take state transitions together, where if the process is in the state or , (that is, one of the states which the chain does not have) then the chain is in the state . The first coordinate of the process is obtained by applying the function to the process and the second by applying to the chain . Clearly, the distribution of the first coordinate is and the distribution of the second is . Since the chains start in the same state and always take state transitions together, and since the chains and coincide up to the state we have . Thus, , so that is a B-process.
Step 3.* Finally, it remains to show that the expected output of the test diverges if the test is run on two independent samples produced by .*
Recall that for all the chains , and as well as for the chain , the initial probability of the state 0 is . By construction, if the process starts at the state 0 then up to the time step it behaves exactly as that has started at the state 0. In symbols, we have
[TABLE]
for , where and denote the initial states of the processes generating the samples and correspondingly.
We will use the following simple decomposition
[TABLE]
From this, (14) and (11) we have
[TABLE]
For odd indices, if the process starts at the state 0 then (from the definition of ) by the time it does not reach the reset ; therefore, in this case the value of the switch does not change up to the time . Since the definition of is symmetric with respect to the values and of each switch, the probability that two samples and generated independently by (two runs of) the process produced different values of the switch when passing through it for the first time is 1/2. In other words, with probability 1/2 two samples generated by starting at the state 0 will look by the time as two samples generated by and that has started at state 0. Thus
[TABLE]
for . Using this, (15), and (12) we obtain
[TABLE]
*Taking large and small (e.g. and ), we can make the bound (16) close to 0 and the bound (18) close to 1/2, and the expected output of the test will cross these values infinitely often. Therefore, we have shown that the expected output of the test diverges on two independent runs of the process , contradicting the consistency of . This contradiction concludes the proof. *
Chapter 3 Clustering and change-point problems
In the previous chapter we have considered some basic questions of statistical inference. It was established that, when speaking about stationary ergodic processes, one can answer questions like “which distribution is closer to which” but not “are these distributions the same,” based on samples. In this chapter we shall see how these questions come into play when considering more complex problems, namely, clustering and change-point problems.
Clustering is grouping together samples generated by the same distributions, while change-point problems are concerned with delimiting parts of a sample that are generated by a single process distribution. At first glance, it seems that questions of this kind should be impossible to solve, since we cannot even answer the simple “same-different” question about distributions. However, we shall see that often, and mainly in the case when the total number of different distributions is known, these questions can be reduced to answering the “which one is closer” question, and thus admit a solution.
All the algorithms that are mentioned in this chapter do not present any significant computational challenges, perhaps except for calculating the distributional distance (see Section 2 above about that). Therefore, we omit algorithmic and implementational details; the interested reader can find these in the corresponding papers that also present experimental evaluations of the algorithms: Khaleghi:15clust for clustering and Khaleghi:12mchp ; Khaleghi:14 ; Khaleghi:15chp for change-point problems. The material in this chapter is mainly after Ryabko:10clust ; Khaleghi:15clust for clustering and Ryabko:103s for change-point problems, with some results of Khaleghi:12mchp ; Khaleghi:14 ; Khaleghi:15chp ; Ryabko:17clin given without proofs.
1 Time-series clustering
Given a finite set of objects, the problem of “clustering” similar objects together, in the absence of any examples of “good” clusterings, is notoriously hard to formalize. Most of the work on clustering is concerned with particular parametric data-generating models, or with analysing particular algorithms, a given similarity measure, and (very often) a given number of clusters. It is clear that, as in almost all learning problems, in clustering finding the right similarity measure is an integral part of the problem. However, even if one assumes the similarity measure known, it is hard to define what a good clustering is Kleinberg:02 ; Zadeh:09 . What is more, even if one assumes the similarity measure to be simply the Euclidean distance (on the plane), and the number of clusters known, then clustering may still appear intractable for computational reasons Mahajan:09 .
The problem acquires a different angle when one wishes to cluster processes. That is, each data point is itself a time-series sample. This version of the problem has numerous applications, such as clustering biological data, financial observations, or behavioural patterns, and as such it has gained tremendous attention in the literature.
A crucial observation to make in the case of clustering processes is that one can benefit from the notion of ergodicity to define what appears to be a very natural notion of consistency. Ergodicity means that the distribution of a sample can be determined asymptotically, or approximated arbitrarily well if the sample is long enough. This makes the following goal achievable. {svgraybox} Given samples , each drawn by one out of unknown process distributions, group together those and only those samples that were generated by the same distribution.
The samples are not assumed to be drawn independently; rather, it is assumed that the joint distribution of the samples is stationary ergodic. The target clustering is as follows: those and only those samples are put into the same cluster that were generated by the same distribution. A clustering algorithm is called asymptotically consistent if it outputs only the correct answer with probability 1 from some on, where is the length of the shortest sample. Note the particular asymptotic regime: not with respect to the number of samples , but with respect to the length of the samples .
Clearly, the problem of clustering in this formulation is a direct generalisation of the three-sample problem of Section 3. Indeed, the latter problem can be seen as clustering samples into clusters, where is given. At the same time, the discrimination problem of Section 4 can be seen as clustering samples into either or clusters, with unknown.
Anticipating, from this we can already see when it is possible and when it is not possible to have a consistent algorithm for clustering stationary ergodic time series. {svgraybox} There exists a consistent algorithm for clustering stationary ergodic time series if and only if the number of clusters is known.
We proceed below with a more formal problem formulation and the exposition of the algorithm.
1 Problem formulation
The clustering problem can be defined as follows. samples are given, where each sample is of length : . The samples are generated by a distribution over , that is, a distribution that generates an infinite sequence of -tuples.
[TABLE]
The marginal distribution of each sequence is one out of different (and unknown) stationary ergodic distributions . Note that we allow the samples to be dependent; the only requirement is on the marginal distributions (they should be stationary ergodic). Thus, there is a partitioning of the set into disjoint subsets
[TABLE]
such that , is generated by if and only if . The partitioning is called the target (or ground-truth) clustering and the sets , are called the target clusters. Given samples and a target clustering , let denote the cluster that contains .
A clustering function takes a finite number of samples and a parameter (the target number of clusters) and outputs a partition of the set .
Definition 1.1** (asymptotic consistency).**
Let a finite number of samples be given, and let the target clustering partition be . Define . A clustering function is strongly asymptotically consistent if
[TABLE]
from some on with probability 1. A clustering function is weakly asymptotically consistent if
[TABLE]
Note that the consistency is asymptotic with respect to the minimal length of the sample, and not with respect to the number of samples.
2 A clustering algorithm and its consistency
Here we present an algorithm that is shown to be asymptotically consistent in the general framework introduced. What makes this simple algorithm interesting is that it requires only distance calculations (where is the number of clusters), that is, much less than is needed to calculate the distance between each two sequences.
In short, Algorithm 1 initialises the clusters using farthest-point initialisation, and then assigns each remaining point to the nearest cluster. More precisely, the sample is assigned as the first cluster centre. Then a sample is found that is farthest away from in the empirical distributional distance and is assigned as the second cluster centre. For each the cluster centre is sought as the sequence with the largest minimum distance from the already assigned cluster centres for . By the last iteration we have cluster centres. (This initialisation procedure was proposed in Katsavounidis:94 in the context of -means clustering.) Next, the remaining samples are each assigned to the closest cluster.
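The procedure just described can be sketched in a few lines. This is only an illustration under stated assumptions, not the implementation evaluated in the cited papers: the samples are taken to be finite-alphabet sequences, and `emp_dist` is a hypothetical truncated variant of the empirical distributional distance that compares frequencies of words of length up to `max_m` with assumed weights `2**-m`. The names `word_freqs`, `emp_dist` and `cluster` are introduced here for illustration.

```python
from collections import Counter

def word_freqs(x, m):
    """Empirical frequencies of all length-m words occurring in the sequence x."""
    total = len(x) - m + 1
    if total <= 0:
        return {}
    counts = Counter(tuple(x[i:i + m]) for i in range(total))
    return {w: c / total for w, c in counts.items()}

def emp_dist(x, y, max_m=3):
    """Assumed truncated empirical distributional distance (weights 2**-m)."""
    d = 0.0
    for m in range(1, max_m + 1):
        fx, fy = word_freqs(x, m), word_freqs(y, m)
        diff = sum(abs(fx.get(w, 0.0) - fy.get(w, 0.0)) for w in set(fx) | set(fy))
        d += 2.0 ** (-m) * diff
    return d

def cluster(samples, k, dist=emp_dist):
    """Farthest-point initialisation followed by nearest-centre assignment."""
    centres = [0]  # the first sample is the first cluster centre
    # to_centre[j][i] is the distance from sample i to the j-th centre
    to_centre = [[dist(samples[0], s) for s in samples]]
    while len(centres) < k:
        # next centre: the sample with the largest minimum distance
        # to the centres chosen so far
        nxt = max(range(len(samples)),
                  key=lambda i: min(row[i] for row in to_centre))
        centres.append(nxt)
        to_centre.append([dist(samples[nxt], s) for s in samples])
    clusters = [[] for _ in centres]
    for i in range(len(samples)):
        nearest = min(range(len(centres)), key=lambda j: to_centre[j][i])
        clusters[nearest].append(i)
    return clusters
```

Only the distances from each chosen centre to every sample are computed, so the number of distance evaluations is the number of clusters times the number of samples, rather than one evaluation per pair of samples.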
Theorem 1.2**.**
*Algorithm 1 is strongly asymptotically consistent provided that the correct number of clusters is known, and the marginal distribution of each sequence is stationary ergodic. *
The main idea of the proof is as follows. Lemma 1.2 implies that, if the samples in are long enough, the samples that are generated by the same process distribution are closer to each other than to the rest of the samples. Therefore, the samples chosen as cluster centres are each generated by a different process distribution. The theorem then follows from the fact that the algorithm assigns the rest of the samples to the closest clusters.
Proof 1.3**.**
Let denote the shortest sample length in :
[TABLE]
Denote by the minimum nonzero distance between the process distributions:
[TABLE]
Fix . Since there are a finite number of samples, by Lemma 1.2 for all large enough we have
[TABLE]
where denote the ground-truth partitions. By (2) and applying the triangle inequality we obtain
[TABLE]
Thus, for all large enough we have
[TABLE]
where the first inequality follows from the triangle inequality, and the second inequality follows from (2) and the definition of . In words, (3) and (1.3) mean that the samples in that are generated by the same process distribution are closer to each other than to the rest of the samples. Finally, for all large enough to have (3) and (1.3) we obtain
[TABLE]
*where, as specified by Algorithm 1, and . Hence, the indices will be chosen to index sequences generated by different process distributions. To derive the consistency statement, it remains to note that, by (3) and (1.3), each remaining sequence will be assigned to the cluster centre corresponding to the sequence generated by the same distribution. *
3 Extensions: unknown , online clustering and clustering with respect to independence
In this section we briefly consider several extensions and modifications of the process clustering problem. The problems are only outlined, and the details are left out; the interested reader is referred to the corresponding papers that treat each of these problems in detail.
Unknown number of clusters
As mentioned in the beginning of this section, if the number of clusters is unknown, then the problem provably has no solution. Thus, if we really want to have a consistent algorithm that does not require , then something has to give. Sacrificing the generality is one way of doing it. Clearly, if we assume that the speed of convergence of frequencies has a known upper bound, as is the case when time series are i.i.d. or mixing (with a bound on the mixing coefficient), then everything becomes possible. The resulting time-series clustering problem is still interesting, but clearly falls out of the scope of this volume. A simple example of an algorithm that is consistent in this setting can be found in Ryabko:10clust ; Khaleghi:15clust . It is worth noting that it remains open to establish tight upper and lower bounds on the error probability of clustering algorithms even for the case of i.i.d. time series.
Online clustering
An interesting and practical modification of the clustering problem consists in taking it “online.” On each time step, new samples are revealed, which can be either a continuation of some of the time series available on the previous steps, or form a new time series. The asymptotic setting requires that the length of each time series should grow to infinity, as should the number of time series, though they may do so in an arbitrary manner. As before, the only requirement we would like to make is that the marginal distribution of each of the processes is stationary and ergodic. There are only different marginal distributions; the number of these distributions is known, but this is all the information we get.
Let us describe the problem a little more formally. Consider the two-way infinite matrix of -valued random variables
[TABLE]
generated by some probability distribution on , where is the corresponding Borel sigma-algebra. The matrix can be seen as an infinite sequence of infinite sequences; since is a standard probability space, so is and thus is well-defined (e.g., Gray:88 ).
Assume that the marginal distribution of on each row of is one of unknown stationary ergodic process distributions . Thus, the matrix corresponds to infinitely many one-way infinite sequences, each of which is generated by a stationary ergodic distribution. Aside from this assumption, we do not make any further assumptions on the distribution that generates . This means that the rows of (corresponding to different time-series samples) are allowed to be dependent, and the dependence can be arbitrary; one can even think of the dependence between samples as adversarial. For notational convenience we assume that the distributions are ordered based on the order of appearance of their first rows (samples) in .
As in the offline setting, the ground-truth partitioning of is defined by grouping the rows that have the same marginal distribution. Let
[TABLE]
be a partitioning of into disjoint subsets , such that the marginal distribution of , is for some if and only if . The partitioning is called the ground-truth clustering.
Introduce also the notation for the restriction of to the first sequences:
[TABLE]
At every time step , a part of is observed corresponding to the first rows of , each of length , i.e.
[TABLE]
We assume that the number of samples, as well as the individual sample-lengths grow with time. That is, the length of each sequence is nondecreasing and grows to infinity (as a function of time ). The number of sequences also grows to infinity. Aside from these assumptions, the functions and are completely arbitrary.
An algorithm is called asymptotically consistent in the online setting, if, for every w.p.1 from some point on the clustering output by the algorithm coincides with the ground-truth on the first samples, i.e. .
It turns out that this setting admits a consistent clustering algorithm.
Theorem 1.4**.**
*There exists an algorithm that is asymptotically consistent in the online setting, provided that the marginal distribution of each sequence is stationary ergodic. *
The proof of this theorem, along with the corresponding algorithm, can be found in Khaleghi:15clust .
It is worth noting that the main challenge in constructing such an algorithm is the fact that, on every time step , we do not know whether all of the different distributions are already present, or the are generated by fewer than different distributions. The solution is based on a weighted average of clusterings, each constructed based on the first rows, with carefully selected weights.
Clustering with respect to independence
The clustering problem considered in the previous sections may be seen as clustering with respect to distribution: putting together those and only those samples that are generated by the same distribution. Another way to look at clustering time series is grouping them with respect to (in)dependence. Thus, the problem is as follows. {svgraybox} Given a set of samples, it is required to find the finest partitioning of into clusters such that the clusters are mutually independent.
The formal model is the same as in clustering with respect to distribution: the probability distribution is that on the space of infinite sequences of -tuples (1). However, in this setting we require the joint distribution to be stationary ergodic, whereas before we only had to put this constraint on the marginal distribution of the samples.
What makes this problem very different from the previous one, and, in fact, from the rest of the problems considered in the clustering literature, is that, since mutual independence is the target, pairwise similarity measurements are of no use. Therefore, traditional clustering algorithms are inapplicable, since they are based on calculating some distance between pairs of objects (in the case of the previous sections, time-series samples).
Thus, to solve this problem we have to go back to first principles and first consider what we should do if the joint distribution of all the samples is known. After that, it is instructive to consider i.i.d. samples, before turning to stationary ergodic distributions. While a detailed consideration of this problem takes us outside the scope of this volume, it is worth mentioning here in which cases a solution to this problem exists, and some ideas behind it.
For stationary ergodic distributions a consistent algorithm can be constructed provided the correct number of clusters is known. The algorithm is based on calculating empirical estimates of the following measure of independence between groups of samples. In the expression below, stands for Shannon entropy, and is a quantization of the random variable in question to the cells of a partition similar to but finite.
Definition 1.5** (sum-information).**
For stationary processes define the sum-information
[TABLE]
This quantity has certain similarities to the distributional distance: it is also a weighted sum of certain discrepancies between marginal distributions of growing dimension. However, instead of simple differences in probabilities, we are using entropy, and whereas before we were considering only pairs of random variables, here we have generalized this to groups of arbitrary sizes. Note also that this is not an estimator but a theoretical quantity; to estimate it empirically, one replaces the probabilities with the corresponding frequencies.
The details of the algorithms and proofs can be found in Ryabko:17clin . It is worth noting that the online version of this problem (akin to the one considered in Section 3) so far remains unexplored.
2 Change-point problems
Change-point problems are concerned with sequences in which the distribution of the data changes over time in an abrupt manner. The latter means that the sequence can be divided into segments, such that each segment is generated by a single time-series distribution, and between the segments the distributions are different.
It is another classical problem, with vast literature on both parametric (see e.g. basseville:93 ) and non-parametric (see e.g. brodsky:93 ) methods for solving it. As usual in statistics, most literature deals with the case of i.i.d. data within each segment, with generalisations to dependent data reaching up to and including distributions with mixing brodsky:93 ; Giraitis:95 . The important exception is the work Carlstein:93 , which considers stationary ergodic sequences. The latter work makes a further assumption that the single-dimensional marginals (of ) before and after the change point are different. As was shown in Ryabko:103s , this assumption is not necessary; here, as in the preceding sections, we follow this latter approach.
Change-point problems can be roughly divided into estimation problems and detection problems. To better explain this, consider the case of a single change. A sample is given, where, for a certain , are generated according to some distribution and are generated according to some distribution . Change-point estimation is about finding the parameter (or, equivalently, the change point ), knowing that it exists, that is, knowing that . On the other hand, detection problems are concerned with determining whether there is a change point in the first place, that is, finding out whether . Various formulations exist, mainly focusing on detecting the change quickly after it appears.
Given the results of the preceding sections, it should be clear at this point that if all we know is that all the distributions in question are stationary and ergodic, then it is, in general, not possible to tell whether there is a change point in the sequence or not. Thus, we will be only concerned with change-point estimation problems.
Another point that needs to be clarified is the asymptotic regime that we are using. We are working with a single sample of a fixed size, ,
[TABLE]
yet the statements will be about what happens when goes to infinity. In fact, we are talking about two samples whose lengths grow to infinity. If we imagine them being stuck together and each increasing in length to the right, then this would somehow make the change point obvious each time the length of the sample to the left of it increments. This is why we are not considering an “online” setting where the samples would grow. Rather, we are considering only an “offline” version, where the sample is fixed. In this setting, saying that, for example, the estimate approaches as grows to infinity simply means that for large enough , is arbitrarily close to , and does not mean that the algorithm is dealing with samples of increasing sizes.
An important constraint, which is present in one way or another in all the change-point models, is on how far a change point can be from the boundaries of the samples. Indeed, if, say is generated by one distribution but already by another, so the change point occurs at time step 2, then hardly any algorithm can make any meaningful inference. A common way to tackle this is to require the size of each segment (generated by a single distribution) to be linear in the length of the whole combined sample, . This is made explicit in the formulations we adopt, where we refer to change points as , and the goal is to estimate . Moreover, given the fact that there are no speeds of convergence available for (stationary) ergodic distributions, this requirement is essential, since the initial part of any sample can be effectively arbitrary, whatever function one assumes in that . Thus, we can state the following. {svgraybox} Consistent change-point estimation algorithms for stationary ergodic processes are only possible under the constraint that the length of each segment generated by a single distribution is linear in the total sample size .
In this chapter we treat in detail the case of a single change point. Extensions to multiple change points are given without proofs, referring the interested reader to the corresponding papers. However, it is worth noting that, in spite of the impossibility of discriminating between processes and thus detecting a change point, the case of an unknown number of change points is not entirely hopeless, and in fact in some cases admits a solution that does not require putting further restrictions on the distributions generating the data.
1 Single change point
The sample is the concatenation of two parts and , where , so that for and for . The samples and are generated by two different stationary ergodic processes with alphabet . The distributions of the processes are unknown. The value is called the change point. Moreover, in this first setting, we assume that is bounded away from 0 and from 1 with known upper and lower bounds: for some known (for sufficiently large ). In the next setting we shall discuss how to get rid of this assumption.
It is required to estimate (or, equivalently, the change point ) based on the sample .
For each , , denote the sample consisting of the first elements of the sample , and denote the remainder .
Definition 2.1** (Change point estimator).**
Define the change-point estimate as follows:
[TABLE]
The following theorem establishes asymptotic consistency of this estimator.
Theorem 2.2**.**
For the estimate of the change point we have
[TABLE]
*where is the size of the sample. *
Proof 2.3**.**
Denote . To prove the statement, we will show that, for every , , with probability 1 the inequality holds for each such that , possibly except for a finite number of times (in ). Thus we will show that linear -underestimates occur only a finite number of times, and the argument for overestimates is analogous. Fix some , and . Let be big enough to have and also big enough to have an index for which . Take large enough to have for all and for each , , and also to have for each , . This is possible since empirical frequencies converge to the limiting probabilities a.s.; note that depends on (cf. the proof of Lemma 1.2). Find a (that depends on ) such that for all and for all , we have
[TABLE]
(this is possible simply because ). Furthermore, we can select large enough to have
[TABLE]
for each : this follows from (7) and the identity
[TABLE]
So, for each we have
[TABLE]
for and (from the definitions of and ). Hence
[TABLE]
for some that depends only on and . Summing over all , , we get
[TABLE]
*for all such that and , which is positive for small enough . *
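The estimator of Definition 2.1 can be illustrated by the following sketch. It assumes that the estimate is the split point maximising the empirical distributional distance between the part of the sample before the split and the part after it, with the split kept at least a fraction `alpha` of the sample length away from either end. The distance `emp_dist` is a hypothetical truncated variant (word lengths up to `max_m`, assumed weights `2**-m`) for finite-alphabet sequences; all function names are introduced here for illustration.

```python
from collections import Counter

def word_freqs(x, m):
    """Empirical frequencies of all length-m words occurring in the sequence x."""
    total = len(x) - m + 1
    if total <= 0:
        return {}
    counts = Counter(tuple(x[i:i + m]) for i in range(total))
    return {w: c / total for w, c in counts.items()}

def emp_dist(x, y, max_m=3):
    """Assumed truncated empirical distributional distance (weights 2**-m)."""
    d = 0.0
    for m in range(1, max_m + 1):
        fx, fy = word_freqs(x, m), word_freqs(y, m)
        diff = sum(abs(fx.get(w, 0.0) - fy.get(w, 0.0)) for w in set(fx) | set(fy))
        d += 2.0 ** (-m) * diff
    return d

def estimate_change_point(x, alpha, dist=emp_dist):
    """Return the estimated relative position of the change point.

    Assumes 0 < alpha < 0.5 and len(x) >= 2. Only splits that leave at least
    a fraction alpha of the sample on each side are considered.
    """
    n = len(x)
    margin = max(1, int(alpha * n))
    # split t means: first part x[:t], remainder x[t:]
    best_t = max(range(margin, n - margin + 1),
                 key=lambda t: dist(x[:t], x[t:]))
    return best_t / n
```

Recomputing the frequencies for every split makes this sketch quadratic in the sample length; it is meant to show the structure of the estimator, not an efficient way to compute it.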
2 Multiple change points, known number of change points
The following generalization is considered in this section. First, the number of change points is allowed to be arbitrary, though it still has to be known. Second, we get rid of the assumption that there is a known lower bound on the distance between a change point and the sequence boundaries (its start and its end), as well as between change points.
The details of the algorithm and the proof of its consistency are omitted and can be found in Khaleghi:15chp .
The problem is as follows. A sample
[TABLE]
is given, which is formed as the concatenation of non-overlapping segments, where and . Each segment is generated by some unknown process distribution. The distributions that generate every pair of consecutive segments are different. The parameters specifying the change points are unknown and have to be estimated. The distributions that generate the segments are unknown, but are assumed to be stationary ergodic. A formal probabilistic model for this process is via considering the matrix of random variables (1), where the marginal distribution of each row is stationary ergodic. The sample is then formed by concatenating parts of these rows.
Denote for convenience and and define the minimal distance between change points as
[TABLE]
Let us first assume that there is a known lower bound on this parameter: . Then, knowing this lower bound and the number of change points , one can construct a consistent algorithm as follows.
Break the whole sample into short consecutive segments each of which cannot contain more than one change point (the actual algorithm, proposed in Khaleghi:15chp , uses segments of length ). Find a candidate change point in each of the segments, using the single-change-point algorithm of the previous section. Then, select of these candidate change-points that maximize the following scoring function. The scoring function takes an arbitrary segment in the sample and measures how close to each other (in the distributional distance) its first and second halves are
[TABLE]
The reason the algorithm works is as follows. The single-change-point estimates are consistent in the case there is exactly one change point in the segment they are applied to; this can be demonstrated in the same way the single-change-point estimator was proven consistent in the previous section. Next, the segments that do not contain any change point will see their score (10) converge to 0, while the score of those that do contain a change point converges to a non-zero constant. Since we know how many change points there are, it suffices to select the highest-scoring ones.
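A simplified sketch of this scheme is given below. It is not the algorithm of Khaleghi:15chp : here the sample is cut into non-overlapping windows of a user-supplied length `seg_len` (an assumption made for brevity), each window is scored by the distance between its two halves, and the candidates from the `kappa` highest-scoring windows are returned. A change point falling exactly on a window boundary would be missed by this simplification, and any trailing part shorter than `seg_len` is ignored. The distance is a hypothetical truncated variant of the empirical distributional distance, and all names are illustrative.

```python
from collections import Counter

def word_freqs(x, m):
    """Empirical frequencies of all length-m words occurring in the sequence x."""
    total = len(x) - m + 1
    if total <= 0:
        return {}
    counts = Counter(tuple(x[i:i + m]) for i in range(total))
    return {w: c / total for w, c in counts.items()}

def emp_dist(x, y, max_m=3):
    """Assumed truncated empirical distributional distance (weights 2**-m)."""
    d = 0.0
    for m in range(1, max_m + 1):
        fx, fy = word_freqs(x, m), word_freqs(y, m)
        diff = sum(abs(fx.get(w, 0.0) - fy.get(w, 0.0)) for w in set(fx) | set(fy))
        d += 2.0 ** (-m) * diff
    return d

def split_point(seg, alpha):
    """Single-change-point candidate inside seg (index relative to seg)."""
    n = len(seg)
    margin = max(1, int(alpha * n))
    return max(range(margin, n - margin + 1),
               key=lambda t: emp_dist(seg[:t], seg[t:]))

def estimate_change_points(x, kappa, seg_len, alpha=0.1):
    """Return kappa candidate change points (absolute indices), sorted."""
    scored = []
    for a in range(0, len(x) - seg_len + 1, seg_len):
        seg = x[a:a + seg_len]
        half = seg_len // 2
        # score: how far apart the two halves of the window are
        score = emp_dist(seg[:half], seg[half:])
        scored.append((score, a + split_point(seg, alpha)))
    # keep the candidates from the kappa highest-scoring windows
    top = sorted(scored, reverse=True)[:kappa]
    return sorted(c for _, c in top)
```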
The next step is to get rid of the requirement of a known bound on . This is done by constructing a series of -tuples of change-point estimators, each for a different value of a candidate , which are then combined with carefully selected weights. This gives the following theorem.
Theorem 2.4**.**
*There exists an algorithm for finding change points that is asymptotically consistent provided each segment is generated by a stationary ergodic distribution and is known. *
{svgraybox}
For stationary ergodic time series, asymptotically consistent estimation of multiple change points whose number is known is possible without any extra assumptions, besides that the length of each segment is linear in the sample size .
3 Unknown number of change points
The result on impossibility of process discrimination (Section 4) implies that it is provably impossible to distinguish between the cases of 0 and 1 change point for stationary ergodic samples. Yet, it appears impractical to assume that the exact number of change points is given to an algorithm. Thus, a search for other, more constrained, formulations is warranted. Two such formulations are briefly considered here: providing an exhaustive list of change points, and the case of a known number of different distributions but an unknown number of change points. The details of the algorithms and proofs are left out, and can be found in the corresponding papers Khaleghi:12mchp ; Khaleghi:14 . In both of these formulations we assume a known lower bound on the distance between the change points (9).
It is worth making a distinction with the related problem of clustering. In that problem, if the number of clusters is unknown, then all we can do is to resort to more restrictive assumptions on the process distributions (see Section 3). On the other hand, for the change-point problem, it is still possible to get around the fact that the number of change points is unknown, while only assuming that the process distributions are stationary ergodic. Specifically, one formulation that allows us to do it is assuming that the total number of distributions is known (Section 3 below). Indeed, the number of distributions defines the number of clusters in the clustering problem, but, in change-point problems, still allows the number of change points to be arbitrary.
Listing change points
Not knowing the number of change points, one could try to provide a ranked list of change points, that should include all the “true” change points, and possibly also other, spurious, points. The longest such list could be is , the size of the sample; or if we assume that the minimum distance between the change points is lower-bounded by . Such a ranked list could be useful if we knew that the first listed change points are the true change points, even if the rest of the listed points are extraneous. It turns out that this is indeed achievable. The algorithm is very similar to the one of the preceding section, with (10) used as a ranking function and the single-change-point algorithm used to find candidate change points in each of the segments. The result one can obtain is thus the following Khaleghi:12mchp .
Theorem 2.5**.**
*There exists an algorithm that, given a sample (8) generated by stationary ergodic distributions, provides a list of change-point candidates which has the property that, with probability 1 as goes to infinity, from some on its first elements are within of the change points , . *
Known number of distributions, unknown number of change points
A sample with change points can be, in general, generated by different distributions. However, it can be generated by fewer distributions too, for example, by two distributions only irrespective of the value of . This formulation with the total number of distributions smaller than may make sense in various applications. For example, imagine a text written by two authors each of which wrote many different parts of the text. Here the number of distributions is 2 and is known a priori, but the number of change points may be large and unknown.
It turns out that if the number of change points is unknown, it is still possible to locate them, if the total number of distributions generating the segments is known. Here as well we assume a known lower bound on the minimal distance between change points. The algorithm starts by producing an exhaustive list of change points with the algorithm of the previous section. It then clusters all the resulting segments of the sample into clusters, where is the number of different distributions that is assumed given. The clustering algorithm can be chosen to be that of Section 2, with as the target number of clusters. This results in the following statement.
Theorem 2.6**.**
There exists an algorithm that, given a sample (8) generated by different stationary ergodic distributions, the number and a lower-bound on , provides an estimate and a list of change points estimates , that are asymptotically consistent:
[TABLE]
and
[TABLE]
*for *
The details of the algorithm and proofs can be found in Khaleghi:14 .
In conclusion, we can formulate the following statement. For stationary ergodic distributions generating the data between change points whose number is unknown, it is possible to find the correct number and provide consistent estimates of the change points if (and only if) the total number of different distributions is known.
Chapter 4 Hypothesis testing
Given a sample , we wish to decide whether it was generated by a distribution belonging to a family , versus it was generated by a distribution belonging to a family . As before, the only assumption we are willing to make about the distribution generating the sample is that it is stationary ergodic.
In this chapter we assume that are from a finite alphabet . Moreover, unlike in the previous chapters, in this one we shall delve a little deeper into the theory of stationary processes, and use some of its facts other than the simple convergence of frequencies. In particular, it will be of essence that the space of stationary processes is compact with the topology of the distributional distance: a fact that holds for finite-alphabet processes with distance (4) but not for real-valued processes with the distance (5).
The material of this chapter mainly follows Ryabko:121c ; Ryabko:141u .
1 Introduction
A test is a function that takes a sample and gives a binary (possibly incorrect) answer: the sample was generated by a distribution from or from . An answer is correct if the sample is generated by a distribution that belongs to , and otherwise the test is said to make an error. It often makes sense to distinguish between two types of error, depending on which of the hypotheses holds true. Thus, we say that the test makes a Type I error if is true but the test says is true, and we say that the test makes a Type II error if the opposite takes place: the test says while is true. Note that in case neither nor holds true the output of the test may be arbitrary and we are not speaking about any kind of error; generally, one cannot say anything about the behaviour of the test in such a case.
Here we are concerned with the general question of characterizing those pairs of and for which consistent tests exist.
Several notions of consistency are considered. For two of these notions of consistency we find some necessary and some sufficient conditions for the existence of a consistent test, expressed in topological terms. The topology is that of distributional distance in the form (4). For one notion of consistency, namely, for asymmetric consistency, the necessary and sufficient conditions coincide when is the complement of , thereby providing a complete characterization. This suggests that the topology of the distributional distance is indeed the right one to study these problems.
Each of the notions of consistency considered has been studied extensively (sometimes in slightly different formulations) for i.i.d. data. It is thus instructive to provide characterisations of those hypotheses for which consistent tests exist for this more restrictive model and see how it relates to the general case of stationary ergodic time series, which we do in this chapter whenever possible.
In the rest of this section we consider various examples of the problem of hypothesis testing that motivate studying it in the general form; we also introduce various notions of consistency used. In the next section, a simple example of hypothesis testing is considered in some detail, exposing various concepts used, including the notions of consistency, the topological criteria for consistency in simpler spaces and the role of the ergodic decomposition.
1 Motivation and examples
Before introducing the definitions of consistency, let us give some examples motivating the general problem in question. Most of these examples are classical problems studied in mathematical statistics and related fields, mostly for i.i.d. data, with much literature devoted to each of them. The classical Neyman-Pearson formulation of the hypothesis testing problem is testing a simple hypothesis versus a simple hypothesis , where and are two distributions that are completely known. A more complex but more realistic problem is when only one of the hypotheses is simple, but the alternative is general, for example, in our framework could be the set of all stationary ergodic processes that are different from . This is the so-called goodness-of-fit or identity-testing problem. Here would typically be some specific distribution of interest, such as the Bernoulli i.i.d. distribution with equal probabilities of outcomes.
Generalizing the latter example is the class of hypothesis testing problems that can be described as model verification problems. Suppose we have some relatively simple (possibly parametric) set of assumptions, and we wish to test whether the process generating the given sample satisfies these assumptions. As an example, can be the set of all -order Markov processes (fixed ) and is the set of all stationary ergodic processes that do not belong to ; one may also wish to consider more restrictive alternatives, for example is the set of all -order Markov processes for some . Of course, instead of Markov processes one can consider other models, e.g. hidden Markov processes. A similar problem is that of testing that the process has entropy less than some given versus its entropy exceeds , or versus its entropy is greater than for some positive .
Yet another type of hypothesis testing problem concerns property testing. Suppose we are given two samples, generated independently of each other by stationary ergodic distributions, and we wish to test the hypothesis that they are independent versus they are not independent. Or, that they are generated by the same process versus they are generated by different processes.
In all the considered cases, when the hypothesis testing problem turns out to be too difficult (i.e. there is no consistent test for the chosen notion of consistency) for the case of stationary ergodic processes, one may wish to restrict either , or both and to some smaller class of processes. Thus, one may wish to test the hypothesis of independence when, for example, both processes are known to have finite memory, but the alternative is allowed to be general: the complement of the set to the set of stationary ergodic processes (on pairs).
2 Types of consistency
There are different types of consistency of tests, corresponding to how strong a guarantee one wishes to have on the probability of error. Three notions of consistency are considered here: uniform, asymmetric (or -level), and asymptotic consistency. They represent different trade-offs between the strength of the guarantees one can obtain and the generality of hypotheses pairs for which consistent tests exist.
1 Uniform consistency
We start with what appears to be the strongest notion, uniform consistency. It requires both probabilities of error to be uniformly bounded. More precisely, uniform consistency requires that for each there exist a sample size such that the probability of error is upper-bounded by for samples longer than .
Definition 2.1** (uniform consistency).**
*A test is called uniformly consistent if for every there is an such that for every the probability of error on a sample of size is less than : for every and every . *
This notion of consistency has been extensively studied in the algorithms community for i.i.d. data under a slightly different formulation: the probability of each error is required to be bounded by a fixed number, typically 1/3, and the problem is to find minimal sample sizes necessary to achieve this error. The interpretation is that if one can get 1/3 probability of error then one can make it arbitrarily small by taking more (independent) samples; see, for example, Goldreich:98 ; Batu:01 ; Batu:04 ; Guha:06 . The definition above is adapted for dependent data.
For i.i.d. samples it is easy to establish a criterion for the existence of a consistent test: there exists a uniformly consistent test if and only if are contained in closed non-overlapping sets. Here the topology is just that of the Euclidean distance on the space of parameters defining the distributions over . Indeed, to see that the condition is necessary, it is enough to notice that the sets of distributions satisfying are closed for any fixed and , in particular for . On the other hand, to construct a test it is enough to take a neighbourhood over (say) of radius that slowly decreases with : for large enough the neighbourhood will not intersect (since both sets are closed), and one can use concentration of measure results for i.i.d. distributions to show that if the radius decreases slowly enough then the test is consistent. From this description it is clear that some generalizations to processes with mixing are possible. See also Csiszar:04 for related results (for i.i.d. data).
2 Asymmetric consistency
The next notion of consistency is the classical one used in mathematical statistics (e.g., Lehmann:86 ; Kendall:61 ): the probability of Type I error is fixed at the given level , and the probability of Type II error goes to 0. This definition is well-suited for pairs of hypotheses that are by nature asymmetric, such as a singleton , or hypotheses where is the complement of , for example, “the distribution belongs to a given parametric model” versus “it is stationary ergodic but not in the model,” or the examples considered in this work: “distributions generating a pair of samples are independent” versus they are not, or “distributions are the same” versus they are not. The definition is as follows.
Definition 2.2** (Asymmetric consistency).**
Call -level test asymmetrically consistent as a test of against if:
- (i)
The probability of Type I error is always bounded by : for every , every and every , and
- (ii)
Type II error is made not more than a finite number of times with probability 1: for every and every .
Similarly to the case of uniform consistency, here it is easy to see what the criterion is for i.i.d. samples. There exists an asymmetrically consistent test if and only if does not intersect ; see the next section for a more detailed explanation, and also Csiszar:04 for this and related results.
3 Asymptotic consistency
Finally, what appears to be the weakest notion of consistency is perhaps the simplest to formulate: the error (of each type) has to be made finitely many times w.p.1.
Definition 2.3** (asymptotic consistency).**
*A test is called asymptotically consistent if for every , we have -a.s. *
This weakest notion of consistency gives the strongest negative results, which is why we used it in Section 4 to show that there is no consistent test for homogeneity (process discrimination).
For real-valued i.i.d. samples, where the hypotheses are formulated about the means of the distributions, this notion has been studied in Dembo:94 . For the case of distributions with finite moments with the following criterion is obtained: there exists an asymptotically consistent test if and only if and are contained in disjoint $F_\sigma$ sets. (A set is $F_\sigma$ if it is a countable union of closed sets.) It can be seen that the same criterion holds in our case (finite-valued distributions) if the samples are i.i.d.
This notion of consistency has been given considerable attention in the time-series literature, perhaps because it is rather weak and thus appears more suited for time-series analysis. In particular, some specific hypotheses have been studied in Ornstein:90 ; Morvai:05 . For the general case of stationary ergodic distributions, Nobel:06 obtains a generalization of the results of Dembo:94 , providing some sufficient conditions for the existence of a consistent test for real-valued processes, in terms of the topology of weak convergence.
4 Other notions of consistency
Many other notions of consistency exist in statistics and related fields. For example, a variation on the notion of asymmetric consistency common in the literature is requiring the probability of Type I error to be bounded by only asymptotically. Most other notions of consistency focus on speeds of convergence and thus are of little interest in our context. For example, one can require the probability of error (of each type) to decrease exponentially fast; see Csiszar:04 for some characterisations.
3 One example that explains hypotheses testing
Let us consider a rather simple example that illustrates various concepts used and difficulties encountered. The example will be that of homogeneity testing (or process discrimination) for binary-valued () processes; we will consider i.i.d. processes and Markov chains, in addition to stationary ergodic distributions. For the i.i.d. case, it is easy to find a topological characterisation of those hypotheses for which consistent tests exist, so we do this for illustrative purposes. The example hypothesis considered here, homogeneity testing, is the problem we have addressed in Section 4 for asymptotic consistency in the general case. Here the main focus is on a stronger notion of consistency, namely asymmetric consistency, and on simpler processes. The goal is to illustrate the topological conditions that characterize the existence of consistent tests. The Markov case already shows why ergodic decomposition plays such an important role in finding the criteria for the existence of tests.
1 Bernoulli i.i.d. processes
Before considering dependent time series, let us see what would be the criterion for the existence of an asymmetrically consistent test for i.i.d. data, and apply it to our example of homogeneity testing.
Thus, we are speaking about Bernoulli distributions. Each such distribution can be identified with the parameter , and each hypothesis with a subset of the parameter space . Recall that a test , which receives an additional parameter , is said to be asymmetrically consistent, if, for every sample size and every , the probability of Type I error (that is, error under ) is upper-bounded by , while the probability of Type II error (error under ) goes to 0. It is easy to see that there exists an asymmetrically consistent test if and only if does not intersect . Here the topology is just that of the Euclidean distance on the parameter space. Indeed, to see that the condition is necessary, it is enough to notice that the sets of distributions satisfying are closed for any fixed and , and in particular for . Thus, if, for the given sample size , the probability that the test says is upper-bounded by for every (Type I error) then the same holds for every . We have shown that it is necessary for to be closed in order for an asymmetrically consistent test against the complement to exist. To show sufficiency, we need to construct a test for an arbitrary closed . To do so, consider a closed set and the closed set of parameters that defines it. Take a sequence of neighbourhoods over of such radii that, for every and , the probability of samples of size that the frequency of 0 falls into equals . Note that the radius of these neighbourhoods decreases with (because of the law of large numbers), which means that for every distribution there is a large enough such that (the parameter that defines) is outside . This implies that the Type II error goes to 0.
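The construction above can be sketched in code. The following is a minimal illustration under stated assumptions, not the exact test described in the text: the parameter is taken to be the probability of the symbol 1, the closed hypothesis is represented as a finite union of closed intervals, and the neighbourhood radius is taken from Hoeffding's inequality, which is conservative (the probability of a Type I error is at most the level rather than equal to it). The names `hoeffding_radius`, `distance_to_closed_set`, `asymmetric_test` and the symbol `alpha` for the level are hypothetical.

```python
import math

def hoeffding_radius(n, alpha):
    """Radius r such that P(|freq - p| > r) <= alpha for every Bernoulli
    parameter p, by Hoeffding's inequality: 2 * exp(-2 * n * r**2) = alpha."""
    return math.sqrt(math.log(2.0 / alpha) / (2.0 * n))

def distance_to_closed_set(x, intervals):
    """Euclidean distance from x to a union of closed intervals [a, b]."""
    return min(0.0 if a <= x <= b else min(abs(x - a), abs(x - b))
               for a, b in intervals)

def asymmetric_test(sample, intervals, alpha):
    """Return 0 (accept the closed hypothesis) if the empirical frequency of
    ones lies within the Hoeffding radius of the hypothesis, and 1 otherwise."""
    n = len(sample)
    freq = sum(sample) / n
    radius = hoeffding_radius(n, alpha)
    return 0 if distance_to_closed_set(freq, intervals) <= radius else 1
```

Since the radius shrinks to 0 as the sample grows, any parameter outside the closed set is eventually outside the neighbourhood, which is the reason the Type II error vanishes.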
The hypothesis of homogeneity is formulated for and states that their distributions are equal. Thus, we are speaking about distributions on pairs of samples (which, for the sake of simplicity, we consider independent). For Bernoulli distributions, this is a two-parameter space . The hypothesis is the diagonal , which is of course closed, and so a consistent test exists. Similarly, for uniform consistency the criterion is that . Thus, there is no uniformly consistent test for homogeneity, and, more generally, there is no uniformly consistent test for any against its complement. If we want to have a uniformly consistent test for homogeneity, we need to change the alternative hypothesis . For example, change to “the distributions differ by at least .” This ensures the existence of a uniformly consistent test at the cost of creating an -buffer zone between and , in which, in general, we cannot say anything about the behaviour of a test.
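The effect of the buffer zone can be made concrete with a small sketch. It assumes two independent Bernoulli samples of the same length `n`, writes `eps` for the width of the buffer, and uses Hoeffding's inequality to bound the error; the function names are hypothetical and the constants are those of this particular bound, not of the text.

```python
import math

def buffered_homogeneity_test(x, y, eps):
    """Return 0 ("same distribution") if the empirical frequencies of ones
    differ by less than eps / 2, and 1 ("differ by at least eps") otherwise."""
    fx = sum(x) / len(x)
    fy = sum(y) / len(y)
    return 0 if abs(fx - fy) < eps / 2 else 1

def uniform_error_bound(n, eps):
    """Upper bound on the error probability that holds for every pair of
    parameters that are equal or at least eps apart: an error requires one of
    the two frequencies to deviate from its mean by at least eps / 4."""
    return min(1.0, 4.0 * math.exp(-n * eps ** 2 / 8.0))

def sample_size_for(delta, eps):
    """Smallest n for which the bound above is at most delta."""
    return math.ceil(8.0 * math.log(4.0 / delta) / eps ** 2)
```

The bound does not depend on the parameters themselves, which is exactly what uniform consistency asks for; it does depend on `eps`, and degenerates as the buffer shrinks to zero.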
2 Markov chains
Moving on to the case of two-state Markov chains, we have now two -valued parameters: the probabilities to change the state. As before, the state space is binary: . Let us try to guess that the criterion for the existence of an asymmetrically consistent test is the same as in the i.i.d. case: there exists a consistent test iff , with the Euclidean topology of the parameter space, and let us see what it gives for the hypothesis of homogeneity. Consider a specific set of Markov chains, call it . These are defined so that the probability to change a state (from 0 to 1 as well as from 1 to 0) for the chain is , and the initial distribution is given by . When goes to 0, the limit of (in the space of parameters) is . The latter is a stationary distribution which is a mixture of two Dirac distributions and : one concentrated on the sequence of 0s and the other on the sequence of 1s. This is the ergodic decomposition of : . Note that for are stationary and ergodic, whereas is stationary but not ergodic. And here lies the source of the trouble. For the hypothesis of homogeneity, consider the pair of distributions . When , the limit is , which is the mixture
[TABLE]
Call this mixture . Note that, under the distribution , with probability 1/2 we observe two different sequences, one is all 0s and the other all 1s. In other words, under the ergodic decomposition of , with probability 1/2 we observe two different distributions, either or , so that . Nonetheless, the distribution itself is of course in .
Let us now demonstrate that there is no asymmetrically consistent test for against its complement to the set of Markov chain distributions. As in the i.i.d. case, the sets of distributions satisfying are closed for any fixed and , and in particular for . Thus, for any test and any given sample size , if the sets on which the test says (makes Type I error) have probability at most with respect to every for , then they also have probability at most under the distribution . The latter distribution, however, is concentrated on four pairs of -tuples . This means that for the test must say that the distributions are the same when presented with at least one of the pairs of samples or . Since this happens for every , we conclude that any such test is inconsistent: its Type II error does not go to 0: it is at least 1/4 for infinitely many under at least one of the distributions or .
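The concentration on four pairs of constant tuples can be checked with a short computation. This is a minimal sketch under the assumption that each chain starts from the uniform distribution on the two states (the stationary distribution of the symmetric chain) and that the two chains are independent; `eps` denotes the probability of changing the state, and the function names are hypothetical.

```python
def prob_constant_sample(eps, n):
    """Probability that n consecutive symbols of the chain are all equal:
    the chain must not change its state in any of the n - 1 transitions."""
    return (1.0 - eps) ** (n - 1)

def prob_constant_pair(eps, n):
    """Probability of one specific pair of constant n-tuples, for example
    (0...0, 1...1), under two independent copies of the chain. Each copy
    starts in the required state with probability 1/2."""
    return (0.5 * prob_constant_sample(eps, n)) ** 2
```

For any fixed sample size the value of `prob_constant_pair` tends to 1/4 as `eps` tends to 0, so a pair of ergodic chains that are equal in distribution produces the pair (all 0s, all 1s) with probability close to 1/4, just as two different constant processes would.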
Thus, we have shown that there is no asymmetrically consistent test for homogeneity for (stationary ergodic) Markov chains. The reason for this is that, while the set is closed, it is not closed under ergodic decompositions. Specifically, there exists a distribution (namely, ), whose ergodic decomposition is such that . Ergodic decompositions of the limit points of are what we need to take care of in the general case of stationary ergodic distributions.
As the last word about homogeneity testing for Markov chains, let us note that, unlike for stationary ergodic distributions, there exists an asymptotically consistent test for this hypothesis for this set of processes. Indeed, ergodic Markov chains mix exponentially fast (e.g., hernandez:03 ), which is enough to construct a test, considering sets around that shrink sufficiently slowly. An example of such an algorithm for the more general problem of clustering distributions with mixing can be found in Khaleghi:15clust .
3 Stationary ergodic processes
Finally, let us pass to the general case of stationary ergodic distributions. The topology of the distributional distance that we work with is a direct generalisation of the Euclidean topology of the parameter spaces on the Bernoulli and Markov distributions that we considered. In fact, the topology induced by the distributional distance on these parameter spaces is exactly the same.
As we have seen in the Markov case, the main problem is with the limit points of and their ergodic decompositions. More generally, while the set of stationary processes is closed in the topology of the distributional distance, the set of stationary ergodic distributions is not (its closure is ). This parallels the situation with Markov chains: the closure of the set of stationary ergodic Markov chains is the set of all stationary Markov chains.
For the case of asymmetric consistency for stationary ergodic processes, the pinnacle result presented in this chapter is the following criterion: there exists an asymmetrically consistent test of against its complement if and only if has probability 1 with respect to the ergodic decomposition of every process in the closure of . This is a corollary of the more general result presented in this chapter for the case when is not necessarily the complement of ; however the condition only becomes “if and only if” in the case of the complement. This result can be directly applied to the hypothesis of homogeneity testing to show that there is no asymmetrically consistent test against its complement: indeed, the proof that is not closed under taking ergodic decompositions is by the Markov example of the previous subsection.
4 Topological characterizations
In this section we formulate our criteria for the existence of consistent tests, and give constructions of the tests which are consistent if and only if consistent tests exist.
These constructions are not exactly algorithms, since one can hardly talk about algorithms whose input is an arbitrary set of distributions. However, the tests specify what should be estimated and how the decision should be made. Therefore, we provide procedures that work if anything works at all; turning them into efficient algorithms for specific problems is an interesting direction for further research.
The tests presented below are based on empirical estimates of the distributional distance. We shall first generalize this to measure the distance between a sample and a set of distributions (a hypothesis), rather than a single distribution or another sample.
For a sample and a hypothesis define
[TABLE]
For , denote the closure of with respect to the topology of .
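To make the distance between a sample and a hypothesis more tangible, here is a minimal sketch under several assumptions that are not taken from the text: the infinite weighted sum in the distributional distance is truncated at words of length `max_len`, each word of length k receives the weight 2^(-k), a distribution is represented by a function returning the probability of a word, and the hypothesis is a finite list of such functions, so that the infimum becomes a minimum. All names are hypothetical.

```python
from itertools import product

def word_frequency(sample, word):
    """Relative frequency of `word` among the len(word)-long windows of `sample`."""
    k, n = len(word), len(sample)
    if k > n:
        return 0.0
    hits = sum(1 for i in range(n - k + 1)
               if tuple(sample[i:i + k]) == tuple(word))
    return hits / (n - k + 1)

def empirical_distance(sample, prob, alphabet=(0, 1), max_len=3):
    """Truncated weighted sum of |frequency(word) - prob(word)| over all words
    of length 1..max_len, with weight 2**(-k) for words of length k."""
    total = 0.0
    for k in range(1, max_len + 1):
        for word in product(alphabet, repeat=k):
            total += 2.0 ** (-k) * abs(word_frequency(sample, word) - prob(word))
    return total

def distance_to_hypothesis(sample, hypothesis, **kwargs):
    """Distance from a sample to a finite hypothesis: the minimum of the
    empirical distances to its members."""
    return min(empirical_distance(sample, prob, **kwargs) for prob in hypothesis)

def bernoulli(p):
    """Word probabilities of the i.i.d. process with P(1) = p."""
    def prob(word):
        out = 1.0
        for symbol in word:
            out *= p if symbol == 1 else 1.0 - p
        return out
    return prob
```

For an infinite hypothesis the minimum would have to be replaced by an infimum over the whole set, which is why the constructions in this section specify what should be estimated rather than how to compute it.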
1 Uniform testing
For , the uniform test is constructed as follows. For each let
[TABLE]
Since the set is a complete separable metric space, it is easy to see that the function is measurable provided is measurable.
Theorem 4.1** (uniform testing).**
*Let be measurable subsets of . If for every then the test is uniformly consistent. Conversely, if there exists a uniformly consistent test for against then for any . *
The proof is deferred to Section 5.
The following corollary, which is easy to see already for i.i.d. distributions (see Section 3), for the general case is an immediate consequence of the second statement of the theorem above.
Corollary 4.2**.**
*There is no uniformly consistent test for any hypothesis against its complement unless one of these hypotheses is empty. *
2 Asymmetric testing
Construct the asymmetric test as follows. For each , and define the neighbourhood of -tuples around as
[TABLE]
Moreover, let
[TABLE]
be the smallest radius of a neighbourhood around that has probability not less than with respect to any process in , and let be the neighbourhood of this radius. Define
[TABLE]
Again, it is easy to see that the function is measurable, since the set is separable.
Theorem 4.3**.**
*Let be measurable subsets of . If for every then the test is asymmetrically consistent. Conversely, if there is an asymmetrically consistent test for against then for any . *
For the case when is the complement of the necessary and sufficient conditions of Theorem 4.3 coincide and give the following criterion.
Corollary 4.4**.**
Let be measurable and let . The following statements are equivalent:
- (i)
There exists an asymmetrically consistent test for against .
- (ii)
The test is asymmetrically consistent.
- (iii)
The set has probability 0 with respect to the ergodic decomposition of every in the closure of : for each .
There exists an asymmetrically (-level) consistent test for a hypothesis against its complement if and only if is closed and closed under taking ergodic decompositions, in the sense that for every in the closure of .
5 Proofs
In the proofs, we often omit the subscript from when it can cause no confusion.
The proofs use the following lemmas.
Lemma 5.1** (smooth probabilities of deviation).**
Let , , , and . Then
[TABLE]
where with being the sum of all the weights of tuples longer than in the definition of : . Further,
[TABLE]
The meaning of this lemma is as follows. For any word , if it is far away from (or close to) a given distribution (in the empirical distributional distance), then some of its shorter subwords are far from (close to) too. In other words, for a stationary distribution , it cannot happen that a small sample is likely to be close to , but a larger sample is likely to be far.
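The lemma rests on a simple counting fact relating occurrences of a pattern in a word to its occurrences in the shorter subwords of that word: every occurrence is seen in at most k - b + 1 windows of length k, where b is the length of the pattern, and in exactly that many unless it lies near one of the two ends of the word. The sketch below checks this numerically; the symbols `k` and `b` and the function names are introduced here only for illustration.

```python
def count_occurrences(word, pattern):
    """Number of (possibly overlapping) occurrences of `pattern` in `word`."""
    b = len(pattern)
    return sum(1 for i in range(len(word) - b + 1) if word[i:i + b] == pattern)

def windowed_count(word, pattern, k):
    """Total number of occurrences of `pattern` summed over all subwords of
    `word` of length k (sliding windows with step 1)."""
    return sum(count_occurrences(word[i:i + k], pattern)
               for i in range(len(word) - k + 1))

def max_multiplicity(pattern, k):
    """Largest number of length-k windows that can contain one occurrence."""
    return k - len(pattern) + 1
```

For example, in the word `000010000` the single occurrence of `1` is far from both ends and is counted in exactly three windows of length 3, whereas in `0101` each occurrence of `01` touches an end and is counted only once by windows of length 3.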
Proof 5.2**.**
Let be a tuple such that and be any sample of size . The number of occurrences of in can be bounded by the number of occurrences of in subwords of of length as follows:
[TABLE]
Indeed, summing over the number of occurrences of in all we count each occurrence of exactly times, except for those that occur in the first and last symbols. Dividing by , and using the definition (1), we obtain
[TABLE]
Summing over all , for any , we get
[TABLE]
where in the right-hand side corresponds to all the summands in the left-hand side for which , where for the rest of the summands we used . Since this holds for any , we conclude that
[TABLE]
Note that the . Therefore, for the average in the r.h.s. of (6) to be larger than , at least summands have to be larger than .
Using stationarity, we can conclude
[TABLE]
proving (2). The second statement can be proven similarly; indeed, analogously to (4) we have
[TABLE]
where we have used . Summing over different , we obtain (similar to (5)),
[TABLE]
*(since the frequencies are non-negative, there is no term here). For the average in (7) to be smaller than , at least half of the summands must be smaller than . Using stationarity of , this implies (3). *
Lemma 5.3**.**
*Let , be a sequence of processes that converges to a process . Then, for any and if for infinitely many indices , then *
Proof 5.4**.**
*The statement follows from the fact that is continuous as a function of . *
Proof 5.5** (of Theorem 4.3.).**
To establish the first statement of Theorem 4.3, we have to show that the family of tests is consistent. By construction, for any we have .
To prove the consistency of , it remains to show that
[TABLE]
for any and . To do this, fix any and let
[TABLE]
Since , we have . Suppose that there exists an , such that, for infinitely many , some samples from the -neighbourhood of -samples around are sorted as by , that is, . Then for these we have .
This means that there exists an increasing sequence , and a sequence , , such that
[TABLE]
Using Lemma 5.1, (2) (with , , , and ), and taking large enough to have , for every large enough to have , we obtain
[TABLE]
Thus,
[TABLE]
Since the set is compact (as a closed subset of a compact set ), we may assume (passing to a subsequence, if necessary) that converges to a certain . Since (9) holds for infinitely many , using Lemma 5.3 (with ) we conclude that
[TABLE]
Since the latter inequality holds for infinitely many indices we also have
[TABLE]
However, we must have for every : indeed, for it follows from Lemma 1.2, and for from Lemma 1.2, ergodic decomposition and the conditions of the theorem ( for ).
This contradiction shows that for every there are not more than finitely many for which . To finish the proof of the first statement, it remains to note that, as follows from Lemma 1.2,
[TABLE]
To establish the second statement of Theorem 4.3 we assume that there exists a consistent test for against , and we will show that for every . Take and suppose that
[TABLE]
We have
[TABLE]
*where the inequality follows from Fatou’s lemma (the functions under integral are all bounded by 1), and the equality from the consistency of . Thus, from some on we will have . Taking into account (10), we conclude . For any set the function is continuous as a function of . In particular, it holds for the set . Therefore, since , for any large enough we can find a such that , which contradicts the consistency of . Thus, , and Theorem 4.3 is proven. *
Proof 5.6** (of Theorem 4.1.).**
To prove the first statement of the theorem, we will show that the test is a uniformly consistent test for against (and hence for against ), under the conditions of the theorem. Suppose that, on the contrary, for some for every there is a process such that for some . Define
[TABLE]
which is positive since and are closed and disjoint. We have
[TABLE]
This implies that either
[TABLE]
or
[TABLE]
so that, by assumption, at least one of these inequalities holds for infinitely many for some sequence . Suppose that it is the first one, that is, there is an increasing sequence , and a sequence , such that
[TABLE]
The set is compact, hence so is its closed subset . Therefore, the sequence , must contain a subsequence that converges to a certain process . Passing to a subsequence if necessary, we may assume that this convergent subsequence is the sequence , itself.
Using Lemma 5.1, (2) (with , , , and ), and taking large enough to have , for every large enough to have , we obtain
[TABLE]
That is, we have shown that for any large enough index the inequality holds for infinitely many indices . From this and Lemma 5.3 with we conclude that . The latter holds for infinitely many ; that is, infinitely often. Therefore,
[TABLE]
However, we must have
[TABLE]
for every : indeed, for it follows from Lemma 1.2, and for from Lemma 1.2, ergodic decomposition and the conditions of the theorem.
Thus, we have arrived at a contradiction that shows that cannot hold for infinitely many for any sequence of . Analogously, we can show that cannot hold for infinitely many for any sequence of . Indeed, using Lemma 5.1, equation (3), we can show that for a large enough implies for a smaller . Therefore, if we assume that for infinitely many for some sequence of , then we will also find a for which for infinitely many , which, using Lemma 1.2 and ergodic decomposition, can be shown to contradict the fact that .
Thus, returning to (11), we have shown that from some on there is no for which holds true. The statement for can be proven analogously, thereby finishing the proof of the first statement.
To prove the second statement of the theorem, we assume that there exists a uniformly consistent test for against , and we will show that for every . Indeed, let , that is, suppose that there is a sequence such that . Assume and take . Since the test is uniformly consistent, there is an such that for every we have
[TABLE]
*Recall that, for , is a continuous function in . In particular, this holds for the set , for any given . Therefore, for every and for every large enough, implies also which contradicts . This contradiction shows for every . The case is analogous. *
6 Examples
Theorems 4.3 and 4.1 can be used to check whether a consistent test exists for such problems as identity, independence, estimating the order of a (Hidden) Markov model, bounding entropy, bounding distance, uniformity, monotonicity, etc. Some of these examples are considered in this section.
1 Simple hypotheses, identity or goodness-of-fit testing
First of all, it is obvious that sets that consist of just one or finitely many stationary ergodic processes are closed and closed under ergodic decompositions. Thus, they meet the conditions of Theorem 4.1, and so, for any pair of disjoint sets of this type, there exists a uniformly consistent test. (In particular, there is a uniformly consistent test for against iff .)
A more interesting case is identity testing, also known as goodness-of-fit: this problem consists in testing whether or not a distribution generating the sample obeys a certain given law. Thus, let , and . In such a case there is an asymmetrically consistent test for against : indeed, the conditions of Theorem 4.4 are easily verified. It is worth noting that (asymmetric) identity testing is a classical problem of mathematical statistics, with solutions (e.g. based on Pearson’s statistic) for i.i.d. data (e.g. Lehmann:86 ), and Markov chains Billingsley:61 . For stationary ergodic processes, BRyabko:06b gives an asymmetrically consistent test when has a finite and bounded memory, and Ryabko:103s for the general case of stationary ergodic real-valued processes.
As far as uniform testing is concerned, it is, first of all, clear that, just like in the i.i.d. case (cf. Section 3), for any there is no uniformly consistent test for identity. Indeed, as we have seen (Corollary 4.2), for any non-empty there is no uniformly consistent test for against provided both hypotheses are non-empty. One might suggest at this point that, as in the i.i.d. case, a uniformly consistent test exists if we restrict to those processes that are sufficiently far from , for example, by introducing some -padding around . However, this is not the case. We can prove an even stronger negative result.
Proposition 6.1**.**
*Let , and let . There is no uniformly consistent test for against . *
The following conclusion can be made from this proposition. While distributional distance is well-suited for characterizing those hypotheses for which consistent tests exist, it is not suited for formulating the actual hypotheses.
Apparently, a stronger distance is needed for the latter.
Proof 6.2** (of Proposition 6.1).**
*Consider the process on pairs , such that the distribution of is , the distribution of is and the two components and are independent; in other words, the distribution of is . Consider also a two-state stationary ergodic Markov chain , with two states and , whose transition probabilities are $\begin{pmatrix}1-p & p\\ q & 1-q\end{pmatrix}$, where . The limiting (and initial) probability of the state is and that of the state is . Finally, the process is constructed as follows: if is in the state and otherwise (here it is assumed that the chain generates a sequence of outcomes independently of ). Clearly, for every satisfying the process is stationary ergodic. Let , for all , where is a parameter to be defined shortly. Denote the distribution of the process with parameters . With these parameters, independently of (i.e., the Markov chain underlying spends time in the first state). Find sufficiently small so as to have for all sufficiently large , as is always possible since uniformly in . Thus, for all . However, where is the stationary distribution with and . Therefore, and , so that by Theorem 4.1 there is no uniformly consistent test for against . *
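The switching mechanism used in this proof can be illustrated numerically. The following is a minimal sketch, not the construction of the proof itself: the two component processes are replaced by hypothetical i.i.d. stand-ins (`sample_a`, `sample_b`), and all function names are assumptions made for illustration. It simulates a two-state chain with transition matrix $\begin{pmatrix}1-p & p\\ q & 1-q\end{pmatrix}$ started from its stationary distribution, whose state probabilities are the standard values $q/(p+q)$ and $p/(p+q)$.

```python
import random


def stationary_probs(p, q):
    """Stationary probabilities of states 1 and 2 for [[1-p, p], [q, 1-q]]."""
    return q / (p + q), p / (p + q)


def simulate_switched(n, p, q, sample_a, sample_b, seed=0):
    """Emit sample_a(rng) while the chain is in state 1, sample_b(rng) otherwise.

    sample_a and sample_b are hypothetical stand-ins for the two component
    processes; here they are called independently at every step.
    """
    rng = random.Random(seed)
    pi1, _ = stationary_probs(p, q)
    # Start from the stationary distribution so the chain is stationary.
    state = 1 if rng.random() < pi1 else 2
    states, outputs = [], []
    for _ in range(n):
        states.append(state)
        outputs.append(sample_a(rng) if state == 1 else sample_b(rng))
        if state == 1:
            state = 2 if rng.random() < p else 1
        else:
            state = 1 if rng.random() < q else 2
    return states, outputs


if __name__ == "__main__":
    states, _ = simulate_switched(
        50_000, 0.1, 0.3,
        sample_a=lambda rng: int(rng.random() < 0.5),  # assumed stand-in
        sample_b=lambda rng: int(rng.random() < 0.9),  # assumed stand-in
    )
    print(states.count(1) / len(states), stationary_probs(0.1, 0.3)[0])
```

The empirical fraction of time spent in the first state approaches $q/(p+q)$, which is the quantity the proof controls through the choice of the parameters.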
2 Markov and Hidden Markov processes: bounding the order
Let us next consider finite-state Markov and hidden Markov processes.
For any , there is an asymmetrically consistent test of the hypothesis = “the process is Markov of order not greater than ” against . For any , there is an asymmetrically consistent test of =“the process is given by a Hidden Markov process with not more than states” against . Indeed, in both cases (-order Markov, Hidden Markov with not more than states), the hypothesis is a parametric family, with a compact set of parameters, and a continuous function mapping parameters to processes (that is, to the space ). Since the space of stationary processes is compact, Weierstrass theorem then implies that the image of such a compact parameter set is closed (and compact). Moreover, in both cases is closed under taking ergodic decompositions. Thus, by Theorem 4.3, there exists an asymmetrically consistent test.
The problem of estimating the order of a (hidden) Markov process based on sampling has been addressed in a number of works. In the context of hypothesis testing, asymmetrically consistent tests for against with were given in Anderson:57 ; see also Billingsley:61 . The existence of non-uniformly consistent tests (a notion weaker than that of asymmetric consistency) for against , and of against , was established in Kieffer:93 . Asymmetrically consistent tests for against were obtained in BRyabko:06a , while the formulation above, which includes the case of asymmetric testing for against , is from Ryabko:121c .
Considering the set of all finite-memory processes, it is easy to see that there is no asymmetrically consistent test for this set against its complement: indeed, , so by Corollary 4.3 there is no test. There is also no asymptotically consistent test for this hypothesis, even though it is possible to construct an estimator of the order of a Markov chain that tends to infinity if the process is not Markov; see Morvai:05 and references.
3 Smooth parametric families
From the discussion in the previous example we can see that the following generalization is valid. Let be a set of processes that is continuously parametrized by a compact set of parameters. If is closed under taking ergodic decompositions, then there is an asymmetrically consistent test for against . In particular, this strengthens the mentioned result of Kieffer:93 , since a stronger notion of consistency is used and a more general class of parametric families is considered.
Clearly, a similar statement can be derived for uniform testing: given two disjoint sets and each of which is continuously parametrized by a compact set of parameters and is closed under taking ergodic decompositions, there exists a uniformly consistent test of against .
4 Homogeneity testing or process discrimination
This problem consists in testing, given two samples and , whether the distributions generating these samples are the same or different. We have considered this problem in detail in Section 4 for the case of asymptotic consistency and stationary ergodic distributions (and -processes), and in Section 3 for the case of asymmetric and uniform consistency and smaller sets of distributions. The results can be summarized in the following table. Here we omit uniform testing in view of Corollary 4.2.
5 Independence
Again, we are given two samples, and . The hypothesis of independence is that the first process is independent from the second: for any and any .
Let be the set of all stationary ergodic processes (on pairs) satisfying this property.
Proposition 6.3**.**
*There is no asymmetrically consistent test for independence (for jointly stationary ergodic samples). *
Proof 6.4**.**
The example is based on the so-called translation process, which is constructed as follows. Fix some irrational and select uniformly at random. For each let (that is, the previous element is shifted by to the right, considering the [0,1] interval looped). The samples are obtained from by thresholding at , i.e. (here can be considered hidden states). This process is stationary and ergodic; besides, it has 0 entropy rate Shields:98 , and this is not the last of its peculiarities.
Take now two independent copies of this process to obtain a pair . The resulting process on pairs, which we denote , is stationary, but it is not ergodic. To see the latter, observe that the difference between the corresponding hidden states remains constant. In fact, each initial state corresponds to an ergodic component of our process on pairs. By the same argument, these ergodic components are not independent. Thus, we have taken two independent copies of a stationary ergodic process, and obtained a stationary process which is not ergodic and whose ergodic components are pairs of processes that are not independent!
*To apply Corollary 4.4, it remains to show that the process we constructed can be obtained as a limit of stationary ergodic processes on pairs. To see this, consider, for each , a process , whose construction is identical to except that instead of shifting the hidden states by we shift them by where are i.i.d. uniformly random on . It is easy to see that in distributional distance, and all are stationary ergodic. Thus, if is the set of all stationary ergodic distributions on pairs, we have found a distribution such that . We can conclude that there is no -level consistent test for against its complement. *
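The key observation of this proof, that two independent copies of the translation process keep a constant offset between their hidden states, can be checked with a short simulation. This is a minimal sketch under stated assumptions: the threshold is taken to be 1/2, the shift is the floating-point approximation of $\sqrt{2}-1$ (so it is irrational only up to machine precision), and the function names are hypothetical.

```python
import math


def rotation_process(n, alpha, start, threshold=0.5):
    """Hidden states of a circle rotation by alpha and their thresholded outputs.

    The threshold value 0.5 is an assumption made for illustration.
    """
    hidden, sample = [], []
    h = start
    for _ in range(n):
        hidden.append(h)
        sample.append(1 if h > threshold else 0)
        h = (h + alpha) % 1.0  # shift to the right on the looped [0, 1) interval
    return hidden, sample


def hidden_offsets(h1, h2):
    """Offsets (mod 1) between the hidden states of two copies."""
    return [(a - b) % 1.0 for a, b in zip(h1, h2)]


if __name__ == "__main__":
    alpha = math.sqrt(2) - 1
    h1, x1 = rotation_process(1000, alpha, start=0.123)
    h2, x2 = rotation_process(1000, alpha, start=0.789)
    offsets = hidden_offsets(h1, h2)
    # The offset never changes (up to rounding), so the pair process
    # remembers its initial offset forever and cannot be ergodic.
    print(min(offsets), max(offsets))
```

Each value of the initial offset corresponds to a different ergodic component of the process on pairs, which is why the pair is stationary but not ergodic.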
In contrast to the situation with homogeneity testing described in Section 3, testing independence becomes possible if we restrict the processes to be Markov.
Indeed, using the notation of the previous sections, it is easy to see that Theorem 4.3 implies that there exists an asymmetrically consistent test for against , for any given . Analogously, if we confine to Hidden Markov processes of a given order, then asymmetric testing is possible. That is, there exists an asymmetrically consistent test for against , for any given .
7 Open problems
In spite of the rather general results on the existence of tests presented in this chapter, perhaps it would not be an exaggeration to say that the most important questions remain open. This section attempts to state these precisely and summarize them.
1 Relating the notions of consistency
Before delving deeper into problems relating various notions of consistency and generalizing the corresponding results, note that two of the notions of consistency considered, asymmetric (-level) consistency and asymptotic consistency, require a certain convergence to hold with probability 1. Naturally, one could replace this convergence with convergence in probability. Let us call the resulting notion weak asymmetric or asymptotic consistency, and those introduced above let us call strong. While weak consistency indeed appears weaker at first sight, it is easy to see, as Nobel Nobel:06 remarks, that weak asymptotic consistency implies strong asymptotic consistency for the case of i.i.d. or strongly mixing processes. It is similarly easy to verify that the same is true for asymmetric consistency. Moreover, for asymmetric consistency, the criterion given in Corollary 4.4 holds equally well for strong and for weak consistency, so in the case weak and strong asymmetric consistency are equivalent for stationary ergodic distributions as well. This suggests that these notions may be equivalent in general.
Conjecture 7.1** (weak=strong).**
*For stationary ergodic distributions, if there exists a weakly asymmetrically consistent (weakly asymptotically consistent) test, then there exists a strongly asymmetrically consistent (strongly asymptotically consistent) test. *
Passing to the relations between the notions of consistency, it might at first glance seem that asymmetric consistency is rather weak, since one of the errors does not go to zero. However, note that it is fixed at the given level independently of the sample size, and uniformly over , making the resulting notion very strong. In fact, from the discussion on the i.i.d. processes in Section 3, one can see that, for i.i.d. examples, uniform consistency is strictly stronger than asymmetric consistency, and asymmetric consistency is strictly stronger than asymptotic consistency (in terms of the existence of tests). One can conjecture that this is the case for stationary ergodic distributions as well.
Conjecture 7.2** (uniform $\Rightarrow$ asymmetric $\Rightarrow$ asymptotic consistency).**
*Let . If there exists a uniformly consistent test for against , then there exists an asymmetrically consistent test for this pair of hypotheses. If there exists an asymmetrically consistent test for against , then there exists an asymptotically consistent test for this pair of hypotheses. The opposite implications do not hold. *
Note that the implication “uniform $\Rightarrow$ asymptotic consistency” is rather obvious, and it is also obvious that the opposite does not hold. The question is, therefore, about the place of asymmetric consistency in the middle; more precisely, whether the strict inclusion generalises from the i.i.d. to the stationary ergodic case. It remains open to see whether the relation between the notions of consistency (uniform, asymmetric, asymptotic, weak/strong) that holds for i.i.d. processes carries over to the stationary ergodic case.
2 Characterizing hypotheses for which consistent tests exist
The main open problem that remains is to find necessary and sufficient conditions for the existence of each kind of the tests: uniform, asymmetric, and asymptotic.
Problem 7.3**.**
*Find necessary and sufficient conditions on hypotheses for the existence of (uniformly, asymmetrically, asymptotically) consistent tests. *
The only case for which the presented necessary and sufficient conditions coincide is the case of asymmetric consistency when . It is not known whether the same conditions are necessary and sufficient for general pairs (i.e., when is not necessarily the complement of ). However, the fact that for this case we have an “if and only if” criterion, suggests that the topology of the distributional distance is indeed the right one to consider for such characterisations.
Another important problem is to generalize the results of Chapter 4 to real-valued processes.
Problem 7.4**.**
*Find generalisations of Theorems 4.3, 4.1 to real-valued processes. *
The main difference for the real-valued case is that, in the finite-alphabet case, the distributional distance in the form (4) gives a compact space of distributions. This fact has been relied upon heavily in the proofs of the corresponding theorems. The distributional distance in the form (5) does not result in a compact space of distributions. The general form (2) can give a compact space; indeed, as mentioned in Chapter 1, this is the case if the sets form a standard basis. However, as Gray:88 mentions, there is no easy construction of such a basis for the real-valued case, even though such a basis exists. On the other hand, an explicit construction is required in order to speak about distance estimates.
3 Independence testing
Recall the problem of independence from Section 5: given two samples, and , it is required to test whether the process generating the first sample is independent from the one generating the second.
It is interesting to note that for the case of i.i.d. data, the problems of homogeneity testing and independence testing can be reduced to one another. The situation is different for dependent data, as we have seen already for the case of (discrete-state) Markov chains: for these processes, there exists an asymmetric test for independence but not for homogeneity. Moreover, whereas for homogeneity (process discrimination) we have seen in Section 4 that there is no asymptotically consistent test, for independence the question of the existence of such a test remains open.
Thus, we can formulate what is known and what is not known about this problem in the following table, which can be compared to the one about homogeneity testing (Table 1).
Chapter 5 Generalizations
In this chapter we outline a number of generalizations of the results described in this volume. Some of these have already been made, while others present interesting directions for future research.
1 Other distances
The empirical distributional distance on which the results of the previous chapters hinge can be seen as an ordinate way of counting frequencies of everything. One may wonder whether the same theoretical consistency results can be obtained while allowing one to benefit from using some of the more sophisticated tools in the box.
This is, indeed, possible, by considering different distances between processes, and then plugging in their estimates into the same algorithms. Here we try to see what distances can be used and which properties are required. While doing so we are mostly concerned with generalizing the results of Chapters 2 and 3, as the theory of hypothesis testing of Chapter 4 is somewhat more delicate.
Introduce the notation for the -dimensional marginal distribution of a time-series distribution .
1 Distances
Observe that the distributional distance in its more-specified formulations (4) and (5) has the form
[TABLE]
where are summable positive real weights and is a certain distance between -dimensional marginal distributions.
It is easy to see that distances of this form can be consistently estimated, as long as can be consistently estimated for each ; this is formalized in the following statement.
Proposition 1.1** (estimating sum-based distances).**
*Let be a set of process distributions. Let be a series of distances on the spaces of distributions over that are bounded uniformly in , and such that there exists a series of their consistent estimates: a.s., whenever are chosen to generate the sequences. Then the distance given by (1) can be consistently estimated using the estimate . *
Clearly, the distributional distance is an example of a distance in the form (1), and it satisfies the conditions of the proposition with being the set of all stationary ergodic processes. Another example is the telescope distance considered in the next subsection.
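To make the plug-in idea of the proposition concrete, here is a minimal sketch of an empirical sum-based distance for finite-alphabet samples. Several choices are assumptions made for illustration and are not taken from the text: the weights are $2^{-k}$, the distance between $k$-dimensional marginals is the $L_1$ distance between empirical block frequencies (bounded by 2 uniformly in $k$), and the sum is truncated at a hypothetical `max_k`.

```python
from collections import Counter


def block_frequencies(seq, k):
    """Empirical frequencies of all length-k blocks (overlapping windows)."""
    n_blocks = len(seq) - k + 1
    counts = Counter(tuple(seq[i:i + k]) for i in range(n_blocks))
    return {word: c / n_blocks for word, c in counts.items()}


def empirical_sum_distance(x, y, max_k=5):
    """Weighted sum over k of L1 distances between empirical k-block frequencies.

    Weights 2**-k and the truncation at max_k are illustrative assumptions.
    """
    if min(len(x), len(y)) < max_k:
        raise ValueError("samples must be at least max_k long")
    total = 0.0
    for k in range(1, max_k + 1):
        fx = block_frequencies(x, k)
        fy = block_frequencies(y, k)
        l1 = sum(abs(fx.get(w, 0.0) - fy.get(w, 0.0)) for w in set(fx) | set(fy))
        total += 2.0 ** (-k) * l1
    return total


if __name__ == "__main__":
    alternating = [0, 1] * 50
    constant = [0] * 100
    print(empirical_sum_distance(alternating, alternating))
    print(empirical_sum_distance(alternating, constant))
```

In practice the truncation level would be allowed to grow with the sample size; keeping it fixed here only keeps the sketch short.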
2 Telescope distance
The telescope distance, introduced in Ryabko:13red+ , is, in fact, a scheme for defining distances between processes. In order to define the telescope distance, we first start with a metric on distributions on . For two probability distributions and on for some and a set of measurable functions on , one can define the distance
[TABLE]
This metric in its general form has been studied since at least Zolotarev:83 and includes Kolmogorov-Smirnov Kolmogorov:33 and Kantorovich-Rubinstein Kantorovich:57 metrics as special cases. It is measurable under mild conditions; in particular, separability of is sufficient for this. Moreover, it is easy to check that is a metric on the space of probability distributions over if and only if generates .
An example of the sets are the sets of hyperplanes in , .
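As an illustration of a distance of this kind in the simplest setting, the sketch below estimates it for one-dimensional samples. It assumes the metric takes the form of a supremum, over the functions in the chosen set, of the absolute difference of expectations, and it takes that set to be the indicators of half-lines $(-\infty, t]$, which yields the Kolmogorov-Smirnov statistic. The function name is hypothetical.

```python
def empirical_sup_distance(xs, ys):
    """Sup over thresholds t of |mean(1[x <= t]) - mean(1[y <= t])|.

    Only thresholds at observed values need checking, since the empirical
    averages change only there.
    """
    thresholds = sorted(set(xs) | set(ys))
    best = 0.0
    for t in thresholds:
        fx = sum(1 for v in xs if v <= t) / len(xs)
        fy = sum(1 for v in ys if v <= t) / len(ys)
        best = max(best, abs(fx - fy))
    return best


if __name__ == "__main__":
    print(empirical_sup_distance([1, 2, 3, 4], [3, 4, 5, 6]))
```

Finding the function that maximizes the difference of averages is where a binary classifier could be used in place of the exhaustive search over thresholds.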
Based on we can construct a distance between time-series probability distributions. For two time-series distributions and sets of functions on , , we take the between -dimensional marginal distributions of and for each , and sum them all up with decreasing weights.
Definition 1.2** (telescope distance).**
For two processes and and a sequence of sets of functions define the telescope distance
[TABLE]
where , is a sequence of positive summable real weights (e.g., the weights we were using before, ).
The empirical telescope distance is defined as
[TABLE]
It is shown in Ryabko:13red+ that the empirical telescope distance so defined is a consistent estimate of the telescope distance, if the sets are separable sets of indicator functions of finite VC dimension. The separability condition comes from Adams:12 where the corresponding uniform convergence result is established.
The main appeal of the telescope distance is that it can be estimated using binary classification methods developed for i.i.d. data. Such methods abound in the machine learning literature. Thus, the telescope distance allows one to channel these methods for use in problems involving time series, such as clustering and the three-sample problem considered in Chapters 2, 3.
The details of the algorithms, as well as the proofs and experimental results, can be found in Ryabko:13red+ .
3 Distances
A different way to construct a distance between time-series distributions based on their finite-dimensional marginals is to use the supremum instead of summation in (1):
[TABLE]
Some commonly used metrics are defined in the form (4) or have natural interpretations in this form, as the following two examples show.
Definition 1.3** (total variation).**
*For time-series distributions the total variation distance between them is defined as . *
It is easy to see that , so that the total variation distance has the form (4).
For stationary ergodic distributions this distance is not very useful, since it just gives the discrete distance: if and only if . This follows from the fact that any two different stationary ergodic distributions are singular with respect to one another.
Another example of a -distance is the distance, defined in Section 4. To see that it is indeed a -distance, consider the following definition of it, which is equivalent to the previous one (see, e.g. Shields:96 ; Ornstein:90 )
[TABLE]
where is the set of all distributions over generating a pair of samples whose marginal distributions are and correspondingly.
As explained in Section 4, this distance turns out to be too strong for stationary ergodic processes but still useful for -processes, since it is only possible to construct its consistent estimates for the latter set.
4 Non-metric distances
So far we have been considering distances that constitute a metric on the space of all process distributions, or on the space of stationary process distributions. In particular, they have the property of exactness, that is if and only if . This allowed us to solve such problems as clustering (with respect to distribution), where we cluster together those and only those samples that were generated by the same distribution.
Sometimes a weaker goal may be appropriate. For example, one may wish to distinguish only between distributions that have different single-dimensional means and variances, or some other characteristics. Depending on the characteristics of the processes studied, it may be more or less straightforward to establish the consistency of their empirical estimates. However, if consistent empirical estimates are available, it should be reasonably straightforward to translate the algorithms and the results on clustering and change-point problems to such distances.
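A minimal sketch of such a non-exact distance, assuming one only wishes to separate processes by their single-dimensional mean and variance, is given below. The function name and the choice of simply adding the two absolute differences are illustrative assumptions.

```python
from statistics import fmean, pvariance


def moment_distance(x, y):
    """|difference of sample means| + |difference of sample variances|."""
    return abs(fmean(x) - fmean(y)) + abs(pvariance(x) - pvariance(y))


if __name__ == "__main__":
    # Different samples with equal mean and variance are at distance zero,
    # so the exactness property fails, as expected for a non-metric distance.
    print(moment_distance([0, 1, 0, 1], [0, 0, 1, 1]))
    print(moment_distance([0, 0, 0, 0], [1, 1, 1, 1]))
```

Clustering with such a distance groups together samples that agree on the chosen characteristics, even when they were generated by different distributions.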
5 AMS distributions
A particular instance of non-metric distances described in the previous section are distances between the asymptotic-mean distributions of ergodic (non-stationary) or AMS distributions. For non-stationary distributions, in general, one cannot make any inference about the distribution of any initial segment given just one time series sample, which is the case in all the problems we have considered. However, we can make inference about the asymptotic means. We can thus consider the distance between the asymptotic-mean distributions. It is, in fact, the same distributional distance that we have worked with in this volume, only considered as the distance between asymptotic-mean distributions and not the process distributions themselves. Of course, its empirical estimates simply carry over. Note that, considered as a distance between process distributions, it is not a metric, since we can have for that are different (but have the same asymptotic mean). With this distinction in mind, all the formulations of basic-inference, clustering and change-point problems translate to this more general setting, with “ergodic” substituted for “stationary ergodic” and “AMS” for “stationary,” and the proofs carry over intact.
2 Piece-wise stationary processes
When dealing with change-point problems (Section 2), we have defined a set of process distributions that can be seen as a generalization of stationary process distributions: piece-wise stationary processes. These are constructed by defining a sequence of integer-valued change points, such that between each two consecutive change points the distribution is stationary (or stationary ergodic).
This kind of construction has been widely studied for more restrictive sets of processes, and mainly for i.i.d. processes, resulting in piece-wise i.i.d. models; see, for example Willems:96 ; Gyorgy:12 and references.
For the stationary ergodic case, we have seen that meaningful inference is possible for finitely many change points and linear-sized (in the total sample size ) segments between change points. While, constrained by the nature of the change-points problems we have considered, we have only dealt with fixed sample size and offline formulations, the distributions can be defined in a similar fashion on infinite sequences. A piece-wise stationary distribution is thus identified with a sequence of stationary distributions and a sequence of change points. A number of inference problems can be formulated about these processes, including versions of the clustering and hypotheses-testing problems considered in this volume. Offline clustering and identity testing appear to be the first interesting problems to explore in this regard.
3 Beyond time series
1 Processes over multiple dimensions
Time series, or discrete-time process distributions that are the subject of this volume, can be seen as discrete-coordinate stochastic processes extending to infinity in one dimension. One can also consider discrete-coordinate multi-dimensional stochastic processes. The concepts of stationarity and ergodicity can be defined similarly to the single-dimensional case. Thus, for a dimension , one can consider a process indexed by , over the space where is the Borel sigma-algebra. Such processes are simply probability measures over . Stationarity can be defined using shifts along each coordinate . A process measure is called stationary if it is preserved under shifts, that is for all and all Borel . Ergodic theorems can be established for such processes, see, for example, Krengel:85 . This is all one needs to use empirical estimates of the distributional distance, and thus formulate and solve basic-inference as well as clustering problems, similar to how it is done in Sections 3, 2. The construct of the distributional distance appears to be general enough even for some results on hypothesis testing of Chapter 4 to be generalizable to this setting.
Change-point problems morph into something much more complex, as change points become change boundaries. It thus appears interesting to explore what kind of change-point-like problems admit solutions in this more general setting.
2 Infinite random graphs
Another way to generalize time series is to consider infinite random graphs. The necessary probability-theoretic foundations have been laid out in Aldous:07 ; lyons2016probability , while the work Benjamini:12 uses these to introduce the notions and establish some basic facts of the ergodic theory on these spaces. It turns out that the distributional distance is a general enough construction to be ported directly to this more general case, and some of the results of this volume, including Theorem 4.3, can be generalized with little extra work. This is done in the work Ryabko:17gratest , which also outlines a number of interesting research directions that emerge in this area.
References
- [1] Terrence M. Adams and Andrew B. Nobel. Uniform approximation of Vapnik-Chervonenkis classes. Bernoulli, 18(4):1310–1319, 2012.
- [2] David Aldous and Russell Lyons. Processes on unimodular random networks. Electron. J. Probab., 12:no. 54, 1454–1508, 2007.
- [3] P.H. Algoet. Universal schemes for prediction, gambling and portfolio selection. The Annals of Probability, 20(2):901–941, 1992.
- [4] T. Anderson and L. Goodman. Statistical inference about Markov chains. Ann. Math. Stat., 28(1):89–110, 1957.
- [5] M. Basseville and I.V. Nikiforov. Detection of abrupt changes: theory and application. Prentice Hall information and system sciences series. Prentice Hall, 1993.
- [6] Tugkan Batu, Eldar Fischer, Lance Fortnow, Ravi Kumar, Ronitt Rubinfeld, and Patrick White. Testing random variables for independence and identity. In Foundations of Computer Science, 2001. Proceedings. 42nd IEEE Symposium on, pages 442–451. IEEE, 2001.
- [7] Tugkan Batu, Ravi Kumar, and Ronitt Rubinfeld. Sublinear algorithms for testing monotone and unimodal distributions. In STOC, volume 4, pages 381–390, 2004.
- [8] Itai Benjamini and Nicolas Curien. Ergodic theory on stationary random graphs. Electron. J. Probab., 17:no. 93, 1–20, 2012.
