Asymptotic nonparametric statistical analysis of stationary time series

Daniil Ryabko

arXiv:1904.00173·math.ST·April 2, 2019

TL;DR

This paper reviews asymptotic nonparametric statistical methods for stationary time series, highlighting what can and cannot be achieved with stationarity assumptions alone, including clustering, change point detection, and hypothesis testing.

Contribution

It summarizes recent results on the asymptotic consistency of algorithms for stationary time series, clarifying the limits and possibilities of statistical inference under minimal assumptions.

Findings

01 Certain problems like homogeneity are impossible to solve under stationarity alone.

02 Algorithms for clustering and change point detection can be asymptotically consistent.

03 A topological criterion for the existence of consistent tests is proposed.

Abstract

Stationarity is a very general, qualitative assumption that can be assessed on the basis of application specifics. It is thus a rather attractive assumption to base statistical analysis on, especially for problems for which less general qualitative assumptions, such as independence or finite memory, clearly fail. However, it has long been considered too general to allow for statistical inference to be made. One of the reasons for this is that rates of convergence, even of frequencies to the mean, are not available under this assumption alone. Recently, it has been shown that, while some natural and simple problems, such as homogeneity, are indeed provably impossible to solve if one only assumes that the data are stationary (or stationary ergodic), many others can be solved using rather simple and intuitive algorithms. The latter problems include clustering and change point estimation. In…

Tables

Table 1: Existence of a consistent test for the hypothesis of homogeneity against its complement, for different notions of consistency and classes of processes

	I.i.d.	Markov	Stationary ergodic
Asymmetric consistency	Test exists	No test	No test
Asymptotic consistency	Test exists	Test exists	No test (Theorem 4.2)

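Table 2: Existence of a consistent test for the hypothesis of independence against its complement, for different notions of consistency and classes of processes. The entries that differ from homogeneity testing (Table 1), marked in bold in the original, are the Markov column under asymmetric consistency and the stationary ergodic column under asymptotic consistency.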

	I.i.d.	Markov	Stationary ergodic
Asymmetric consistency	Test exists	Test exists	No test (Proposition 6.3)
Asymptotic consistency	Test exists	Test exists	Open question

Equations

X_{1},\dots,X_{k},X_{k+1},\dots,X_{n}

\nu({\bf x},B):=\left\{\begin{array}{rl}\frac{1}{n-k+1}\sum_{i=1}^{n-k+1}I_{\{(X_{i},\dots,X_{i+k-1})\in B\}}&\text{ if }n\geq k,\\ 0&\text{ otherwise.}\end{array}\right.

\nu\big((0.5,1.5,1.2,1.4,2.1),([1.0,2.0]\times[1.0,2.0])\big)=1/2.

\rho(X_{1..j}\in B)=\rho(X_{i..i+j-1}\in B).

\lim_{n\to\infty}\nu(X_{1..n},B)=v_{B}.

\lim_{n\to\infty}\nu(X_{1..n},B)=\rho(B)\text{ a.s.}

\rho(B)=\int dW_{\rho}(\mu)\,\mu(B)

d(\rho_{1},\rho_{2}):=\sum_{i=1}^{\infty}w_{i}|\rho_{1}(B_{i})-\rho_{2}(B_{i})|.

w_{k}:=1/k(k+1).

d(\rho_{1},\rho_{2}):=\sum_{k=1}^{\infty}w_{k}\sum_{B\in A^{k}}|\rho_{1}(B)-\rho_{2}(B)|.

d(\rho_{1},\rho_{2})=\sum_{m,l=1}^{\infty}w_{m}w_{l}\sum_{B\in B^{m,l}}|\rho_{1}(B)-\rho_{2}(B)|.

\hat{d}({\bf x},{\bf y}):=\sum_{i=1}^{\infty}w_{i}|\nu({\bf x},B_{i})-\nu({\bf y},B_{i})|.

\hat{d}({\bf x},\rho):=\sum_{i=1}^{\infty}w_{i}|\nu({\bf x},B_{i})-\rho(B_{i})|,

|\nu((X_{1},\dots,X_{k}),B_{j})-\rho(B_{j})|<\varepsilon/(4Jw_{j})

|\hat{d}({\bf x},{\bf y})-d(\rho_{\bf x},\rho_{\bf y})|=\\ \left|\sum_{i=1}^{\infty}w_{i}\big(|\nu({\bf x},B_{i})-\nu({\bf y},B_{i})|-|\rho_{\bf x}(B_{i})-\rho_{\bf y}(B_{i})|\big)\right|\\ \leq\sum_{i=1}^{\infty}w_{i}\big(|\nu({\bf x},B_{i})-\rho_{\bf x}(B_{i})|+|\nu({\bf y},B_{i})-\rho_{\bf y}(B_{i})|\big)\\ \leq\sum_{i=1}^{J}w_{i}\big(|\nu({\bf x},B_{i})-\rho_{\bf x}(B_{i})|+|\nu({\bf y},B_{i})-\rho_{\bf y}(B_{i})|\big)+\varepsilon/2\\ \leq\sum_{i=1}^{J}w_{i}(\varepsilon/(4Jw_{i})+\varepsilon/(4Jw_{i}))+\varepsilon/2=\varepsilon,

\hat{d}({\bf x},{\bf y}):=\sum_{k=1}^{k_{n}}w_{k}\sum_{B\in A^{k}}|\nu({\bf x},B)-\nu({\bf y},B)|,

\hat{d}({\bf x},{\bf y}):=\sum_{m=1}^{m_{n}}\sum_{l=1}^{l_{n}}w_{m}w_{l}\sum_{B\in B^{m,l}}|\nu({\bf x},B)-\nu({\bf y},B)|.

T^{m,l}:=\sum_{B\in B^{m,l}}|\nu(X_{1..n_{1}},B)-\nu(Y_{1..n_{2}},B)|

s=\min_{\substack{i=1..n_{1},\ j=1..n_{2}\\ X_{i}\neq Y_{j}}}|X_{i}-Y_{j}|,

\sum_{l=1}^{\infty}w_{m}w_{l}T^{m,l}=w_{m}w_{\log s^{-1}}T^{m,\log s^{-1}}+\sum_{l=1}^{\log s^{-1}}w_{m}w_{l}T^{m,l}

L({\bf x},{\bf y},{\bf z}):=\left\{\begin{array}{rl}\text{``x''}&\text{ if }\hat{d}({\bf x},{\bf z})\leq\hat{d}({\bf y},{\bf z})\\ \text{``y''}&\text{ otherwise, }\end{array}\right.

L({\bf x},{\bf y},{\bf z})=\text{``x''}

L({\bf x},{\bf y},{\bf z})=\text{``y''}

\hat{d}({\bf y},{\bf z})\to d(\rho_{\bf y},\rho_{\bf z})\neq 0.

\lim_{n\rightarrow\infty}\mathbb{E}D_{n}((X_{1},\dots,X_{n}),(Y_{1},\dots,Y_{n}))=\left\{\begin{array}{ll}0&\text{ if $\rho_{\bf x}=\rho_{\bf y}$,}\\ 1&\text{ otherwise. }\end{array}\right.

\nu_{xy}(X_{1}\neq Y_{1})\leq\varepsilon.

\lim_{n\to\infty}\bar{s}_{n}((X_{1},\dots,X_{n}),(Y_{1},\dots,Y_{n}))=\bar{d}(\rho_{1},\rho_{2})\quad\rho_{1}\times\rho_{2}\text{-a.s.}

\lim_{n\rightarrow\infty}\mathbb{E}D_{n}((X_{1},\dots,X_{n}),(Y_{1},\dots,Y_{n}))=\left\{\begin{array}{ll}0&\text{ if $\rho_{\bf x}(X_{1}=0)=\rho_{\bf y}(Y_{1}=0)$,}\\ 1&\text{ otherwise. }\end{array}\right.

\mathbb{E}_{\rho_{0}\times\rho_{0}}D_{t_{0}}((X_{1},\dots,X_{t_{0}}),(Y_{1},\dots,Y_{t_{0}}))<\varepsilon,

\mathbb{E}_{\rho_{u1}\times\rho_{d1}}D_{k}((X_{1},\dots,X_{t_{1}}),(Y_{1},\dots,Y_{t_{1}}))>1-\varepsilon,

Full text

This book is about making statistical inference from stationary discrete-time processes. The assumption of stationarity alone is often considered too weak to make any meaningful inference. Here this view is challenged by showing that, while some rather basic problems indeed can be proven not to admit any solution in this setting, surprisingly many are solvable without any further assumptions. These include such complex problems as clustering and change-point analysis. Some general results characterizing those problems that admit a solution are also presented.

The material in this volume is presented in a way that presumes familiarity with basic concepts of probability and statistics, up to and including probability distributions over spaces of infinite sequences. All the required background material can be found in the excellent monograph Gray:88, which also contains a much deeper exposition of some of the key concepts used here, such as the distributional distance. Familiarity with ergodic theory is not required for understanding the material presented in this volume. Indeed, with two exceptions, the proofs do not rely on any facts deeper than the convergence of frequencies. One exception is Chapter 4, which deals with hypothesis testing and provides a characterisation of hypotheses for which consistent tests exist; the required background material for this chapter can be found in Chapter 1. The other exception is Section 4 of Chapter 2, which establishes the impossibility of discrimination between process distributions; this section is self-contained. The reader who is familiar with ergodic theory and finds the exposition in this volume somewhat unorthodox can find all the necessary links to the more familiar framework in Shields:96; the latter book is also recommended to anyone seeking a deeper understanding of such results as the slow convergence of frequencies and entropy estimates, the classic ergodic theorem and much more.

This book is organized as follows. Chapter 0 is introductory: besides providing some motivation for studying the problems addressed, it also introduces in an informal manner the main concepts used and the main results presented. Chapter 1 introduces the notation and definitions used in the subsequent chapters, as well as some necessary background material. Chapter 2 considers the most basic problems of statistical inference, on which the rest of the volume builds: estimating a distance between processes (the distributional distance) and the problem of homogeneity testing or process discrimination, which, crucially for the subsequent problems addressed, is shown to be impossible to solve in the general setting of this book. Chapter 3 is devoted to clustering and change-point problems, which can be solved, or, in some cases, can be shown to admit no solution, based on the result of the preceding chapter. Chapter 4 addresses the problem of hypothesis testing in its general form: studying which pairs of hypotheses admit a consistent test. Finally, Chapter 5 discusses various generalizations of the presented results, as well as some directions for future research.

Acknowledgements

Thanks to Léon Bottou for giving me the idea to write a book on this subject and for encouraging me to do it. Thanks to Boris Ryabko and Azadeh Khaleghi, in collaboration with whom some of the results presented here were obtained.

Santa Cruz de la Sierra Daniil Ryabko

0 Introduction
  1 Stationarity, ergodicity, AMS
  2 What is possible and what is not possible to infer from stationary processes
  3 Overview of the inference problems covered
1 Preliminaries
  1 Stationarity, ergodicity
  2 Distributional distance
2 Basic inference
  1 Estimating the distance between processes and reconstructing a process
  2 Calculating $\hat{d}$
  3 The three-sample problem
  4 Impossibility of discrimination
    1 Setup and definitions
    2 The main result
3 Clustering and change-point problems
  1 Time-series clustering
    1 Problem formulation
    2 A clustering algorithm and its consistency
    3 Extensions: unknown $k$, online clustering and clustering with respect to independence
      Unknown number of clusters
      Online clustering
      Clustering with respect to independence
  2 Change-point problems
    1 Single change point
    2 Multiple change points, known number of change points
    3 Unknown number of change points
      Listing change points
      Known number of distributions, unknown number of change points
4 Hypothesis testing
  1 Introduction
    1 Motivation and examples
  2 Types of consistency
    1 Uniform consistency
    2 Asymmetric consistency
    3 Asymptotic consistency
    4 Other notions of consistency
  3 One example that explains hypotheses testing
    1 Bernoulli i.i.d. processes
    2 Markov chains
    3 Stationary ergodic processes
  4 Topological characterizations
    1 Uniform testing
    2 Asymmetric testing
  5 Proofs
  6 Examples
    1 Simple hypotheses, identity or goodness-of-fit testing
    2 Markov and Hidden Markov processes: bounding the order
    3 Smooth parametric families
    4 Homogeneity testing or process discrimination
    5 Independence
  7 Open problems
    1 Relating the notions of consistency
    2 Characterizing hypotheses for which consistent tests exist
    3 Independence testing
5 Generalizations
  1 Other distances
    1 $\operatorname{sum}$ Distances
    2 Telescope distance
    3 $\operatorname{sup}$ Distances
    4 Non-metric distances
    5 AMS distributions
  2 Piece-wise stationary processes
  3 Beyond time series
    1 Processes over multiple dimensions
    2 Infinite random graphs

Chapter 0 Introduction

This book is about making statistical inference from discrete-time processes under what is perhaps the weakest of statistical assumptions: stationarity. Before embarking on this journey, it is worth asking the question of why it is interesting to study statistical problems under this assumption alone, or under similar related assumptions. To answer this question, one should first consider what it means to have a good set of assumptions, or a good model, for a statistical problem at hand.

Choosing the right assumptions presents the following trade-off. On the one hand, making strong assumptions makes the inference task easier and allows one to obtain stronger performance guarantees for the algorithms developed. For example, by assuming that the data are independent and identically distributed (i.i.d.), one gets at one’s disposal an extremely versatile statistical toolkit that is a result of centuries of research on this model. With this, it is possible to obtain sharp bounds on error probabilities of the resulting methods. Even stronger results can be obtained if one further makes parametric assumptions. On the other hand, all such results are useless if the assumptions made do not hold for the data at hand. Of course, one can try to apply a statistical test to the data in order to verify the validity of one or another model. This, however, only pushes back the problem, because to use a test one needs to make another set of assumptions, called the alternative. Indeed, it is not possible to test, based on data, whether the assumption $H_{0}$ holds versus that it does not hold. For example, it is not possible to test that the data are Gaussian i.i.d. versus the distribution of the data being anything else except Gaussian i.i.d. This is because the alternative “anything else” is too general and includes, for example, such distributions as the one that is concentrated precisely on the data available. It is, however, possible to design a test for the hypothesis “the data are Gaussian i.i.d.” versus “the data are i.i.d. but not Gaussian” or “the data are i.i.d.” versus “the distribution of the data is stationary.” In other words, it may be possible to test a set of assumptions $H_{0}$ versus an alternative set of assumptions $H_{1}$. The latter is typically much more general; in fact, one is interested in making it as general as possible. Nonetheless, the alternative hypothesis is still a set of assumptions.

And so we are back to the question of how one can select a model or a set of assumptions for the data one has. Here we need to admit that this question brings us outside of the realm of mathematics. The answer is simply that one should make assumptions that one can reasonably expect to hold based on the specifics of the target application. Thus, the assumptions should be qualitative, natural and simple (utterly unmathematical terms, but such is the problem). Otherwise, there is little hope of being able to say whether the model is adequate for any given application. Good examples are assumptions based on independence. Indeed, this must be one of the reasons why independent and identically distributed data are so widely studied: it is often possible to tell whether the application produces data that are independent or that are not independent. Other models that are based on independence are Markov chains and, more generally, Bayesian networks.

Unfortunately, there are not many alternatives to independence-based models. Thus, if the data are utterly and completely dependent, as perhaps are most of the data in the world, a statistician is a bit short of options. In such cases, one commonly resorts to a generalisation in the form of various mixing assumptions. These allow one to extend the tools and methods developed for i.i.d. data to the cases of carefully constrained dependence. However, mixing assumptions are neither verifiable against a general alternative (such as stationarity) nor, to say the least, easy to assess informally from the data.

Stationarity is perhaps the only general non-parametric model that is not based on independence, and which is also qualitative, natural and simple to assess from data. Next we take a brief and informal look at stationarity and associated concepts.

1 Stationarity, ergodicity, AMS

Very informally, assuming that the data are stationary means assuming that the time index itself bears no information. Thus, it does not matter whether the data we see are $X_{0},X_{1},\dots$ or they are in fact $X_{100},X_{101},\dots$. I.i.d. data obviously satisfy this assumption, as do, with some minor tweaks to be discussed below, most other models in wide use, such as Markov chains. Thus, stationarity may be used as an alternative hypothesis for testing other models. It is also suited for the cases when one knows next to nothing about the data, and thus wishes to make as few assumptions as possible. In fact, the assumption is so general that one wonders whether any inference is possible under stationarity alone. Indeed, if any inference is possible at all, it is due to the associated property of ergodicity.

A process is ergodic if the frequency of every finite-time event almost surely converges to a constant. Thus, for binary-valued processes, the frequency of any word, such as 0, 01, or 011010, converges to some constant. We cannot say anything about the speed of this convergence, but the asymptotic property is already enough to make inference. The ergodic decomposition theorem establishes that every stationary process is a mixture of processes that are stationary and ergodic. Thus, a stationary process can be thought of as, first, before we start observing the data, drawing a stationary ergodic process (according to some prior distribution over such processes) and then using this stationary ergodic process to generate the data. To put it more simply: whenever we observe a stationary process, we observe, in fact, a stationary ergodic process. Thus, for most practical as well as many theoretical considerations, a stationary process is a stationary ergodic process.
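The two ideas in this paragraph, word frequencies and the "draw an ergodic component first" picture, can be illustrated with a minimal Python sketch. The function names, the two Bernoulli parameters and the uniform prior are assumptions chosen for illustration only; they do not come from the book. The frequency of a word is computed over all sliding windows of the sample, and is taken to be 0 when the sample is shorter than the word.

```python
import random

def word_frequency(x, word):
    """Fraction of the length-len(word) sliding windows of x that equal `word`
    (0 if x is shorter than the word)."""
    k, n = len(word), len(x)
    if n < k:
        return 0.0
    word = tuple(word)
    hits = sum(1 for i in range(n - k + 1) if tuple(x[i:i + k]) == word)
    return hits / (n - k + 1)

def sample_mixture(n, rng, components=(0.2, 0.8)):
    """Hypothetical stationary, non-ergodic process: first draw one i.i.d.
    Bernoulli(p) component uniformly at random, then emit n symbols from it."""
    p = rng.choice(components)
    return p, [1 if rng.random() < p else 0 for _ in range(n)]

rng = random.Random(0)
p, x = sample_mixture(100_000, rng)
# Within one realization the frequencies settle near the values of the drawn
# component: about p for the word "1" and about (1 - p) * p for the word "01".
print(p, word_frequency(x, (1,)), word_frequency(x, (0, 1)))
```

Across different seeds the limiting frequency of the word 1 is 0.2 or 0.8, so it is not a constant for the mixture, yet any single realization behaves like a sample from one ergodic component.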

Note that an ergodic process does not have to be stationary. A good example of an ergodic non-stationary process is a finite-state connected Markov chain with an initial distribution on the states that is different from the stationary distribution. Asymptotically, this process is equivalent to the Markov chain with the initial distribution taken to be the stationary distribution. One can take mixtures of ergodic processes, obtaining processes that are called asymptotically mean stationary or AMS. An AMS process is such that the frequencies of all finite-time events converge almost surely (but not necessarily to a constant). Since the definition of an ergodic process only involves its asymptotic properties, all the inference one can make about such processes concerns their asymptotic behaviour. In this (asymptotic) sense, similar to stationary processes, an AMS process can be thought of, very roughly, as first drawing an ergodic process (according to some prior distribution over such processes) and then using this ergodic process to generate the data. Again, for most purposes AMS processes are ergodic processes. In turn, ergodic processes are a certain generalisation of stationary ergodic processes: as in the Markov-chain example, they are asymptotically equivalent. Another example one can think of is taking a realization of a stationary ergodic process and adding some arbitrary prefix to it, or doing to it anything else that does not affect asymptotic frequencies.
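The Markov-chain example can be checked numerically with a short sketch. The two-state chain, its transition probabilities and the function names below are illustrative assumptions. The chain is started deterministically in state 1, so it is not stationary, yet the observed frequency of state 0 approaches the stationary probability.

```python
import random

def stationary_distribution(a, b):
    """Stationary distribution (pi0, pi1) of the two-state chain
    with P(0 -> 1) = a and P(1 -> 0) = b."""
    return b / (a + b), a / (a + b)

def simulate_chain(n, a, b, start, rng):
    """Run the chain for n steps from a fixed (non-stationary) start state."""
    state, path = start, []
    for _ in range(n):
        path.append(state)
        flip_probability = a if state == 0 else b
        if rng.random() < flip_probability:
            state = 1 - state
    return path

a, b = 0.1, 0.3  # hypothetical transition probabilities
pi0, pi1 = stationary_distribution(a, b)
path = simulate_chain(200_000, a, b, start=1, rng=random.Random(0))
freq0 = path.count(0) / len(path)
print(pi0, freq0)  # the empirical frequency of state 0 is close to pi0
```

Replacing `start=1` with a state drawn from `(pi0, pi1)` gives the stationary version of the same chain; the two versions differ only in their initial segment, which does not affect asymptotic frequencies.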

It is worth emphasizing that, with the exception of stationarity itself, all the definitions we are using only tell us something about asymptotic properties of a process; moreover, for the purposes of statistical inference that we shall be exploring, stationarity can only be used in conjunction with or via ergodicity (which, fortunately, can always be presumed via the ergodic decomposition theorem mentioned). Therefore, any results we should expect shall also be about asymptotic properties of the algorithms that we shall construct.

Thus, there will be little difference for us in the course of this volume between ergodic processes and stationary ergodic processes, and between stationary processes and AMS processes. In fact, most of the results of this book do not require any other assumption than AMS or ergodicity. Thus, they can be thought of as answering the question:

Stationarity-based statistical inference: What statistical inference can one make under the only assumption that frequencies converge, without any guarantees on the speed of this convergence?

One exception is Chapter 4, where we do need our processes to be stationary, and use some deeper results of ergodic theory. The other exception is the impossibility result concerning process discrimination (along with its implications), which applies to an even smaller class of processes; since it is an impossibility result, this makes it stronger.

The main difference between the problems of statistical inference addressed in this volume and those studied in the vast majority of statistical literature is the lack of any guarantees on the speed of convergence that one can use.

In contrast, independence-based methods rely heavily on concentration-of-measure results that are used to bound the speed of convergence and, consequently, open the possibility to obtain finite-time bounds on the error of the resulting algorithms. In fact, the (conditional) independence assumptions are typically not used directly but rather through concentration of measure results. Mixing assumptions provide a generalisation that allows one to forego independence but still use the corresponding speed of convergence guarantees. Thus, one can think of independence-based models and their generalizations as studying the following general question:

Independence-based statistical inference: What statistical inference can one make under the assumption that frequencies converge and the speed of this convergence can be bounded?

We shall see in this book that the difference between these two general questions is smaller than one might think, but sometimes the contrast between what is possible and what is not possible to do without any speeds of convergence is rather striking and even counter-intuitive.

2 What is possible and what is not possible to infer from stationary processes

It appears that, with the exception of the problem of probability forecasting to be mentioned below in this section, the prevailing view in the literature is that assuming only that a process is stationary and ergodic is not enough to make statistical inference. This view may stem in part from the rather influential 1990 paper Ornstein:90 by Ornstein and Weiss. This paper is full of deep and insightful results about $B$-processes, which is a set of processes smaller than that of stationary ergodic processes, but is rather dismissive of the general case. In particular, it makes statements such as “In general, one cannot hope to guess the long-term behaviour from finite information” (referring to the non-$B$ case) and “If a totally ergodic process is not $B$, then it cannot be approximated arbitrarily well by $k$-step Markov processes.” The work Ornstein:90 goes further in this direction when it considers the problem of discrimination between two processes. This problem, also known as homogeneity testing, consists in telling, given two finite samples whose length, in this setting, is allowed to grow to infinity, whether they were generated by the same or different process distributions. It is stated in Ornstein:90 that, outside of the class of $B$-processes, even this simple “yes-no” question of “same-different” cannot be answered in an effective way. However, the example used to demonstrate this statement only shows that it is not possible to estimate a certain distance, called the $\bar{d}$-distance, between stationary ergodic processes that are not $B$. This is a rather different statement, and one made about a different problem: indeed, in order to answer the “same-different” question, one might try to estimate any other distance or, more generally, use any algorithm whatsoever. Thus, the statement made in Ornstein:90 about the problem of discrimination can at most be considered a conjecture. The distance is also crucial to understanding the previous statements made: it is not possible to approximate a stationary ergodic process with $k$-step Markov processes in $\bar{d}$-distance, or to construct any other estimate of such a process that would be asymptotically consistent in terms of this distance.

The picture changes dramatically if we change the distance between processes that we are trying to estimate. As we are going to see in this volume, using a different distance, it is possible to construct asymptotically consistent estimates of the distribution of an arbitrary stationary ergodic process, as well as to solve a variety of other interesting statistical problems. The distance we are going to use is well known, but had somehow remained largely unused. Gray Gray:88 calls it distributional distance, and this is the name we shall use here, despite its apparent ambiguity: indeed, it may seem to refer to any distance between distributions. As for the problem of discrimination between process distributions, it turns out that indeed, as conjectured by Ornstein and Weiss Ornstein:90, it does not admit a solution if we only assume that the distributions are stationary ergodic. Interestingly, the same impossibility result holds for the smaller class of $B$-processes as well, for which it is possible to estimate the $\bar{d}$-distance, as shown in the same work Ornstein:90. Thus, no amount of data may be sufficient to answer the simple “same-different” question about two process distributions. This result is formally demonstrated in Section 4 of Chapter 2.

Since these two problems, distance estimation and discrimination between processes, are crucial for the development of the material presented here, let us look at them in some more detail.

Recall that one distance (or a metric; all distances considered in this volume are metrics unless stated otherwise) is weaker, in the topological sense, than another if every sequence [1] that converges in the latter converges in the former, but the opposite does not necessarily hold. Thus, it is “easier” for a sequence to converge in a weaker distance, which makes it easier to construct a sequence of estimates of a process that converges to this process. Likewise, given two data sequences, a weaker distance between the process distributions that generate these sequences is easier to estimate. The distributional distance is weaker, in the topological sense, than the $\bar{d}$-distance [2]. It is thus reasonable to expect that the former can be estimated for a larger class of processes than the latter. Indeed, as is shown in this volume, the distributional distance can be estimated for stationary ergodic processes, while, as is shown in Ornstein:90, the $\bar{d}$-distance can be estimated for the smaller set of $B$-processes but not for stationary ergodic processes. The strongest possible distance is the discrete 0-1 distance, which takes the value 0 if and only if two distributions are the same and 1 otherwise. It is this distance that we are trying to estimate when answering the “same-different” question of process discrimination. It should thus come as no surprise that it is not possible to estimate it even for $B$-processes, even though it is possible to estimate it for smaller classes, such as, for example, i.i.d. processes. For many different problems, however, it is enough to have consistent estimates of at least some distance between process distributions, and thus it makes sense to prefer weaker distances, since this allows one to consider wider sets of processes. We shall review shortly which problems of inference can be solved using consistent estimates of distributional distance (or, indeed, of any distance between process distributions).

[1] Here we are only concerned with separable metric spaces.

[2] To make complete sense of this sentence, we would need to define the distances formally first, which is done in the next chapter. We shall see that the definition of the distributional distance is ambiguous: it depends on a set of parameters, changing which may change the resulting topology. However, it is possible to make this statement formally correct.
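To make the idea of estimating a distance from two samples concrete, here is a minimal sketch of an empirical distributional distance for finite-alphabet samples. It sums, over word lengths k up to a cut-off, the absolute differences between the word frequencies in the two samples, weighted by w_k = 1/(k(k+1)). The cut-off `k_max`, the function names and the restriction to finite alphabets are assumptions made for this illustration; how the cut-off should grow with the sample size is not addressed here.

```python
def window_frequencies(x, k):
    """Empirical frequency of every length-k word occurring in x
    (an empty dict if x is shorter than k)."""
    n = len(x)
    if n < k:
        return {}
    total = n - k + 1
    counts = {}
    for i in range(total):
        word = tuple(x[i:i + k])
        counts[word] = counts.get(word, 0) + 1
    return {word: c / total for word, c in counts.items()}

def empirical_distance(x, y, k_max):
    """Sum over k = 1..k_max of w_k * sum over words B of |nu(x, B) - nu(y, B)|,
    with w_k = 1 / (k * (k + 1)). Words absent from both samples contribute 0,
    so it suffices to sum over the words that actually occur."""
    distance = 0.0
    for k in range(1, k_max + 1):
        fx, fy = window_frequencies(x, k), window_frequencies(y, k)
        w_k = 1.0 / (k * (k + 1))
        distance += w_k * sum(
            abs(fx.get(word, 0.0) - fy.get(word, 0.0)) for word in set(fx) | set(fy)
        )
    return distance

x = [0, 1, 0, 1, 0, 1, 0, 1]
y = [0, 0, 1, 1, 0, 0, 1, 1]
print(empirical_distance(x, x, k_max=3), empirical_distance(x, y, k_max=3))
```

In this example both samples have the same single-symbol frequencies, so the difference between them only shows up from word length 2 onward, which is why longer words are included in the sum.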

Taking a different look at the problem of process discrimination, one can see that it is linked to another fundamental impossibility result: the impossibility of establishing the speed of convergence, say, of frequencies. The way we have defined ergodic processes, as all processes for which frequencies converge a.s. to a constant, makes it evident that this convergence may be arbitrarily slow, so there is no guarantee on the speed. It is not so evident that such a guarantee does not exist if we consider the set of all stationary ergodic processes (that is, adding the requirement of stationarity). The proof of the fact that indeed the convergence of frequencies can be arbitrarily slow for stationary ergodic processes can be found, for example, in the excellent monograph Shields:96, which also demonstrates the equivalence of the (unorthodox) definition that we adopt here to the more common one formulated in terms of shift-invariant sets. Imagine now an algorithm that tries to solve the discrimination problem based on (consistent) estimates of some distance. It makes these estimates based on samples of larger and larger size $n$. Suppose that these estimates keep approaching 0, let us say, exponentially with $n$. At some point one should reasonably expect the algorithm to say that the samples were generated by the same distribution. Suppose the estimated distance at this point is $\varepsilon$. From this point on, imagine that, as the sample size $n$ continues to grow, the estimate does not decrease at all but just stays $\varepsilon$. Then, at some point, we should expect the algorithm to change its mind and to say that the samples were generated by different distributions. At that point, the estimates start decreasing again. Since there is no guarantee on the speed of convergence (of anything), there is no way to ensure that the behaviour outlined cannot happen. In fact, the proof of the impossibility result is based on constructing, for any algorithm that presumably solves the problem of discrimination (and that may or may not be based on distance estimates), a process that tricks it into changing its mind ad infinitum in this fashion.

More generally, from the discussion above on the absence of speed of convergence guarantees, it should already be clear that:

Every algorithm that we may construct shall only have asymptotic performance guarantees in the considered setting. No finite-time bounds on the probability of error are possible.

From the practical point of view this is not in itself a hindrance: what the fact that a result is asymptotic means, in practice, is that it holds when the data samples are large enough.

The only exception, where we do obtain results about what happens at every time step, is hypothesis testing. Here one may wish to invert the question, by asking for which process distributions we can have a certain level of error at a certain finite time. These questions are considered in Chapter 4.

Having outlined the general framework and the main impossibility results, let us now briefly review the highlights of what is possible to achieve for stationary or stationary and ergodic processes.

Perhaps the one important problem concerning stationary processes that has not been deemed too difficult to solve, and has thus gained a fair bit of attention in the literature, is the problem of prediction or probability forecasting. It consists in forecasting the probability of the next outcome $X_{n+1}$ conditional on the past observations $X_{1},\dots,X_{n}$ , where the sequence $X_{1},\dots,X_{n},X_{n+1},\dots$ is generated by an unknown stationary (ergodic) process distribution. This problem is of great practical importance, not least because it is intimately connected to the problem of data compression. Ample literature on this problem and its variations exists, which is why we do not cover it in this volume. This literature goes as far back as Ornstein:78 for the prediction with the growing past problem, and includes BRyabko:88 that solves the forward-prediction problem for finite-alphabet processes, Algoet:92 for real-valued processes, as well as BRyabko:09 ; Morvai:96 ; Morvai:97 ; BRyabko:16 and others.

The problems covered in this volume are outlined in the next section.

3 Overview of the inference problems covered

The first group of problems considered are those that are based directly on estimating a distance between process distributions. Since we have an asymptotically consistent estimator of the distributional distance, we can answer questions of the form: given three samples ${\bf x}=(X_{1},\dots,X_{n})$ , ${\bf y}=(Y_{1},\dots,Y_{m})$ , ${\bf z}=(Z_{1},\dots,Z_{l})$ , say whether the distribution of the process that generates ${\bf z}$ is closer to the distribution of ${\bf x}$ or to the one of ${\bf y}$ . The answer will be correct as long as the samples are long enough (that is, asymptotically correct). Some forms of this problem are known as process classification or the three-sample problem, and this is an example of a problem that we can solve. It generalizes to the problem of clustering: given $N$ samples generated by $k$ different, unknown, stationary ergodic distributions, cluster them into $k$ groups according to the distribution that generates them. Note that this problem can only be solved if $k$ is known. Indeed, the problem of discrimination corresponds to clustering just two samples, but with $k$ unknown (either 1 or 2), and already this case, as we have seen, has no solution.

The next problem to consider is change-point estimation. A sample

$$X_{1},\dots,X_{k},X_{k+1},\dots,X_{n}$$

is the concatenation of two samples $X_{1},\dots,X_{k}$ and $X_{k+1},\dots,X_{n}$ generated by different stationary ergodic distributions. It is required to find or to approximate the change point $k$ . This is possible to do with an algorithm that essentially outputs the point that maximizes the estimated distance between what is before and after it in the sample. On the other hand, the related problem of change-point detection, which consists in saying whether the sample is generated by the same distribution or there is a change of distribution somewhere, admits no solution. A generalisation of these problems to the case of multiple change points presents a delicate interplay between what is possible and what is not. We only briefly review the corresponding results in this volume (Section 2), referring the interested reader to the papers that present the full proofs Khaleghi:14 ; Khaleghi:15chp ; Khaleghi:12mchp .
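The idea of maximizing the estimated distance between the two parts of the sample can be sketched as follows for a finite alphabet. This is only an illustration under stated assumptions, not the algorithm analysed in the cited papers: the function names, the weights $w_{k}=1/(k(k+1))$ , the word-length cut-off of order $\log n$ , and the margin `alpha` (which keeps both segments from being too short) are all hypothetical choices.

```python
from collections import Counter
from math import ceil, log2

def word_freqs(x, k):
    """Empirical frequencies of all length-k words occurring in x."""
    total = len(x) - k + 1
    if total <= 0:
        return {}
    counts = Counter(tuple(x[i:i + k]) for i in range(total))
    return {w: c / total for w, c in counts.items()}

def d_hat(x, y, k_max):
    """Empirical distributional distance truncated at word length k_max.
    The weights w_k = 1/(k(k+1)) are an assumed choice that sums to 1."""
    dist = 0.0
    for k in range(1, k_max + 1):
        fx, fy = word_freqs(x, k), word_freqs(y, k)
        diff = sum(abs(fx.get(w, 0.0) - fy.get(w, 0.0)) for w in fx.keys() | fy.keys())
        dist += diff / (k * (k + 1))
    return dist

def estimate_change_point(x, alpha=0.2):
    """Return t maximizing d_hat(x[:t], x[t:]) over t in [alpha*n, (1-alpha)*n]."""
    n = len(x)
    k_max = max(1, ceil(log2(n)))  # same cut-off for every candidate t
    lo, hi = max(1, int(alpha * n)), min(n - 1, int((1 - alpha) * n))
    return max(range(lo, hi + 1), key=lambda t: d_hat(x[:t], x[t:], k_max))

sample = [0] * 30 + [1] * 30
estimate = estimate_change_point(sample)
```

On this toy sample the two halves are as different as they can be, so the maximum is attained exactly at the true change point; for real data the estimate is only guaranteed to be good asymptotically.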

As discussed above, one of the main reasons to study such general models as stationarity is to be able to use them as an alternative hypothesis in order to verify the validity of a smaller model. Thus, one may wish to test a hypothesis $H_{0}$ , which is a subset of the set of all stationary ergodic process distributions, against its complement to this set, or against a different subset of it. For example, testing $H_{0}=$ “the process is i.i.d.” versus $H_{1}=$ “the process is stationary ergodic and not i.i.d.” As we have seen above, some rather simple hypotheses, such as process discrimination (known in the context of hypothesis testing as the hypothesis of homogeneity: a hypothesis about a pair of processes that states that they have the same distribution) do not admit a consistent test, even in a very weak asymptotic sense. Yet, as we shall see, some other hypotheses of practical significance, such as that the process is i.i.d. or that it is Markov, do admit a consistent test against the complement to the set of all stationary ergodic processes. Thus, it appears interesting to study the general question of which hypotheses do and which do not admit a consistent test. This is what we do in Chapter 4. The main result is a topological “if and only if” criterion for the existence of a consistent test of an arbitrary subset of the set of all stationary ergodic processes against its complement. At the same time, a number of important and interesting questions remain open. In particular, this is the only chapter where we restrict the consideration to finite-alphabet processes, leaving the general case open for further research. Some of the interesting open problems related to hypothesis testing are presented at the end of Chapter 4, while some more general ones are deferred to Chapter 5, which is devoted to generalizations.

Chapter 1 Preliminaries

To simplify the exposition, we are considering (stationary ergodic) processes with the alphabet $A=\mathbb{R}$ or, in some cases, a finite set $A$ . The generalization from $A=\mathbb{R}$ to $A=\mathbb{R}^{d}$ is straightforward; moreover, the results can be extended to the case when $A$ is a Polish (complete separable metric) space. The symbol $A^{*}$ is used for $\cup_{i=1}^{\infty}A^{i}$ . Elements of $A^{*}$ are called words or sequences.

Let ${\mathcal{B}}_{n}$ be the Borel sigma-algebra of $A^{n}$ , and ${\mathcal{B}}_{\infty}$ the Borel sigma-algebra of $A^{\infty}$ . Let also ${\mathcal{B}}=\cup_{n=1}^{\infty}{\mathcal{B}}_{n}$ .

Time-series distributions, process distributions or simply processes are probability measures on $(A^{\infty},{\mathcal{B}}_{\infty})$ .

We will be speaking about samples, typically denoted ${\bf x},{\bf y}$ or ${\bf z}$ , taking values in $A^{*}$ . This is a short-hand notation for expressions like $X_{1..n}=(X_{1},\dots,X_{n})$ or $Y_{1..k}=(Y_{1},\dots,Y_{k})$ where $n=|{\bf x}|$ and $k=|{\bf y}|$ are lengths of the samples. The samples that we shall be considering are to be generated by process distributions, usually stationary or stationary ergodic, typically denoted $\rho$ or $\rho_{\bf x}$ , $\rho_{\bf y}$ (or other Greek letters) to make clear which sample they generate. This means that, say, $\rho_{\bf x}$ is a (stationary ergodic) probability distribution over $(A^{\infty},{\mathcal{B}}_{\infty})$ , and thus we are speaking about an $A^{\infty}$ -valued random variable $(X_{1},\dots,X_{n},\dots)$ of which ${\bf x}=(X_{1},\dots,X_{n})$ is the initial segment of length $n$ .

Definition 0.1.

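For a sequence ${\bf x}=X_{1..n}$ taking values in $A^{n}$ and a measurable $B\subset A^{k}$ with $k\in\mathbb{N}$ , denote by $\nu({\bf x},B)$ the frequency with which the sequence ${\bf x}$ falls in the set $B$ :

$$\nu({\bf x},B):=\begin{cases}\frac{1}{n-k+1}\sum_{i=1}^{n-k+1}I_{\{(X_{i},\dots,X_{i+k-1})\in B\}}&\text{if }n\geq k,\\0&\text{otherwise.}\end{cases}$$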

For example,

$$\nu\big((0.5,1.5,1.2,1.4,2.1),\,[1.0,2.0]\times[1.0,2.0]\big)=1/2.$$
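As a minimal sketch of Definition 0.1, the following Python function computes the fraction of length-$k$ windows of a sample that fall in a set $B$ . The names `nu` and `in_box` and the representation of $B$ by a membership predicate are illustrative assumptions, as are the normalisation by the number of windows $n-k+1$ and the value 0 for samples shorter than $k$ .

```python
from typing import Callable, Sequence

def nu(x: Sequence[float], in_B: Callable[[Sequence[float]], bool], k: int) -> float:
    """Fraction of the length-k windows of x that fall in B (0 if x is shorter than k)."""
    windows = len(x) - k + 1
    if windows <= 0:
        return 0.0  # assumed convention for samples shorter than k
    hits = sum(1 for i in range(windows) if in_B(x[i:i + k]))
    return hits / windows

def in_box(window: Sequence[float]) -> bool:
    """Membership test for a box with all coordinates in [1.0, 2.0]."""
    return all(1.0 <= v <= 2.0 for v in window)

sample = (0.5, 1.5, 1.2, 1.4, 2.1)
# Windows of length 2: (0.5,1.5), (1.5,1.2), (1.2,1.4), (1.4,2.1); two of them lie in the box.
freq = nu(sample, in_box, k=2)
```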

1 Stationarity, ergodicity

A process $\rho$ is stationary if for any $i,j\in\mathbb{N}$ and $B\in\mathcal{B}_{j}$ , we have

$$\rho(X_{1..j}\in B)=\rho(X_{i..i+j-1}\in B).$$

A process $\rho$ is called ergodic if for every $B\in\mathcal{B}$ there exists a constant $v_{B}$ such that with probability 1 we have

$$\lim_{n\rightarrow\infty}\nu(X_{1..n},B)=v_{B}.$$

A process is called stationary ergodic if it is stationary and ergodic. The following statement follows from the ergodic theorem.

Theorem 1.1 (ergodic theorem).

For every stationary ergodic process $\rho$ and every $B\in\mathcal{B}$ , we have

$$\lim_{n\rightarrow\infty}\nu(X_{1..n},B)=\rho(B)\ \text{ a.s.}$$

The proof of the ergodic theorem can be found, for example, in Gray:88 ; Shields:96 . The latter monograph also provides the connection to the more traditional way of defining ergodicity (in terms of shift-invariant sets); in particular, it demonstrates that the two approaches are equivalent.

The symbol $\mathcal{S}$ is used for the set of all stationary processes on $A^{\infty}$ , and the symbol $\mathcal{E}$ for the set of all stationary ergodic processes.

The set of all process distributions $\mathcal{P}$ over $A^{\infty}$ can be endowed with the structure of probability space $(\mathcal{P},\mathcal{B}_{\mathcal{P}})$ where $\mathcal{B}_{\mathcal{P}}$ can be taken to be the Borel sigma-algebra with respect to the distributional distance defined in Section 2 below.

The link between stationary and stationary ergodic processes is provided by the so-called ergodic decomposition theorem, which states that every stationary process is a mixture of stationary ergodic processes.

Theorem 1.2 (Ergodic decomposition).

For any $\rho\in\mathcal{S}$ there is a measure $W_{\rho}$ on $(\mathcal{P},\mathcal{B}_{\mathcal{P}})$ , such that $W_{\rho}(\mathcal{E})=1$ and

$$\rho(B)=\int_{\mathcal{E}}dW_{\rho}(\mu)\,\mu(B)$$

for every $B\in\mathcal{B}$ .

Furthermore, a process is called asymptotically mean stationary, or AMS for short, if, for every $B\in\mathcal{B}$ , the frequency of $B$ converges with probability 1. These limiting frequencies define the stationary measure $\bar{\rho}$ , which, according to the preceding theorem, admits an ergodic decomposition. Asymptotically, $\rho$ and $\bar{\rho}$ are equivalent, and thus there will be little distinction between the two for us in this volume. For a detailed exposition of these results the reader is referred to Gray:88 , in particular to (Gray:88, Theorem 7.4.1) that establishes ergodic decomposition for AMS processes.

2 Distributional distance

The general definition of the distributional distance is as follows.

Definition 2.1 (distributional distance).

Let $(B_{k})_{k\in\mathbb{N}}$ be a set of finite-time events, $B_{k}\in\mathcal{B}$ for each $k\in\mathbb{N}$ , that generates $\mathcal{B}_{\infty}$ , and let $(w_{k})_{k\in\mathbb{N}}$ be a sequence of positive reals such that $\sum_{k\in\mathbb{N}}w_{k}=1$ . For a pair of processes $\rho_{1},\rho_{2}$ the distributional distance $d(\rho_{1},\rho_{2})$ is defined as

$$d(\rho_{1},\rho_{2}):=\sum_{k\in\mathbb{N}}w_{k}|\rho_{1}(B_{k})-\rho_{2}(B_{k})|.\tag{2}$$

Note that there are two sets of parameters in this definition, $B_{k}$ and $w_{k}$ , which we shall now specify. Let us first fix

$$w_{k}:=\frac{1}{k(k+1)}.$$

The choice of the sets $B_{k}$ is more significant. Different choices may result in different topologies. In particular, some choices of $B_{k}$ make the set of all process distributions $\mathcal{P}$ compact with the topology of the distributional distance $d$ . This is the case if the set $B_{k},k\in\mathbb{N}$ is a standard basis of ${\mathcal{B}}_{\infty}$ . While there is a standard basis for ${\mathcal{B}}_{\infty}$ in the case of $A=\mathbb{R}$ , unfortunately, as Gray Gray:88 notes, there is no easy construction for such a basis even for the space of reals $(\mathbb{R},{\mathcal{B}})$ . In this volume, we shall not make much use of the notion of standard basis, but it will be important for us to have empirical estimates of the distributional distance. Therefore, we shall fix a specific choice of the sets $B_{k}$ for the case of discrete alphabets and for the case $A=\mathbb{R}$ (which is easily generalisable to $A=\mathbb{R}^{d}$ , $d\in\mathbb{N}$ ); we shall also make the definition of the distributional distance more specific reflecting these choices.

Definition 2.2 (Distributional distance for finitely-valued processes).

Let the alphabet $A$ be finite. Define

$$d(\rho_{1},\rho_{2}):=\sum_{k=1}^{\infty}w_{k}\sum_{B\in A^{k}}|\rho_{1}(B)-\rho_{2}(B)|.\tag{4}$$

While equivalent to the general one, this more specific formulation is better suited for constructing practical algorithms: we take the differences in probabilities of each word of length $k$ , and then take a weighted sum over all $k\in\mathbb{N}$ .

For real-valued processes, we shall fix the usual set of cylinders to put in the distributional distance. Consider the sets $B^{m,l},m,l\in\mathbb{N}$ which are obtained via the partitioning of $A^{m}$ into cubes of dimension $m$ and volume $2^{-ml}$ , starting at the origin, and enumerated clockwise in each direction.

Definition 2.3 (Distributional distance for real-valued processes).

Let $A=\mathbb{R}$ . Define

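$$d(\rho_{1},\rho_{2}):=\sum_{m,l=1}^{\infty}w_{m}w_{l}\sum_{B\in B^{m,l}}|\rho_{1}(B)-\rho_{2}(B)|.\tag{5}$$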

The general formulation (2) is more compact and thus more convenient for the theoretical analysis; we shall therefore use it in the proofs, while still assuming the concrete choice of the parameters $w_{k}$ and $B_{k}$ whenever necessary. The more specific formulations (4) and (5) are more convenient for constructing algorithms and empirical estimates.

Note, however, that the definition (5) is not exactly equivalent to the general definition (2). Indeed, each of the sets $B^{m,l}$ is infinite, and all the individual sets inside of $B^{m,l}$ are assigned the same weight $w_{m}w_{l}$ . This is not a problem, since the total $\rho_{1}$ - as well as $\rho_{2}$ - probability of all the sets in $B^{m,l}$ is 1. Indeed, it is a simple exercise to check that the proofs in the subsequent chapters go through for either of the definitions. We therefore take the liberty to use the definition (2) in the proofs, but refer to the more-specified definition (5) when speaking about the algorithms. The unconvinced reader may note that the sets $B^{m,l}$ can be made finite but growing with $l$ , i.e., defined so as to cover growing parts of the space $A^{m}$ with finer partitions, leaving all the rest of the space as a single element of the space $B^{m,l}$ . This way, the partitions become finite and the triple sum in (5) can be converted back to the single sum in (2) with a different choice of the weights $w_{k}$ .

It is easy to see that $d$ is a metric (with any choice of the parameters). When talking about closed and open subsets of $\mathcal{S}$ we assume the topology of $d$ . With this topology, the space $\mathcal{P}$ of process distributions is separable. The set $\mathcal{S}$ of stationary distributions is its closed subset. In addition, for the case of finite-valued alphabets, the sets $\mathcal{P}$ and $\mathcal{S}$ are complete and compact. (The general result (Gray:88, Lemmas 8.2.1, 8.2.2) says that $\mathcal{P}$ is complete and compact in case the generating set $(B_{k})_{k\in\mathbb{N}}$ is standard; this is the case in our definition (4) but not in (5).) Proofs of these facts can be found in Gray:88 .

Chapter 2 Basic inference

In this chapter we consider some basic problems of statistical inference that underlie the rest of the problems addressed in this volume. Namely, we shall see that the distributional distance can be estimated empirically, and consider some immediate implications of this fact. On the other hand, it is shown that there is no asymptotically consistent solution to the problem of discrimination (homogeneity testing) for stationary ergodic processes.

The main results of the chapter can be summarized as follows.

• The distributional distance between stationary ergodic processes can be estimated consistently.

• There is no consistent discrimination procedure for stationary ergodic processes: no matter how long the sequences are, it is not possible to say whether they were generated by the same or different distributions.

• Based on the estimates of the distributional distance, one can solve the three-sample problem: say which two of the given three samples were generated by the same distribution.

1 Estimating the distance between processes and reconstructing a process

The main building block of the approach presented in this book is the rather simple fact that the distributional distance can be estimated empirically, simply replacing unknown probabilities with frequencies. The resulting estimate is asymptotically consistent for arbitrary stationary ergodic processes.

Definition 1.1 (empirical distributional distance).

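For samples ${\bf x},{\bf y}\in A^{*}$ , define the empirical distributional distance $\hat{d}({\bf x},{\bf y})$ as

$$\hat{d}({\bf x},{\bf y}):=\sum_{k\in\mathbb{N}}w_{k}|\nu({\bf x},B_{k})-\nu({\bf y},B_{k})|.\tag{1}$$

Similarly, we can define the empirical distance when only one of the process measures is unknown:

$$\hat{d}({\bf x},\rho):=\sum_{k\in\mathbb{N}}w_{k}|\nu({\bf x},B_{k})-\rho(B_{k})|,\tag{2}$$

where $\rho\in{\mathcal{E}}$ and ${\bf x}\in A^{*}$ .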

The following lemma establishes consistency of these estimates.

Lemma 1.2.

Let two samples ${\bf x}=(X_{1},\dots,X_{k})$ and ${\bf y}=(Y_{1},\dots,Y_{m})$ be generated by stationary ergodic processes $\rho_{\bf x}$ and $\rho_{\bf y}$ respectively. Then

(i) $\lim_{k,m\rightarrow\infty}\hat{d}({\bf x},{\bf y})=d(\rho_{\bf x},\rho_{\bf y})\ \text{ a.s.}$

(ii) $\lim_{k\rightarrow\infty}\hat{d}({\bf x},\rho_{\bf y})=d(\rho_{\bf x},\rho_{\bf y})\ \text{ a.s.}$

Proof 1.3.

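For any $\varepsilon>0$ we can find an index $J$ such that $\sum_{i=J}^{\infty}w_{i}<\varepsilon/2$ . Moreover, by the ergodic theorem, for each $j$ we have $\nu((X_{1},\dots,X_{k}),B_{j})\rightarrow\rho_{\bf x}(B_{j})$ a.s., so that, with probability 1,

$$|\nu((X_{1},\dots,X_{k}),B_{j})-\rho_{\bf x}(B_{j})|<\varepsilon/(4Jw_{j})$$

from some step $k$ on; denote this step by $K_{j}$ . Let $K:=\max_{j<J}K_{j}$ ( $K$ depends on the realization $X_{1},X_{2},\dots$ ). Define analogously $M$ for the sequence $(Y_{1},\dots,Y_{m},\dots)$ . Each summand of $\hat{d}({\bf x},{\bf y})-d(\rho_{\bf x},\rho_{\bf y})$ is at most $w_{i}$ in absolute value, so the terms with $i\geq J$ contribute less than $\varepsilon/2$ in total. Thus, for $k>K$ and $m>M$ we have

$$\begin{aligned}|\hat{d}({\bf x},{\bf y})-d(\rho_{\bf x},\rho_{\bf y})|&=\Big|\sum_{i=1}^{\infty}w_{i}\big(|\nu({\bf x},B_{i})-\nu({\bf y},B_{i})|-|\rho_{\bf x}(B_{i})-\rho_{\bf y}(B_{i})|\big)\Big|\\&\leq\sum_{i=1}^{J-1}w_{i}\big(|\nu({\bf x},B_{i})-\rho_{\bf x}(B_{i})|+|\nu({\bf y},B_{i})-\rho_{\bf y}(B_{i})|\big)+\varepsilon/2\\&\leq\sum_{i=1}^{J-1}w_{i}\Big(\frac{\varepsilon}{4Jw_{i}}+\frac{\varepsilon}{4Jw_{i}}\Big)+\varepsilon/2\leq\varepsilon,\end{aligned}$$

which proves the first statement. The second statement can be proven analogously.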

Note that the second statement of the lemma implies that a stationary ergodic process (or an ergodic component of a stationary process) can be asymptotically reconstructed from growing segments of a sequence it generates.

While we shall not make use of this fact, it is also instructive to note that memory- $k$ approximations of a stationary process $\rho$ converge to $\rho$ in distributional distance. This fact is rather easy to see from the definitions.

2 Calculating $\hat{d}$

The expressions (1), (2) may seem impossible to calculate, since they involve infinite sums. However, as we shall see in this section, they are easy to calculate exactly and, furthermore, can be approximated using only quasilinear computational resources.

First of all, note that, for a finite sample, for finite alphabets there are only finitely many non-zero summands in (1) and (2). For real-valued alphabets, there are infinitely many non-zero summands, but most of these can be collapsed, as they have the same value.

We proceed with the more-specified versions of the empirical distributional distance, which are empirical estimates of (4) and (5). Given two samples ${\bf x}=X_{1..n_{1}}$ and ${\bf y}:=Y_{1..n_{2}}$ , let $n:=\max\{n_{1},n_{2}\}$ be the size of the longer sample and define

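$$\hat{d}({\bf x},{\bf y}):=\sum_{k=1}^{k_{n}}w_{k}\sum_{B\in A^{k}}|\nu({\bf x},B)-\nu({\bf y},B)|\tag{3}$$

and for real-valued processes

$$\hat{d}({\bf x},{\bf y}):=\sum_{m=1}^{m_{n}}\sum_{l=1}^{l_{n}}w_{m}w_{l}\sum_{B\in B^{m,l}}|\nu({\bf x},B)-\nu({\bf y},B)|,\tag{4}$$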

where $k_{n},m_{n},l_{n}$ are integer-valued parameters that grow to infinity with $n$ .

First of all, note that any values of $k_{n},m_{n},l_{n}$ that monotonically increase to infinity still give consistent estimates of the distributional distance (e.g., one can check that the argument of the proof of Lemma 1.2 is unaffected). On the other hand, if we set $k_{n}\equiv\infty$ in (3), then the inner sum in (3) still has at most $n$ non-zero terms for $k\leq n$ and is 0 for $k>n$ . This makes the precise calculation of (3) at most quadratic.

Moreover, there is no reason to calculate the summands corresponding to $k_{n}\approx n$ since they are clearly not good estimates of the corresponding probabilities. In fact, it is reasonable to set $k_{n}$ of order $\log n$ , since longer subsamples are expected to be met at most once (see, for example, Kontoyiannis:94 ).
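For finite alphabets, the truncated estimate can be sketched as below. This is a hedged illustration rather than the implementation the text has in mind: it uses hashing of words instead of suffix trees, assumes the weights $w_{k}=1/(k(k+1))$ , and takes $k_{n}=\lceil\log_{2}n\rceil$ as one possible choice of order $\log n$ ; the names `word_freqs` and `d_hat` are hypothetical.

```python
from collections import Counter
from math import ceil, log2

def word_freqs(x, k):
    """Empirical frequencies of the length-k words of x.
    At most len(x) - k + 1 distinct words occur, so the result is finite."""
    total = len(x) - k + 1
    if total <= 0:
        return {}
    counts = Counter(tuple(x[i:i + k]) for i in range(total))
    return {w: c / total for w, c in counts.items()}

def d_hat(x, y, k_max=None):
    """Truncated empirical distributional distance for finite alphabets.
    k_max defaults to ceil(log2 n), with n the length of the longer sample."""
    n = max(len(x), len(y))
    if k_max is None:
        k_max = max(1, ceil(log2(max(n, 2))))
    dist = 0.0
    for k in range(1, k_max + 1):
        fx, fy = word_freqs(x, k), word_freqs(y, k)
        # Only words seen in x or y can give a non-zero summand.
        inner = sum(abs(fx.get(w, 0.0) - fy.get(w, 0.0)) for w in fx.keys() | fy.keys())
        dist += inner / (k * (k + 1))  # assumed weight w_k = 1/(k(k+1))
    return dist

same = d_hat("0101", "0101")
far = d_hat("0000", "1111")
```

Only words that actually occur in one of the samples are visited, which is what keeps the inner sum finite even though the formula ranges over all of $A^{k}$ .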

Similarly, for (4), let us begin by showing that calculating $\hat{d}$ is fully tractable with $m_{n},l_{n}\equiv\infty$ . Observe that for fixed $m$ and $l$ , the sum

$$T^{m,l}:=\sum_{B\in B^{m,l}}|\nu({\bf x},B)-\nu({\bf y},B)|$$

has not more than $n_{1}+n_{2}-2m+2$ nonzero terms (assuming $m\leq n_{1},n_{2}$ ; the other case is obvious). Indeed, there are $n_{1}-m+1$ tuples of size $m$ in the sequence ${\bf x}$ , namely $X_{1..m},X_{2..m+1},\dots,X_{n_{1}-m+1..n_{1}}$ , and likewise for the sequence ${\bf y}$ . Therefore, $T^{m,l}$ can be obtained by a finite number of calculations.

Furthermore, let

[TABLE]

and observe that $T^{m,l}=0$ for all $m>n$ and for each $m$ , for all $l>\log s^{-1}$ the term $T^{m,l}$ is constant. That is, for each fixed $m$ we have

[TABLE]

so that we simply double the weight of the last nonzero term. (Note also that $\log s^{-1}$ is bounded above by the length of the binary precision in representing the random variables $X_{i},Y_{j}$ .) Thus, even with $m_{n},l_{n}\equiv\infty$ one can calculate $\hat{d}$ precisely. Moreover, for a fixed $m\in 1..\log n$ and $l\in 1..\log s^{-1}$ for every sequence ${\bf x}$ the frequencies $\nu({\bf x},B),\ B\in B^{m,l}$ may be calculated using suffix trees or suffix arrays, with ${\cal O}(n)$ worst case construction and search complexity (see, e.g., Ukkonen:95 ). Searching all $z:=n-m+1$ occurrences of subsequences of length $m$ results in ${\cal O}(m+z)={\cal O}(n)$ complexity. This brings the overall computational complexity of (4) to ${\cal O}(nm_{n}\log s^{-1})$ ; this can potentially be improved using specialized structures, e.g., Grossi:05 .

The parameters $m_{n}$ play the same role as $k_{n}$ in the discrete case, and so can be set to be of order $\log n$ for the same reason. Finally, to choose $l_{n}<\infty$ one can either fix some constant based on the bound on the precision in real computations, or choose it in such a way that each cell $B^{m,l_{n}}$ contains no more than $\log n$ points for all $m=1..\log n$ largest values of $l_{n}$ . Thus, we arrive at the following conclusion.

Empirical distributional distance (3), (4) is efficiently computable, and can be approximated using only quasilinear computational resources.
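For real-valued samples, the same computation can be sketched by mapping each value to the index of the cell of side $2^{-l}$ that contains it. The code below is an illustration under assumptions: cells are taken to be half-open intervals $[j2^{-l},(j+1)2^{-l})$ aligned at the origin, the weights are $w_{k}=1/(k(k+1))$ , and the names `cell_freqs`, `T` and `d_hat_real` are hypothetical; hashing is used in place of suffix trees.

```python
from collections import Counter
from math import floor

def weight(k):
    """Assumed weights w_k = 1/(k(k+1))."""
    return 1.0 / (k * (k + 1))

def cell_freqs(x, m, l):
    """Frequencies of the cubes of side 2**-l hit by the length-m windows of x."""
    total = len(x) - m + 1
    if total <= 0:
        return {}
    scale = 2 ** l
    counts = Counter(tuple(floor(v * scale) for v in x[i:i + m]) for i in range(total))
    return {c: cnt / total for c, cnt in counts.items()}

def T(x, y, m, l):
    """Sum over the occupied cells of B^{m,l} of |nu(x, B) - nu(y, B)|."""
    fx, fy = cell_freqs(x, m, l), cell_freqs(y, m, l)
    return sum(abs(fx.get(c, 0.0) - fy.get(c, 0.0)) for c in fx.keys() | fy.keys())

def d_hat_real(x, y, m_max, l_max):
    """Truncated empirical distributional distance for real-valued samples."""
    return sum(weight(m) * weight(l) * T(x, y, m, l)
               for m in range(1, m_max + 1)
               for l in range(1, l_max + 1))

x = [0.1, 0.2, 0.3]
y = [0.6, 0.7, 0.8]
t11 = T(x, y, 1, 1)            # x falls in cell 0, y in cell 1 at resolution 1/2
d11 = d_hat_real(x, y, 1, 1)
```

Only occupied cells are stored, so each $T^{m,l}$ involves at most $n_{1}+n_{2}-2m+2$ terms, in line with the counting argument above.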

3 The three-sample problem

Let there be given three samples ${\bf x},{\bf y},{\bf z}\in A^{*}$ . Each sample is generated by a stationary ergodic process $\rho_{\bf x}$ , $\rho_{\bf y}$ and $\rho_{\bf z}$ respectively. Moreover, it is known that either $\rho_{\bf z}=\rho_{\bf x}$ or $\rho_{\bf z}=\rho_{\bf y}$ , but $\rho_{\bf x}\neq\rho_{\bf y}$ . We wish to construct a test that, based on the finite samples ${\bf x},{\bf y}$ and ${\bf z}$ will tell whether $\rho_{\bf z}=\rho_{\bf x}$ or $\rho_{\bf z}=\rho_{\bf y}$ .

This problem is known under the names of three-sample problem and (process) classification. Its i.i.d. version, i.e., the case when each of the samples consists of i.i.d. random variables, is one of the classical problems of mathematical statistics (e.g., Lehmann:86 ). The case of dependent time series was considered in Gutman:89 , where a solution is presented under the finite-memory assumption. The material presented here is based on Ryabko:103s .

Essentially, the problem is to answer the question “which distribution is closer to which other distribution” based on the three samples given. The test we shall consider is doing this based on the estimates of the distributional distance.

Thus, let us consider a test that chooses the sample ${\bf x}$ or ${\bf y}$ according to whichever is closer to ${\bf z}$ in $\hat{d}$ . That is, we define the test $L({\bf x},{\bf y},{\bf z})$ as follows. If $\hat{d}({\bf x},{\bf z})\leq\hat{d}({\bf y},{\bf z})$ then the test says that the sample ${\bf z}$ is generated by the same process as the sample ${\bf x}$ , otherwise it says that the sample ${\bf z}$ is generated by the same process as the sample ${\bf y}$ .

Definition 3.1 (Process classifier).

Define the classifier $L:A^{*}\times A^{*}\times A^{*}\rightarrow\{\text{``x'',``y''}\}$ as follows

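$$L({\bf x},{\bf y},{\bf z}):=\begin{cases}\text{``x''}&\text{if }\hat{d}({\bf x},{\bf z})\leq\hat{d}({\bf y},{\bf z}),\\\text{``y''}&\text{otherwise}\end{cases}$$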

for ${\bf x},{\bf y},{\bf z}\in A^{*}$ .

Theorem 3.2.

The test $L({\bf x},{\bf y},{\bf z})$ makes only a finite number of errors when $|{\bf x}|,|{\bf y}|$ and $|{\bf z}|$ go to infinity, with probability 1: if $\rho_{\bf x}=\rho_{\bf z}$ then

$$L({\bf x},{\bf y},{\bf z})=\text{``x''}$$

from some $|{\bf x}|,|{\bf y}|,|{\bf z}|$ on with probability 1; otherwise

$$L({\bf x},{\bf y},{\bf z})=\text{``y''}$$

from some $|{\bf x}|,|{\bf y}|,|{\bf z}|$ on with probability 1.

Proof 3.3.

From the fact that $d$ is a metric and from Lemma 1.2 we conclude that $\hat{d}({\bf x},{\bf z})\rightarrow 0$ (with probability 1) if and only if $\rho_{\bf x}=\rho_{\bf z}$ . So, if $\rho_{\bf x}=\rho_{\bf z}$ then by assumption $\rho_{\bf y}\neq\rho_{\bf z}$ and $\hat{d}({\bf x},{\bf z})\rightarrow 0$ a.s. while

$$\lim_{|{\bf y}|,|{\bf z}|\rightarrow\infty}\hat{d}({\bf y},{\bf z})=d(\rho_{\bf y},\rho_{\bf z})\neq 0\ \text{ a.s.}$$

Thus in this case $\hat{d}({\bf y},{\bf z})>\hat{d}({\bf x},{\bf z})$ from some $|{\bf x}|,|{\bf y}|,|{\bf z}|$ on with probability 1, from which moment we have $L({\bf x},{\bf y},{\bf z})=\text{ ``x'' }$ . The opposite case is analogous.
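The classifier of Definition 3.1 is short enough to sketch in full for finite alphabets. As before, this is an illustration under assumptions: `d_hat` is a truncated empirical distributional distance with assumed weights $w_{k}=1/(k(k+1))$ and a word-length cut-off of order $\log n$ , and all function names are hypothetical.

```python
from collections import Counter
from math import ceil, log2

def word_freqs(x, k):
    """Empirical frequencies of the length-k words of x."""
    total = len(x) - k + 1
    if total <= 0:
        return {}
    counts = Counter(tuple(x[i:i + k]) for i in range(total))
    return {w: c / total for w, c in counts.items()}

def d_hat(x, y):
    """Truncated empirical distributional distance (assumed weights and cut-off)."""
    k_max = max(1, ceil(log2(max(len(x), len(y), 2))))
    dist = 0.0
    for k in range(1, k_max + 1):
        fx, fy = word_freqs(x, k), word_freqs(y, k)
        inner = sum(abs(fx.get(w, 0.0) - fy.get(w, 0.0)) for w in fx.keys() | fy.keys())
        dist += inner / (k * (k + 1))
    return dist

def classify(x, y, z):
    """Return "x" if z is at least as close to x as to y in d_hat, else "y"."""
    return "x" if d_hat(x, z) <= d_hat(y, z) else "y"

x = "01" * 20          # alternating sample
y = "0" * 40           # constant sample
label_alt = classify(x, y, "01" * 15)
label_const = classify(x, y, "0" * 30)
```

Ties are resolved in favour of "x", matching the non-strict inequality in the definition; by Theorem 3.2 the choice of tie-breaking does not matter asymptotically.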

4 Impossibility of discrimination

The following problem is variously known as (process) discrimination, homogeneity testing or two-sample testing. For the asymptotic version we consider here the name process discrimination is more suited, and so this is the name we adopt in this section, reserving the name homogeneity testing for other versions.

Two series of observations $X_{1},X_{2},\dots,X_{n},\dots$ and $Y_{1},Y_{2},\dots,Y_{n},\dots$ are presented sequentially. On each time step $n$ we would like to say whether the distributions generating the samples $X_{1},\dots,X_{n}$ and $Y_{1},\dots,Y_{n}$ are the same or different. In this section we are after an impossibility result, so we restrict the consideration to the case of the binary-valued processes.

Here we shall see that there is no asymptotically consistent discrimination procedure for the stationary ergodic processes with binary alphabet. The notion of consistency is perhaps the weakest one can think of: it is shown that for any discrimination procedure its expected answer does not converge to the correct one at least for some processes. In fact, a stronger result is established, showing that there is no asymptotically consistent discrimination procedure for a smaller set of processes, namely, that of $B$ -processes. The class of $B$ -processes (formally defined below) is sufficiently wide to include, for example, $k$ -order Markov processes and functions thereof, but, on the other hand, it is a strict subset of the set of stationary ergodic processes.

The material of this section is after Ryabko:10discr . The additional definitions introduced ( $B$ -processes, $\bar{d}$ -distance) as well as the proof of the main theorem are not necessary for understanding the material of the subsequent chapters.

1 Setup and definitions

Let the alphabet be binary, $A:=\{0,1\}$ . A discrimination procedure (or a homogeneity test) $D$ is a family of mappings $D_{n}:A^{n}\times A^{n}\rightarrow\{0,1\}$ , $n\in\mathbb{N}$ , that maps a pair of samples $(X_{1},\dots,X_{n})$ , $(Y_{1},\dots,Y_{n})$ into a binary (“yes” or “no”) answer: the samples are generated by different distributions, or they are generated by the same distribution.

A discrimination procedure $D$ is asymptotically consistent for a set $\mathcal{C}$ of process distributions if for any two distributions $\rho_{\bf x},\rho_{\bf y}\in\mathcal{C}$ independently generating the sequences $X_{1},X_{2},\dots$ and $Y_{1},Y_{2},\dots$ respectively, the expected output converges to the correct answer: the following limit exists and the equality holds

$$\lim_{n\rightarrow\infty}\mathbb{E}D_{n}((X_{1},\dots,X_{n}),(Y_{1},\dots,Y_{n}))=\begin{cases}0&\text{if }\rho_{\bf x}=\rho_{\bf y},\\1&\text{otherwise.}\end{cases}\tag{7}$$

This is perhaps the weakest notion of correctness one can consider.

Clearly, asymptotically consistent discriminating procedures exist for many classes of processes, for example for the class of all i.i.d. processes (e.g. Lehmann:86 ) and various parametric families. Indeed, for i.i.d. samples one usually requires stronger forms of consistency than the asymptotic notion considered here.

To be able to define the set of $B$ -processes, we need to introduce another distance between process distributions, the $\bar{d}$ distance.

For two finite-valued stationary processes $\rho_{\bf x}$ and $\rho_{\bf y}$ the $\bar{d}$ -distance $\bar{d}(\rho_{\bf x},\rho_{\bf y})$ is said to be less than $\varepsilon$ if there exists a single stationary process $\nu_{xy}$ on pairs $(X_{n},Y_{n})$ , $n\in\mathbb{N}$ , such that $X_{n}$ , $n\in\mathbb{N}$ are distributed according to $\rho_{\bf x}$ and $Y_{n}$ are distributed according to $\rho_{\bf y}$ while

$$\nu_{xy}(X_{1}\neq Y_{1})\leq\varepsilon.\tag{8}$$

The infimum of the $\varepsilon$ ’s for which a coupling can be found such that (8) is satisfied is taken to be the $\bar{d}$ -distance between $\rho_{\bf x}$ and $\rho_{\bf y}$ .

Definition 4.1.

A process is called a $B$ -process (or a Bernoulli process) if it is in the $\bar{d}$ -closure of the set of all aperiodic stationary ergodic $k$ -step Markov processes, where $k\in\mathbb{N}$ .

For more information on $\bar{d}$ -distance and $B$ -processes see Ornstein:74 .

2 The main result

Theorem 4.2.

There is no asymptotically consistent discrimination procedure for the set of all $B$ -processes.

Before presenting the proof, it is worth putting this result in the context of other results on $B$ -processes. As mentioned in the introduction, Ornstein and Weiss Ornstein:90 construct an estimator $\bar{s}_{n}$ such that

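$$\lim_{n\rightarrow\infty}\bar{s}_{n}((X_{1},\dots,X_{n}),(Y_{1},\dots,Y_{n}))=\bar{d}(\rho_{1},\rho_{2})\ \text{ a.s.}\tag{9}$$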

if both processes $\rho_{1}$ and $\rho_{2}$ generating the samples $X_{i}$ and $Y_{i}$ respectively are $B$ -processes. In the same work it is shown that there is no estimator $\bar{s}_{n}$ for which (9) holds for every pair $\rho_{1},\rho_{2}$ of stationary ergodic processes.

Comparing these results to those on distributional distance presented in the previous section (namely, Lemma 1.2), we can say that the stronger the distance the harder it is to estimate: the distributional distance can be consistently estimated for stationary ergodic processes, the $\bar{d}$ distance can be consistently estimated for $B$ -processes but not for stationary ergodic processes, while the strongest possible distance, the one that gives the discrete topology, cannot be consistently estimated for $B$ -processes, as shown in this section.

It is also worth noting that the proof given below yields a slightly stronger result, namely, the impossibility of discrimination between finite-dimensional (including single-dimensional) marginals of the processes. Specifically, correctness of the discrimination procedure (7) can be replaced with the following

[TABLE]

with the same proof carrying over.

The proof, presented below, is by contradiction. It is assumed that a consistent discrimination procedure exists, and a process is exhibited that will trick such a procedure into giving divergent results. The construction on which the proof is based uses the ideas of the “random walk over the diagonal” construction used in BRyabko:88 to demonstrate that consistent prediction for stationary ergodic processes is impossible (see also its exposition in Gyorfi:98 ).

Proof 4.3.

We will assume that an asymptotically consistent discrimination procedure $D$ for the class of all $B$ -processes exists, and will construct a $B$ -process $\rho$ such that if both sequences $X_{i}$ and $Y_{i}$ , $i\in\mathbb{N}$ are generated by $\rho$ then $\mathbb{E}D_{n}$ diverges; this contradiction will prove the theorem.

The scheme of the proof is as follows. On Step 1 we construct a sequence of processes $\rho_{2k}$ , $\rho_{d2k+1}$ , and $\rho_{u2k+1}$ , where $k=0,1,\dots$ . On Step 2 we construct a process $\rho$ , which is shown to be the limit of the sequence $\rho_{2k}$ , $k\in\mathbb{N}$ , in $\bar{d}$ -distance. On Step 3 we show that two independent runs of the process $\rho$ have a property that (with high probability) they first behave like two runs of a single process $\rho_{0}$ , then like two runs of two different processes $\rho_{u1}$ and $\rho_{d1}$ , then like two runs of a single process $\rho_{2}$ , and so on, thereby showing that the test $D$ diverges and obtaining the desired contradiction.

Assume that there exists an asymptotically consistent discriminating procedure $D$ . Fix some $\varepsilon\in(0,1/2)$ and $\delta\in[1/2,1)$ , to be defined on Step 3.

Step 1.* We will construct the sequence of processes $\rho_{2k}$ , $\rho_{u2k+1}$ , and $\rho_{d2k+1}$ , where $k=0,1,\dots$ .*

Step 1.0.* Construct the process $\rho_{0}$ as follows. A Markov chain $m_{0}$ is defined on the set $\mathbb{N}$ of states. From each state $i\in\mathbb{N}$ the chain passes to the state $0$ with probability $\delta$ and to the state ${i+1}$ with probability $1-\delta$ . With transition probabilities so defined, the chain possesses a unique stationary distribution $M_{0}$ on the set $\mathbb{N}$ , which can be calculated explicitly using e.g. (Shiryaev:96, Theorem VIII.4.1), and is as follows: $M_{0}(0)=\delta$ , $M_{0}(k)=\delta(1-\delta)^{k}$ , for all $k\in\mathbb{N}$ . Take this distribution as the initial distribution over the states.*

The function $f_{0}$ maps the states to the output alphabet $\{0,1\}$ as follows: $f_{0}(i)=1$ for every $i\in\mathbb{N}$ . Let $s_{t}$ be the state of the chain at time $t$ . The process $\rho_{0}$ is defined as $\rho_{0}=f_{0}(s_{t})$ for $t\in\mathbb{N}$ . As a result of this definition, the process $\rho_{0}$ simply outputs $1$ with probability $1$ on every time step (however, by using different functions $f$ we will have less trivial processes in the sequel). Clearly, the constructed process is stationary ergodic and a B-process. So, we have defined the chain $m_{0}$ (and the process $\rho_{0}$ ) up to a parameter $\delta$ .
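The stationary distribution quoted in Step 1.0 can be checked directly from the balance equations of $m_{0}$ . The following is a short verification sketch, not taken from the cited reference. It uses only the fact that state $k+1$ is entered solely from state $k$ (with probability $1-\delta$ ), while state $0$ is entered from every state with probability $\delta$ .

$$
\begin{aligned}
M_{0}(0) &= \sum_{k\geq 0}\delta\,M_{0}(k)=\delta, \\
M_{0}(k+1) &= (1-\delta)\,M_{0}(k)\quad\Longrightarrow\quad M_{0}(k)=\delta(1-\delta)^{k}, \\
\sum_{k\geq 0}M_{0}(k) &= \delta\sum_{k\geq 0}(1-\delta)^{k}=\frac{\delta}{1-(1-\delta)}=1 .
\end{aligned}
$$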

Step 1.1.* We begin with the process $\rho_{0}$ and the chain $m_{0}$ of the previous step. Since the test $D$ is asymptotically consistent we will have*

[TABLE]

from some $t_{0}$ on, where both samples $X_{i}$ and $Y_{i}$ are generated by $\rho_{0}$ (that is, both samples consist of 1s only). Let $k_{0}$ be such an index that the chain $m_{0}$ starting from the state $0$ with probability $1$ does not reach the state $k_{0}-1$ by time $t_{0}$ (we can take $k_{0}=t_{0}+2$ ).

Construct two processes $\rho_{u1}$ and $\rho_{d1}$ as follows. They are also based on the Markov chain $m_{0}$ , but the functions $f$ are different. The function $f_{u1}:\mathbb{N}\rightarrow\{0,1\}$ is defined as follows: $f_{u1}(i)=f_{0}(i)=1$ for $i\leq k_{0}$ and $f_{u1}(i)=0$ for $i>k_{0}$ . The function $f_{d1}$ is identically $1$ ( $f_{d1}(i)=1$ , $i\in\mathbb{N}$ ). The processes $\rho_{u1}$ and $\rho_{d1}$ are defined as $\rho_{u1}=f_{u1}(s_{t})$ and $\rho_{d1}=f_{d1}(s_{t})$ for $t\in\mathbb{N}$ . Thus the process $\rho_{d1}$ will again produce only 1s, but the process $\rho_{u1}$ will occasionally produce 0s.
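The construction of $\rho_{u1}$ and $\rho_{d1}$ as functions of the chain $m_{0}$ can be illustrated by a small simulation. This is a minimal sketch, not part of the proof: the helper names (`run_chain`, `make_f_u1`, `f_d1`), the values of `delta` and `k0`, and the choice to start the chain in state 0 rather than from the stationary distribution are all assumptions made for illustration. For brevity both output sequences are read off the same chain trajectory, whereas the proof compares two independent copies.

```python
import random

def run_chain(delta, steps, rng, start=0):
    """Simulate m_0: from state i go to 0 w.p. delta, else to i + 1."""
    s = start
    states = []
    for _ in range(steps):
        states.append(s)
        s = 0 if rng.random() < delta else s + 1
    return states

def f_d1(i):
    """f_d1 is identically 1."""
    return 1

def make_f_u1(k0):
    """f_u1 equals 1 on states 0..k0 and 0 on states beyond k0."""
    return lambda i: 1 if i <= k0 else 0

# Illustrative (hypothetical) parameter values.
rng = random.Random(1)
delta, k0 = 0.5, 3
states = run_chain(delta, 10_000, rng)
f_u1 = make_f_u1(k0)
rho_u1 = [f_u1(s) for s in states]   # occasionally outputs 0
rho_d1 = [f_d1(s) for s in states]   # outputs 1 only
```

Started from state 0, the chain is in a state of index at most $t$ at time $t$ , so the two output sequences necessarily agree on the first $k_{0}+1$ time steps; this is the property the proof exploits when choosing $k_{0}$ .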

Step 1.2.* Being run on two samples generated by the processes $\rho_{u1}$ and $\rho_{d1}$ which both start from the state 0, the test $D_{n}$ on the first $t_{0}$ steps produces many 0s, since on these first $k_{0}$ states all the functions $f$ , $f_{u1}$ and $f_{d1}$ coincide. However, since the processes are different and the test is asymptotically consistent (by assumption), the test starts producing 1s, until by a certain time step $t_{1}$ almost all answers are 1s. Next we will construct the process $\rho_{2}$ by “gluing” together $\rho_{u1}$ and $\rho_{d1}$ and continuing them in such a way that, being run on two samples produced by $\rho_{2}$ the test first produces 0s (as if the samples were drawn from $\rho_{0}$ ), then, with probability close to 1/2 it will produce many 1s (as if the samples were from $\rho_{u1}$ and $\rho_{d1}$ ) and then again 0s.*

The process $\rho_{2}$ is the pivotal point of the construction, so we give it in some detail. On step 1.2a we present the construction of the process, and on step 1.2b we show that this process is a $B$ -process by demonstrating that it is equivalent to a (deterministic) function of a Markov chain.

Step 1.2a.* Let $t_{1}>t_{0}$ be such a time index that*

[TABLE]

where the samples $X_{i}$ and $Y_{i}$ are generated by $\rho_{u1}$ and $\rho_{d1}$ correspondingly (the samples are generated independently; that is, the processes are based on two independent copies of the Markov chain $m_{0}$ ). Let $k_{1}>k_{0}$ be such an index that the chain $m_{0}$ starting from the state 0 with probability $1$ does not reach the state $k_{1}-1$ by time $t_{1}$ .

Construct the process $\rho_{2}$ as follows (see fig. 1).

It is based on a chain $m_{2}$ on which the Markov assumption is violated. The transition probabilities on states $0,\dots,k_{0}$ are the same as for the Markov chain $m_{0}$ (from each state return to 0 with probability $\delta$ or go to the next state with probability $1-\delta$ ).

There are two “special” states: the “switch” $S_{2}$ and the “reset” $R_{2}$ . From the state $k_{0}$ the chain passes with probability $1-\delta$ to the “switch” state $S_{2}$ . The switch $S_{2}$ can itself have two values: $up$ and $down$ . If $S_{2}$ has the value $up$ then from $S_{2}$ the chain passes to the state $u_{k_{0}+1}$ with probability 1, while if $S_{2}=down$ the chain goes to $d_{k_{0}+1}$ , with probability 1. If the chain reaches the state $R_{2}$ then the value of $S_{2}$ is set to $up$ with probability 1/2 and with probability 1/2 it is set to $down$ . In other words, the first transition from $S_{2}$ is random (either to $u_{k_{0}+1}$ or to $d_{k_{0}+1}$ with equal probabilities) and then this decision is remembered until the “reset” state $R_{2}$ is visited, whereupon the switch again assumes the values $up$ and $down$ with equal probabilities.

The rest of the transitions are as follows. From each state $u_{i}$ , $k_{0}\leq i\leq k_{1}$ the chain passes to the state $0$ with probability $\delta$ and to the next state $u_{i+1}$ with probability $1-\delta$ . From the state $u_{k_{1}}$ the process goes with probability $\delta$ to 0 and with probability $1-\delta$ to the “reset” state $R_{2}$ . The same with states $d_{i}$ : for $k_{0}<i\leq k_{1}$ the process returns to 0 with probability $\delta$ or goes to the next state $d_{i+1}$ with probability $1-\delta$ , where the next state for $d_{k_{1}}$ is the “reset” state $R_{2}$ . From $R_{2}$ the process goes with probability 1 to the state $k_{1}+1$ , from where the chain continues ad infinitum: to the state 0 with probability $\delta$ or to the next state $k_{1}+2$ etc. with probability $1-\delta$ .

The initial distribution on the states is defined as follows. The probabilities of the states $0..k_{0},k_{1}+1,k_{1}+2,\dots$ are the same as in the Markov chain $m_{0}$ , that is, $\delta(1-\delta)^{j}$ , for $j=0..k_{0},k_{1}+1,k_{1}+2,\dots$ . For the states $u_{j}$ and $d_{j}$ , $k_{0}<j\leq k_{1}$ define their initial probabilities to be 1/2 of the probability of the corresponding state in the chain $m_{0}$ , that is $m_{2}(u_{j})=m_{2}(d_{j})=m_{0}(j)/2=\delta(1-\delta)^{j}/2$ . Furthermore, if the chain starts in a state $u_{j}$ , $k_{0}<j\leq k_{1}$ , then the value of the switch $S_{2}$ is $up$ , and if it starts in the state $d_{j}$ then the value of the switch $S_{2}$ is $down$ , whereas if the chain starts in any other state then the probability distribution on the values of the switch $S_{2}$ is 1/2 for either $up$ or $down$ .

The function $f_{2}$ is defined as follows: $f_{2}(i)=1$ for $0\leq i\leq k_{0}$ and $i>k_{1}$ (before the switch and after the reset); $f_{2}(u_{i})=0$ for all $i$ , $k_{0}<i\leq k_{1}$ and $f_{2}(d_{i})=1$ for all $i$ , $k_{0}<i\leq k_{1}$ . The function $f_{2}$ is undefined on $S_{2}$ and $R_{2}$ , therefore there is no output on these states (we also assume that passing through $S_{2}$ and $R_{2}$ does not increment time). As before, the process $\rho_{2}$ is defined as $\rho_{2}=f_{2}(s_{t})$ where $s_{t}$ is the state of $m_{2}$ at time $t$ , omitting the states $S_{2}$ and $R_{2}$ . The resulting process is illustrated in fig. 1.
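The switch-and-reset mechanism of $m_{2}$ can be made concrete with a short simulation. The sketch below is illustrative only: the function name `run_m2`, the tuple encoding of states, the `epochs` bookkeeping and the parameter values are assumptions, and the chain is started in state 0 with a uniformly random switch value rather than from the full initial distribution described above. The switch $S_{2}$ and the reset $R_{2}$ are handled inline and consume no time, as stipulated in the text.

```python
import random

def run_m2(delta, k0, k1, steps, rng):
    """Simulate m_2 from state 0; return (outputs, epochs).

    A state is a pair (kind, i) with kind in {'n', 'u', 'd'}: 'n' marks the
    plain states 0..k0 and k1+1, k1+2, ...; 'u' and 'd' mark the two branches.
    epochs[j] collects the branches visited between the j-th and (j+1)-th
    passage through the reset R_2.
    """
    switch = rng.choice(['u', 'd'])        # initial switch value, 1/2 each
    kind, i = 'n', 0
    outputs, epochs = [], [set()]
    for _ in range(steps):
        outputs.append(0 if kind == 'u' else 1)   # f_2: 0 on u-states only
        if kind != 'n':
            epochs[-1].add(kind)
        if rng.random() < delta:
            kind, i = 'n', 0                      # return to state 0
        elif kind == 'n' and i == k0:
            kind, i = switch, k0 + 1              # pass through switch S_2
        elif kind != 'n' and i == k1:
            switch = rng.choice(['u', 'd'])       # pass through reset R_2
            epochs.append(set())
            kind, i = 'n', k1 + 1
        else:
            i += 1                                # advance within the branch
    return outputs, epochs

# Illustrative (hypothetical) parameter values.
outputs, epochs = run_m2(delta=0.5, k0=1, k1=3, steps=20_000,
                         rng=random.Random(0))
```

Between two consecutive resets every excursion past $k_{0}$ takes the same branch, so each entry of `epochs` contains at most one branch label; this memory is exactly what makes $m_{2}$ non-Markovian on the stated state space.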

Step 1.2b.* To show that the process $\rho_{2}$ is stationary ergodic and a $B$ -process, we will show that it is equivalent to a function of a stationary ergodic Markov chain, whereas all such processes are known to be $B$ (e.g. Shields:96 ). The construction is as follows (see fig. 2). This chain has states $k_{1}+1,\dots$ and also $u_{0},\dots,u_{k_{0}},u_{k_{0}+1},\dots,u_{k_{1}}$ and $d_{0},\dots,d_{k_{0}},d_{k_{0}+1},\dots,d_{k_{1}}$ .*

From the states $u_{i}$ , $i=0,\dots,k_{1}$ the chain passes with probability $1-\delta$ to the next state $u_{i+1}$ , where the next state for $u_{k_{1}}$ is $k_{1}+1$ , and with probability $\delta$ returns to the state $u_{0}$ (and not to the state 0). Transitions for the states $d_{0},\dots,d_{k_{1}-1}$ are defined analogously. Thus the states $u_{i}$ correspond to the state $up$ of the switch $S_{2}$ and the states $d_{i}$ to the state $down$ of the switch. Transitions for the states $k_{1}+1,k_{1}+2,\dots$ are defined as follows: with probability $\delta/2$ to the state $u_{0}$ , with probability $\delta/2$ to the state $d_{0}$ , and with probability $1-\delta$ to the next state. Thus, transitions to 0 from the states with indices greater than $k_{1}$ correspond to the reset $R_{2}$ . Clearly, the chain $m_{2}^{\prime}$ as defined possesses a unique stationary distribution $M_{2}$ over the set of states and $M_{2}(i)>0$ for every state $i$ . Moreover, this distribution is the same as the initial distribution on the states of the chain $m_{0}$ , except for the states $u_{i}$ and $d_{i}$ , for which we have $m_{2}^{\prime}(u_{i})=m_{2}^{\prime}(d_{i})=m_{0}(i)/2=\delta(1-\delta)^{i}/2$ , for $0\leq i\leq k_{0}$ . We take this distribution as the initial distribution on the states of $m_{2}^{\prime}$ . The resulting process $m_{2}^{\prime}$ is stationary ergodic, and a $B$ -process, since it is a function of a Markov chain Shields:96 . It is easy to see that if we define the function $f_{2}$ on the states of $m_{2}^{\prime}$ as 1 on all states except $u_{k_{0}+1},\dots,u_{k_{1}}$ , then the resulting process is exactly the process $\rho_{2}$ . Therefore, $\rho_{2}$ is stationary ergodic and a $B$ -process.
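The claimed stationary distribution of $m_{2}^{\prime}$ can be verified from its balance equations. The check below is a sketch under one reading of the text: it takes $M_{2}(u_{i})=M_{2}(d_{i})=\delta(1-\delta)^{i}/2$ for all $0\leq i\leq k_{1}$ (the halved values are stated for $i\leq k_{0}$ here and for $k_{0}<i\leq k_{1}$ in Step 1.2a) and $M_{2}(i)=\delta(1-\delta)^{i}$ for $i>k_{1}$ . The equations for the $d$ -states are identical by symmetry.

$$
\begin{aligned}
M_{2}(u_{i+1}) &= (1-\delta)\,M_{2}(u_{i}), \qquad 0\leq i<k_{1}, \\
M_{2}(k_{1}+1) &= (1-\delta)\bigl(M_{2}(u_{k_{1}})+M_{2}(d_{k_{1}})\bigr)=\delta(1-\delta)^{k_{1}+1}, \\
M_{2}(i+1) &= (1-\delta)\,M_{2}(i), \qquad i>k_{1}, \\
M_{2}(u_{0}) &= \delta\sum_{i=0}^{k_{1}}M_{2}(u_{i})+\frac{\delta}{2}\sum_{i>k_{1}}M_{2}(i)
 =\frac{\delta}{2}\bigl(1-(1-\delta)^{k_{1}+1}\bigr)+\frac{\delta}{2}(1-\delta)^{k_{1}+1}=\frac{\delta}{2}.
\end{aligned}
$$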

Step 1. $k$ .* As before, we can continue the construction of the processes $\rho_{u3}$ and $\rho_{d3}$ , that start with a segment of $\rho_{2}$ . Let $t_{2}>t_{1}$ be a time index such that*

[TABLE]

where both samples are generated by $\rho_{2}$ . Let $k_{2}>k_{1}$ be such an index that when starting from the state 0 the process $m_{2}$ with probability 1 does not reach $k_{2}-1$ by time $t_{2}$ (equivalently: the process $m_{2}^{\prime}$ does not reach $k_{2}-1$ when starting from either $u_{0}$ or $d_{0}$ ). The processes $\rho_{u3}$ and $\rho_{d3}$ are based on the same process $m_{2}$ as $\rho_{2}$ . The functions $f_{u3}$ and $f_{d3}$ coincide with $f_{2}$ on all states up to the state $k_{2}$ (including the states $u_{i}$ and $d_{i}$ , $k_{0}<i\leq k_{1}$ ). After $k_{2}$ the function $f_{u3}$ outputs 0s while $f_{d3}$ outputs 1s: $f_{u3}(i)=0$ , $f_{d3}(i)=1$ for $i>k_{2}$ .

Furthermore, we find a time $t_{3}>t_{2}$ by which we have $\mathbb{E}_{\rho_{u3}\times\rho_{d3}}D_{t_{3}}>1-\varepsilon,$ where the samples are generated by $\rho_{u3}$ and $\rho_{d3}$ , which is possible since $D$ is consistent. Next, find an index $k_{3}>k_{2}$ such that the process $m_{2}$ does not reach $k_{3}-1$ with probability $1$ if the processes $\rho_{u3}$ and $\rho_{d3}$ are used to produce two independent sequences and both start from the state 0. We then construct the process $\rho_{4}$ based on a (non-Markovian) process $m_{4}$ by “gluing” together $\rho_{u3}$ and $\rho_{d3}$ after the step $k_{3}$ with a switch $S_{4}$ and a reset $R_{4}$ exactly as was done when constructing the process $\rho_{2}$ . The process $m_{4}$ is illustrated in fig. 3a. The process $m_{4}$ can be shown to be equivalent to a Markov chain $m_{4}^{\prime}$ , which is constructed analogously to the chain $m_{2}^{\prime}$ (see fig. 3b). Thus, the process $\rho_{4}$ can be shown to be a $B$ -process.

Proceeding this way we can construct the processes $\rho_{2j}$ , $\rho_{u2j+1}$ and $\rho_{d2j+1}$ , $j\in\mathbb{N}$ choosing the time steps $t_{j}>t_{j-1}$ so that the expected output of the test approaches 0 by the time $t_{j}$ being run on two samples produced by $\rho_{j}$ for even $j$ , and approaches 1 by the time $t_{j}$ being run on samples produced by $\rho_{uj}$ and $\rho_{dj}$ for odd $j$ :

[TABLE]

and

[TABLE]

For each $j$ the number $k_{j}>k_{j-1}$ is selected in such a way that the state $k_{j}-1$ is not reached (with probability 1) by the time $t_{j}$ when starting from the state 0. Each of the processes $\rho_{2j}$ , $\rho_{u2j+1}$ and $\rho_{d2j+1}$ , $j\in\mathbb{N}$ can be shown to be stationary ergodic and a $B$ -process by demonstrating equivalence to a Markov chain, analogously to the Step 1.2. The initial state distribution of each of the processes $\rho_{t},t\in\mathbb{N}$ is $M_{t}(k)=\delta(1-\delta)^{k}$ and $M_{t}(u_{k})=M_{t}(d_{k})=\delta(1-\delta)^{k}/2$ for those $k\in\mathbb{N}$ for which the corresponding states are defined.

Step 2.* Having defined $k_{j}$ , $j\in\mathbb{N}$ we can define the process $\rho$ . The construction is given on Step 2a, while on Step 2b we show that $\rho$ is stationary ergodic and a $B$ -process, by showing that it is the limit of the sequence $\rho_{2j}$ , $j\in\mathbb{N}$ .*

Step 2a.* The process $\rho$ can be constructed as follows (see fig. 4).*

The construction is based on the (non-Markovian) process $m_{\rho}$ that has states $0,\dots,k_{0}$ , $k_{2j+1}+1,\dots,k_{2(j+1)}$ , $u_{k_{2j}+1},\dots,u_{k_{2j+1}}$ and $d_{k_{2j}+1},\dots,d_{k_{2j+1}}$ for $j\in\mathbb{N}$ , along with switch states $S_{2j}$ and reset states $R_{2j}$ . Each switch $S_{2j}$ diverts the process to the state $u_{k_{2j}+1}$ if the switch has value $up$ and to $d_{k_{2j}+1}$ if it has the value $down$ . The reset $R_{2j}$ sets $S_{2j}$ to $up$ with probability 1/2 and to $down$ also with probability 1/2. From each state that is neither a reset nor a switch, the process goes to the next state with probability $1-\delta$ and returns to the state 0 with probability $\delta$ (cf. Step 1 $k$ ).

The initial distribution $M_{\rho}$ on the states of $m_{\rho}$ is defined as follows. For every state $i$ such that $0\leq i\leq k_{0}$ or $k_{2j+1}<i\leq k_{2j+2}$ , $j=0,1,\dots$ , define the initial probability of the state $i$ as $M_{\rho}(i)=\delta(1-\delta)^{i}$ (the same as in the chain $m_{0}$ ), and for the states $u_{j}$ and $d_{j}$ (for those $j$ for which these states are defined) let $M_{\rho}(u_{j})=M_{\rho}(d_{j}):=\delta(1-\delta)^{j}/2$ (that is, 1/2 of the probability of the corresponding state of $m_{0}$ ).

The function $f$ is defined as 1 everywhere except for the states $u_{j}$ (for all $j\in\mathbb{N}$ for which $u_{j}$ is defined) on which $f$ takes the value 0. The process $\rho$ is defined at time $t$ as $f(s_{t})$ , where $s_{t}$ is the state of $m_{\rho}$ at time $t$ .

Step 2b.* To show that $\rho$ is a $B$ -process, let us first show that it is stationary. Recall the definition 2 of the distributional distance between (arbitrary) process distributions. The set of all stochastic processes, equipped with this distance, is complete, and the set of all stationary processes is its closed subset Gray:88 . Thus, to show that the process $\rho$ is stationary it suffices to show that $\lim_{j\to\infty}d(\rho_{2j},\rho)=0$ , since the processes $\rho_{2j}$ , $j\in\mathbb{N}$ , are stationary. To do this, it is enough to demonstrate that*

[TABLE]

for each $B\in A^{*}$ . Since the processes $m_{\rho}$ and $m_{2j}$ coincide on all states up to $k_{2j+1}$ , we have

[TABLE]

for every $n\in\mathbb{N}$ and $a\in A$ . Moreover, for any tuple $B\in A^{*}$ we obtain

[TABLE]

where the convergence follows from $k_{2j}\to\infty$ . We conclude that (13) holds true, so that $d(\rho,\rho_{2j})\to 0$ and $\rho$ is stationary.

To show that $\rho$ is a $B$ -process, we will demonstrate that it is the limit of the sequence $\rho_{2k}$ , $k\in\mathbb{N}$ in the $\bar{d}$ distance (which was only defined for stationary processes). Since the set of all $B$ -processes is a closed subset of all stationary processes, it will follow that $\rho$ itself is a $B$ -process. (Observe that this way we get ergodicity of $\rho$ “for free”, since the set of all ergodic processes is closed in $\bar{d}$ distance, and all the processes $\rho_{2j}$ are ergodic.) In order to show that $\bar{d}(\rho,\rho_{2j})\to 0$ we have to find for each $j$ a process $\nu_{2j}$ on pairs $(X_{1},Y_{1}),(X_{2},Y_{2}),\dots$ , such that $X_{i}$ are distributed according to $\rho$ and $Y_{i}$ are distributed according to $\rho_{2j}$ , and such that $\lim_{j\to\infty}\nu_{2j}(X_{1}\neq Y_{1})=0$ . Construct such a coupling as follows. Consider the chains $m_{\rho}$ and $m_{2j}$ , which start in the same state (with initial distribution being $M_{\rho}$ ) and always take state transitions together, where if the process $m_{\rho}$ is in the state $u_{t}$ or $d_{t}$ , $t\geq k_{2j+1}$ (that is, one of the states which the chain $m_{2j}$ does not have) then the chain $m_{2j}$ is in the state $t$ . The first coordinate of the process $\nu_{2j}$ is obtained by applying the function $f$ to the process $m_{\rho}$ and the second by applying $f_{2j}$ to the chain $m_{2j}$ . Clearly, the distribution of the first coordinate is $\rho$ and the distribution of the second is $\rho_{2j}$ . Since the chains start in the same state and always take state transitions together, and since the chains $m_{\rho}$ and $m_{2j}$ coincide up to the state $k_{2j+1}$ we have $\nu_{2j}(X_{1}\neq Y_{1})\leq\sum_{k>k_{2j+1}}M_{\rho}(k)\to 0$ . Thus, $\bar{d}(\rho,\rho_{2j})\to 0$ , so that $\rho$ is a $B$ -process.
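The tail sum in the coupling bound can be evaluated in closed form. In the sketch below, $M_{\rho}(k)$ is read as the total initial mass of the states with index $k$ , that is, $\delta(1-\delta)^{k}$ whether the index carries a single state or a pair $u_{k}$ , $d_{k}$ of mass $\delta(1-\delta)^{k}/2$ each; this reading is an assumption made for the computation.

$$
\nu_{2j}(X_{1}\neq Y_{1})\;\leq\;\sum_{k>k_{2j+1}}\delta(1-\delta)^{k}
\;=\;(1-\delta)^{k_{2j+1}+1}\;\xrightarrow[j\to\infty]{}\;0,
$$

where the convergence holds because $\delta\geq 1/2$ and $k_{2j+1}\to\infty$ .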

Step 3.* Finally, it remains to show that the expected output of the test $D$ diverges if the test is run on two independent samples produced by $\rho$ .*

Recall that for all the chains $m_{2j}$ , $m_{u2j+1}$ and $m_{d2j+1}$ as well as for the chain $m_{\rho}$ , the initial probability of the state 0 is $\delta$ . By construction, if the process $m_{\rho}$ starts at the state 0 then up to the time step $k_{2j}$ it behaves exactly as $\rho_{2j}$ that has started at the state 0. In symbols, we have

[TABLE]

for $j\in\mathbb{N}$ , where $s_{0}^{x}$ and $s_{0}^{y}$ denote the initial states of the processes generating the samples $X$ and $Y$ correspondingly.

We will use the following simple decomposition

[TABLE]

From this, (14) and (11) we have

[TABLE]

For odd indices, if the process $\rho$ starts at the state 0 then (from the definition of $t_{2j+1}$ ) by the time $t_{2j+1}$ it does not reach the reset $R_{2j}$ ; therefore, in this case the value of the switch $S_{2j}$ does not change up to the time $t_{2j+1}$ . Since the definition of $m_{\rho}$ is symmetric with respect to the values $up$ and $down$ of each switch, the probability that two samples $x_{1},\dots,x_{t_{2j+1}}$ and $y_{1},\dots,y_{t_{2j+1}}$ generated independently by (two runs of) the process $\rho$ produce different values of the switch $S_{2j}$ when passing through it for the first time is 1/2. In other words, with probability 1/2 two samples generated by $\rho$ starting at the state 0 will look by the time $t_{2j+1}$ like two samples generated by $\rho_{u2j+1}$ and $\rho_{d2j+1}$ that have started at state 0. Thus

[TABLE]

for $j\in\mathbb{N}$ . Using this, (15), and (12) we obtain

[TABLE]

*Taking $\delta$ large and $\varepsilon$ small (e.g. $\delta=0.9$ and $\varepsilon=0.1$ ), we can make the bound (16) close to 0 and the bound (18) close to 1/2, and the expected output of the test will cross these values infinitely often. Therefore, we have shown that the expected output of the test $D$ diverges on two independent runs of the process $\rho$ , contradicting the consistency of $D$ . This contradiction concludes the proof. *
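To see how the choice $\delta=0.9$ , $\varepsilon=0.1$ separates the two regimes, the following sketch works through the arithmetic. It is not a reproduction of the displayed bounds of the proof; it assumes only that $D$ takes values in $[0,1]$ , that the two runs are independent so that the event $A=\{s_{0}^{x}=0,\,s_{0}^{y}=0\}$ has probability $\delta^{2}$ , that $\mathbb{E}_{\rho_{2j}\times\rho_{2j}}D_{t_{2j}}<\varepsilon$ and $\mathbb{E}_{\rho_{u2j+1}\times\rho_{d2j+1}}D_{t_{2j+1}}>1-\varepsilon$ , and the probability- $1/2$ statement made above for the odd indices.

$$
\begin{aligned}
\mathbb{E}_{\rho\times\rho}D_{t_{2j}}
 &\leq \delta^{2}\,\mathbb{E}_{\rho_{2j}\times\rho_{2j}}(D_{t_{2j}}\mid A)+(1-\delta^{2})
 \;<\;\varepsilon+(1-\delta^{2})=0.1+0.19=0.29, \\
\mathbb{E}_{\rho\times\rho}D_{t_{2j+1}}
 &\geq \frac{\delta^{2}}{2}\,\mathbb{E}_{\rho_{u2j+1}\times\rho_{d2j+1}}(D_{t_{2j+1}}\mid A)
 \;>\;\frac{1}{2}\bigl(\delta^{2}-\varepsilon\bigr)=\frac{0.81-0.1}{2}=0.355 .
\end{aligned}
$$

Under these assumptions the expected output is below $0.29$ along the even time indices and above $0.355$ along the odd ones, so it cannot converge; taking $\delta$ closer to 1 and $\varepsilon$ closer to 0 pushes the two values towards 0 and $1/2$ respectively.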

Chapter 3 Clustering and change-point problems

In the previous chapter we have considered some basic questions of statistical inference. It was established that, when speaking about stationary ergodic processes, one can answer questions like “which distribution is closer to which” but not “are these distributions the same,” based on samples. In this chapter we shall see how these questions come into play when considering more complex problems, namely, clustering and change-point problems.

Clustering is grouping together samples generated by the same distributions, while change-point problems are concerned with delimiting parts of a sample that are generated by a single process distribution. At first glance, it seems that this kind of questions should be impossible to solve, since we cannot even answer the simple “same-different” question about distributions. However, we shall see that often, and mainly in the case when the total number of different distributions is known, these questions can be reduced to answering the “which one is closer” question, and thus admit a solution.

None of the algorithms mentioned in this chapter presents any significant computational challenge, except perhaps for the calculation of the distributional distance (see Section 2 above about that). Therefore, we omit algorithmic and implementational details; the interested reader can find these in the corresponding papers that also present experimental evaluations of the algorithms: Khaleghi:15clust for clustering and Khaleghi:12mchp ; Khaleghi:14 ; Khaleghi:15chp for change-point problems. The material in this chapter is mainly after Ryabko:10clust ; Khaleghi:15clust for clustering and Ryabko:103s for change-point problems, with some results of Khaleghi:12mchp ; Khaleghi:14 ; Khaleghi:15chp ; Ryabko:17clin given without proofs.

1 Time-series clustering

Given a finite set of objects, the problem of “clustering” similar objects together, in the absence of any examples of “good” clusterings, is notoriously hard to formalize. Most of the work on clustering is concerned with particular parametric data-generating models, or with analysing particular algorithms, a given similarity measure, and (very often) a given number of clusters. It is clear that, as in almost all learning problems, in clustering finding the right similarity measure is an integral part of the problem. However, even if one assumes the similarity measure known, it is hard to define what a good clustering is Kleinberg:02 ; Zadeh:09 . What is more, even if one assumes the similarity measure to be simply the Euclidean distance (on the plane), and the number of clusters $k$ known, then clustering may still appear intractable for computational reasons Mahajan:09 .

The problem acquires a different angle when one wishes to cluster processes. That is, each data point is itself a time-series sample. This version of the problem has numerous applications, such as clustering biological data, financial observations, or behavioural patterns, and as such it has gained tremendous attention in the literature.

A crucial observation to make in the case of clustering processes is that one can benefit from the notion of ergodicity to define what appears to be a very natural notion of consistency. Ergodicity means that the distribution of a sample can be determined asymptotically, or approximated arbitrarily well if the sample is long enough. This makes the following goal achievable.

Given $N$ samples ${\bf x}_{1}=(x^{1}_{1},\dots,x^{1}_{n_{1}}),\dots,{\bf x}_{N}=(x^{N}_{1},\dots,x^{N}_{n_{N}})$ , each drawn by one out of $\kappa$ unknown process distributions, group together those and only those samples that were generated by the same distribution.

The samples ${\bf x}_{j}$ are not assumed to be drawn independently; rather, it is assumed that the joint distribution of the samples is stationary ergodic. The target clustering is as follows: those and only those samples are put into the same cluster that were generated by the same distribution. A clustering algorithm is called asymptotically consistent if it outputs only the correct answer with probability 1 from some $n$ on, where $n$ is the length of the shortest sample, $n:=\min\{n_{1},\dots,n_{N}\}.$ Note the particular asymptotic regime: not with respect to the number of samples $N$ , but with respect to the length of the samples $n_{1},\dots,n_{N}$ .

Clearly, the problem of clustering in this formulation is a direct generalisation of the three-sample problem of Section 3. Indeed, the latter problem can be seen as clustering $N=3$ samples into $\kappa=2$ clusters, where $\kappa$ is given. At the same time, the discrimination problem of Section 4 can be seen as clustering $N=2$ samples into either $\kappa=1$ or $\kappa=2$ clusters, with $\kappa$ unknown.

Anticipating, from this we can already see when it is possible and when it is not possible to have a consistent algorithm for clustering stationary ergodic time series.

There exists a consistent algorithm for clustering stationary ergodic time series if and only if the number of clusters $\kappa$ is known.

We proceed below with a more formal problem formulation and the exposition of the algorithm.

1 Problem formulation

The clustering problem can be defined as follows. $N$ samples ${\bf x}_{1},\dots,{\bf x}_{N}$ are given, where each sample ${\bf x}_{i}$ is of length $n_{i}$ : ${\bf x}_{i}=X^{i}_{1..n_{i}}$ . The samples are generated by a distribution $P$ over $(A^{N})^{\infty}$ , that is, a distribution that generates an infinite sequence of $N$ -tuples.

[TABLE]

The marginal distribution of each sequence $X^{i}_{1..n_{i},..}$ is one out of $\kappa$ different (and unknown) stationary ergodic distributions $\rho_{1},\dots,\rho_{\kappa}\in\mathcal{E}$ . Note that we allow the samples ${\bf x}_{1},\dots,{\bf x}_{N}$ to be dependent; the only requirement is on the marginal distributions (they should be stationary ergodic). Thus, there is a partitioning ${\mathcal{G}}=\{{\mathcal{G}}_{1},\dots,{\mathcal{G}}_{\kappa}\}$ of the set $\{1..N\}$ into $\kappa$ disjoint subsets ${\mathcal{G}}_{j},j=1..\kappa$

[TABLE]

such that ${\bf x}_{i}$ , $1\leq i\leq N$ , is generated by $\rho_{j}$ if and only if $i\in{\mathcal{G}}_{j}$ . The partitioning ${\mathcal{G}}$ is called the target (or ground-truth) clustering and the sets ${\mathcal{G}}_{i},1\leq i\leq\kappa$ , are called the target clusters. Given samples ${\bf x}_{1},\dots,{\bf x}_{N}$ and a target clustering ${\mathcal{G}}$ , let ${\mathcal{G}}({\bf x})$ denote the cluster that contains ${\bf x}$ .

A clustering function $F$ takes a finite number of samples ${\bf x}_{1},\dots,{\bf x}_{N}$ and a parameter $k$ (the target number of clusters) and outputs a partition $F({\bf x}_{1},\dots,{\bf x}_{N},(k))=\{T_{1},\dots,T_{k}\}$ of the set $\{1..N\}$ .

Definition 1.1 (asymptotic consistency).

Let a finite number $N$ of samples be given, and let the target clustering partition be ${\mathcal{G}}$ . Define $n=\min\{n_{1},\dots,n_{N}\}$ . A clustering function $F$ is strongly asymptotically consistent if

[TABLE]

from some $n$ on with probability 1. A clustering function is weakly asymptotically consistent if

[TABLE]

Note that the consistency is asymptotic with respect to the minimal length of the sample, and not with respect to the number of samples.

2 A clustering algorithm and its consistency

Here we present an algorithm that is shown to be asymptotically consistent in the general framework introduced. What makes this simple algorithm interesting is that it requires only $\kappa N$ distance calculations (where $\kappa$ is the number of clusters), that is, much less than is needed to calculate the distance between each two sequences.

In short, Algorithm 1 initialises the clusters using farthest-point initialisation, and then assigns each remaining point to the nearest cluster. More precisely, the sample ${\bf x}_{1}$ is assigned as the first cluster centre. Then a sample is found that is farthest away from ${\bf x}_{1}$ in the empirical distributional distance $\hat{d}$ and is assigned as the second cluster centre. For each $k=2..\kappa$ the $k^{\text{th}}$ cluster centre is sought as the sequence with the largest minimum distance from the already assigned cluster centres for $1..k-1$ . By the last iteration we have $\kappa$ cluster centres. (This initialisation procedure was proposed in Katsavounidis:94 in the context of $k$ -means clustering.) Next, the remaining samples are each assigned to the closest cluster.
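The procedure just described (farthest-point initialisation followed by nearest-centre assignment) can be sketched in a few lines. The code below is an illustration, not the authors' implementation. In particular, `emp_dist` is a simplified stand-in for the empirical distributional distance $\hat{d}$ : it compares word frequencies up to a hypothetical maximal length `max_len` with assumed weights $2^{-k}$ , whereas the distance used in the text is defined in Section 2. The function names and the Bernoulli test data are likewise assumptions.

```python
import random
from collections import Counter

def word_freqs(x, k):
    """Empirical frequencies of the length-k words occurring in sequence x."""
    n = len(x) - k + 1
    if n <= 0:
        return {}
    counts = Counter(tuple(x[i:i + k]) for i in range(n))
    return {w: c / n for w, c in counts.items()}

def emp_dist(x, y, max_len=3):
    """Simplified stand-in for d-hat (assumed weights 2^-k, words up to max_len)."""
    total = 0.0
    for k in range(1, max_len + 1):
        fx, fy = word_freqs(x, k), word_freqs(y, k)
        words = set(fx) | set(fy)
        total += 2.0 ** (-k) * sum(abs(fx.get(w, 0.0) - fy.get(w, 0.0))
                                   for w in words)
    return total

def cluster(samples, kappa, dist=emp_dist):
    """Farthest-point initialisation, then assignment to the nearest centre."""
    n_samples = len(samples)
    centres = [0]                       # the first sample is the first centre
    near_d = [dist(samples[i], samples[0]) for i in range(n_samples)]
    near_c = [0] * n_samples            # index (into centres) of nearest centre
    for k in range(1, kappa):
        # next centre: the sample with the largest minimum distance to centres
        c = max(range(n_samples), key=lambda i: near_d[i])
        centres.append(c)
        for i in range(n_samples):
            d = dist(samples[i], samples[c])
            if d < near_d[i]:
                near_d[i], near_c[i] = d, k
    clusters = [[] for _ in range(kappa)]
    for i in range(n_samples):
        clusters[near_c[i]].append(i)
    return clusters

# Hypothetical data: three i.i.d. Bernoulli(0.1) and three Bernoulli(0.9) samples.
rng = random.Random(0)
def bernoulli(p, n):
    return [1 if rng.random() < p else 0 for _ in range(n)]

samples = ([bernoulli(0.1, 2000) for _ in range(3)]
           + [bernoulli(0.9, 2000) for _ in range(3)])
clusters = cluster(samples, kappa=2)
```

Keeping, for every sample, the running minimum distance to the centres chosen so far means that each new centre costs exactly $N$ distance evaluations, which gives the $\kappa N$ total mentioned above.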

Theorem 1.2.

*Algorithm 1 is strongly asymptotically consistent provided that the correct number $\kappa$ of clusters is known, and the marginal distribution of each sequence ${\bf x}_{i},i=1..N$ is stationary ergodic. *

The main idea of the proof is as follows. Lemma 1.2 implies that, if the samples in $S$ are long enough, the samples that are generated by the same process distribution are closer to each other than to the rest of the samples. Therefore, the samples chosen as cluster centres are each generated by a different process distribution. The theorem then follows from the fact that the algorithm assigns the rest of the samples to the closest clusters.

Proof 1.3.

Let $n_{\min}$ denote the shortest sample length in $S$ :

[TABLE]

Denote by $\delta$ the minimum nonzero distance between the process distributions:

[TABLE]

Fix $\varepsilon\in(0,\delta/4)$ . Since there are a finite number $N$ of samples, by Lemma 1.2 for all large enough $n_{\min}$ we have

[TABLE]

where ${\mathcal{G}}_{k},~{}k=1..\kappa$ denote the ground-truth partitions. By (2) and applying the triangle inequality we obtain

[TABLE]

Thus, for all large enough $n_{\min}$ we have

[TABLE]

where the first inequality follows from the triangle inequality, and the second inequality follows from (2) and the definition of $\delta$ . In words, (3) and (1.3) mean that the samples in $\mathcal{S}$ that are generated by the same process distribution are closer to each other than to the rest of the samples. Finally, for all $n_{\min}$ large enough to have (3) and (1.3) we obtain

[TABLE]

*where, as specified by Algorithm 1, $c_{1}:=1$ and $c_{k}:=\underset{i=1..N}{\operatorname{argmax}}\displaystyle\min_{j=1..k-1}\hat{d}({\bf x}_{i},{\bf x}_{c_{j}}),~{}k=2..\kappa$ . Hence, the indices $c_{1},\dots,c_{\kappa}$ will be chosen to index sequences generated by different process distributions. To derive the consistency statement, it remains to note that, by (3) and (1.3), each remaining sequence will be assigned to the cluster centre corresponding to the sequence generated by the same distribution. *
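The separation step of the proof can be summarised in one line of arithmetic. The following is a sketch under the assumption that, for all large enough $n_{\min}$ , the consistency of $\hat{d}$ gives $\hat{d}({\bf x}_{i},\rho_{k})\leq\varepsilon$ for every $i\in{\mathcal{G}}_{k}$ , and that the triangle inequality may be applied with samples and distributions mixed. With $\delta$ the minimum nonzero distance between the process distributions and $\varepsilon<\delta/4$ , for $i,j\in{\mathcal{G}}_{k}$ and $i'\in{\mathcal{G}}_{k'}$ , $k'\neq k$ :

$$
\begin{aligned}
\hat{d}({\bf x}_{i},{\bf x}_{j}) &\leq \hat{d}({\bf x}_{i},\rho_{k})+\hat{d}({\bf x}_{j},\rho_{k})\leq 2\varepsilon<\frac{\delta}{2}, \\
\hat{d}({\bf x}_{i},{\bf x}_{i'}) &\geq d(\rho_{k},\rho_{k'})-\hat{d}({\bf x}_{i},\rho_{k})-\hat{d}({\bf x}_{i'},\rho_{k'})\geq\delta-2\varepsilon>\frac{\delta}{2}.
\end{aligned}
$$

Hence any sample is strictly closer to every sample from its own cluster than to any sample from another cluster, which is what both the farthest-point initialisation and the nearest-centre assignment rely on.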

3 Extensions: unknown $k$ , online clustering and clustering with respect to independence

In this section we briefly consider several extensions and modifications of the process clustering problem. The problems are only outlined, and the details are left out; the interested reader is referred to the corresponding papers that treat each of these problems in detail.

Unknown number of clusters

As mentioned at the beginning of this section, if the number of clusters $\kappa$ is unknown, then the problem provably has no solution. Thus, if we really want to have a consistent algorithm that does not require $\kappa$ , then something has to give. Sacrificing the generality is one way of doing it. Clearly, if we assume that the speed of convergence of frequencies has a known upper bound, as is the case when time series are i.i.d. or mixing (with a bound on the mixing coefficient), then everything becomes possible. The resulting time-series clustering problem is still interesting, but clearly falls out of the scope of this volume. A simple example of an algorithm that is consistent in this setting can be found in Ryabko:10clust ; Khaleghi:15clust . It is worth noting that it remains open to establish tight upper and lower bounds on the error probability of clustering algorithms even for the case of i.i.d. time series.

Online clustering

An interesting and practical modification of the clustering problem consists in taking it “online.” On each time step, new samples are revealed, which can either be a continuation of some of the time series available on the previous steps, or form a new time series. The asymptotic setting commands that the length of each time series should grow to infinity, as should the number of time series, though they may do so in an arbitrary manner. As before, the only requirement we would like to make is that the marginal distribution of each of the processes is stationary and ergodic. There are only $\kappa$ different marginal distributions; the number $\kappa$ of these distributions is known, but this is all the information we get.

Let us describe the problem a little more formally. Consider the two-way infinite matrix ${\bf X}$ of $A$ -valued random variables

[TABLE]

generated by some probability distribution $P$ on $((A^{\infty})^{\infty},\mathcal{B}_{2})$ , where $\mathcal{B}_{2}$ is the corresponding Borel sigma-algebra. The matrix ${\bf X}$ can be seen as an infinite sequence of infinite sequences; since $(A,\mathcal{B}_{1})$ is a standard probability space, so is $(A^{\infty},\mathcal{B}_{\infty})$ and thus $((A^{\infty})^{\infty},\mathcal{B}_{2})$ is well-defined (e.g., Gray:88 ).

Assume that the marginal distribution of $P$ on each row of ${\bf X}$ is one of $\kappa$ unknown stationary ergodic process distributions $\rho_{1},\rho_{2},\dots,\rho_{\kappa}$ . Thus, the matrix $\bf{X}$ corresponds to infinitely many one-way infinite sequences, each of which is generated by a stationary ergodic distribution. Aside from this assumption, we do not make any further assumptions on the distribution $P$ that generates ${\bf X}$ . This means that the rows of ${\bf X}$ (corresponding to different time-series samples) are allowed to be dependent, and the dependence can be arbitrary; one can even think of the dependence between samples as adversarial. For notational convenience we assume that the distributions $\rho_{k},k=1..\kappa$ are ordered based on the order of appearance of their first rows (samples) in ${\bf X}$ .

As in the offline setting, the ground-truth partitioning of $\bf{X}$ is defined by grouping the rows that have the same marginal distribution. Let

$${\mathcal{G}}=\{{\mathcal{G}}_{1},\dots,{\mathcal{G}}_{\kappa}\}$$

be a partitioning of $\mathbb{N}$ into $\kappa$ disjoint subsets ${\mathcal{G}}_{k},~{}k=1..\kappa$ , such that the marginal distribution of ${\bf x}_{i}$ , $i\in\mathbb{N}$ is $\rho_{k}$ for some $k\in 1..\kappa$ if and only if $i\in{\mathcal{G}}_{k}$ . The partitioning ${\mathcal{G}}$ is called the ground-truth clustering.

Introduce also the notation ${\mathcal{G}}|_{N}$ for the restriction of ${\mathcal{G}}$ to the first $N$ sequences:

$${\mathcal{G}}|_{N}:=\{{\mathcal{G}}_{k}\cap\{1..N\}:k=1..\kappa\}.$$

At every time step $t\in\mathbb{N}$ , a part $S(t)$ of ${\bf X}$ is observed corresponding to the first $N(t)\in\mathbb{N}$ rows of ${\bf X}$ , each of length $n_{i}(t),i\in 1..N(t)$ , i.e.

[TABLE]

We assume that the number of samples, as well as the individual sample-lengths grow with time. That is, the length $n_{i}(t)$ of each sequence ${\bf x}_{i}$ is nondecreasing and grows to infinity (as a function of time $t$ ). The number of sequences $N(t)$ also grows to infinity. Aside from these assumptions, the functions $N(t)$ and $n_{i}(t)$ are completely arbitrary.

An algorithm is called asymptotically consistent in the online setting, if, for every $N$ w.p.1 from some point on the clustering $C$ output by the algorithm coincides with the ground-truth on the first $N$ samples, i.e. $C|_{N}={\mathcal{G}}|_{N}$ .
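To make the comparison $C|_{N}={\mathcal{G}}|_{N}$ concrete, the following minimal Python sketch restricts two clusterings to the first $N$ rows and checks that they coincide as partitions. The function names `restrict` and `agrees_on_first` are hypothetical, and clusterings are assumed to be given as finite collections of sets of row indices, a finite stand-in for the partition of $\mathbb{N}$.

```python
def restrict(clustering, n_rows):
    """Restrict a clustering (iterable of sets of row indices) to rows 1..n_rows."""
    restricted = {frozenset(i for i in cluster if i <= n_rows) for cluster in clustering}
    restricted.discard(frozenset())  # clusters with no row among the first n_rows vanish
    return restricted


def agrees_on_first(clustering, ground_truth, n_rows):
    """True when the two partitions coincide on rows 1..n_rows (labels are irrelevant)."""
    return restrict(clustering, n_rows) == restrict(ground_truth, n_rows)
```

Consistency in the online sense asks that, for every fixed $N$, this check eventually succeeds and keeps succeeding, even though the clustering of later rows may still be wrong.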

It turns out that this setting admits a consistent clustering algorithm.

Theorem 1.4.

*There exists an algorithm that is asymptotically consistent in the online setting, provided that the marginal distribution of each sequence ${\bf x}_{i}$, $i\in\mathbb{N}$, is stationary ergodic.*

The proof of this theorem, along with the corresponding algorithm, can be found in Khaleghi:15clust .

It is worth noting that the main challenge in constructing such an algorithm is the fact that, on every time step $t$ , we do not know whether all of the $\kappa$ different distributions are already present, or the $N(t)$ are generated by fewer than $\kappa$ different distributions. The solution is based on a weighted average of clusterings, each constructed based on the first $N$ rows, with carefully selected weights.

Clustering with respect to independence

The clustering problem considered in the previous sections may be seen as clustering with respect to distribution: putting together those and only those samples that are generated by the same distribution. Another way to look at clustering time series is grouping them with respect to (in)dependence. Thus, the problem is as follows.

Given a set $S=({\bf x}_{1},\dots,{\bf x}_{N})$ of samples, it is required to find the finest partitioning $\{U_{1},\dots,U_{k}\}$ of $S$ into clusters such that the clusters $U_{1},\dots,U_{k}$ are mutually independent.

The formal model is the same as in clustering with respect to distribution: the probability distribution is that on the space of infinite sequences of $N$-tuples (1). However, in this setting we require the joint distribution to be stationary ergodic, whereas before we only had to put this constraint on the marginal distributions of the samples.

What makes this problem very different from the previous one, and, in fact, from the rest of the problems considered in the clustering literature, is that, since mutual independence is the target, pairwise similarity measurements are of no use. Therefore, traditional clustering algorithms are inapplicable, since they are based on calculating some distance between pairs of objects (in the case of the previous sections, time-series samples) ${\bf x}_{i},{\bf x}_{j}$ .

Thus, to solve this problem we have to go back to first principles and first consider what we should do if the joint distribution of all the samples is known. After that, it is instructive to consider i.i.d. samples, before turning to stationary ergodic distributions. While a detailed consideration of this problem would take us outside the scope of this volume, it is worth mentioning here in which cases a solution to this problem exists, and some ideas behind it.

For stationary ergodic distributions a consistent algorithm can be constructed provided the correct number of clusters is known. The algorithm is based on calculating empirical estimates of the following measure of independence between groups of samples. In the expression below, $h()$ stands for Shannon entropy, and $[\cdot]^{l}$ is a quantization of the random variable in question to the cells of a partition similar to $B^{m,l}$ but finite.

Definition 1.5 (sum-information).

For stationary processes $x_{1},\dots,x_{k}$ define the sum-information

[TABLE]

This quantity has certain similarities to the distributional distance: it is also a weighted sum of certain discrepancies between marginal distributions of growing dimension. However, instead of simple differences in probabilities, we are using entropy, and whereas before we were considering only pairs of random variables, here we have generalized this to groups of arbitrary sizes. Note also that this is not an estimator but a theoretical quantity; to estimate it empirically, one replaces the probabilities $[X^{i}_{1..m}]^{l}$ with the corresponding frequencies.
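As an illustration of replacing probabilities by frequencies, the sketch below computes a plug-in estimate of a single entropy discrepancy: the sum of the individual block entropies minus the joint block entropy, for one fixed block length $m$. It is not the sum-information itself: the weighting over block lengths and the quantization are left out, a finite alphabet is assumed, and the names `block_entropy` and `dependence_gap` are hypothetical.

```python
from collections import Counter
from math import log2


def block_entropy(rows, m):
    """Plug-in Shannon entropy (bits) of the joint m-block of equally long rows."""
    n = len(rows[0])
    # each joint symbol collects positions t..t+m-1 of every row in the group
    blocks = [tuple(tuple(r[t:t + m]) for r in rows) for t in range(n - m + 1)]
    total = len(blocks)
    return -sum(c / total * log2(c / total) for c in Counter(blocks).values())


def dependence_gap(rows, m):
    """Sum of individual block entropies minus the joint block entropy."""
    return sum(block_entropy([r], m) for r in rows) - block_entropy(rows, m)
```

For two identical binary rows with equal symbol frequencies the gap at $m=1$ is one bit, while for rows whose pairs of symbols are equally frequent in all four combinations it is zero.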

The details of the algorithms and proofs can be found in Ryabko:17clin . It is worth noting that the online version of this problem (akin to the one considered in Section 3) so far remains unexplored.

2 Change-point problems

Change-point problems are concerned with sequences in which the distribution of the data changes over time in an abrupt manner. The latter means that the sequence can be divided into segments, such that each segment is generated by a single time-series distribution, and between the segments the distributions are different.

It is another classical problem, with a vast literature on both parametric (see e.g. basseville:93 ) and non-parametric (see e.g. brodsky:93 ) methods for solving it. As usual in statistics, most of the literature deals with the case of i.i.d. data within each segment, with generalisations to dependent data reaching up to and including distributions with mixing brodsky:93 ; Giraitis:95 . An important exception is the work Carlstein:93 , which considers stationary ergodic sequences. The latter work makes a further assumption that the single-dimensional marginals (of $X_{i}$) before and after the change point are different. As was shown in Ryabko:103s , this assumption is not necessary; here, as in the preceding sections, we follow this latter approach.

Change-point problems can be roughly divided into estimation problems and detection problems. To better explain this, consider the case of a single change. A sample $Z_{1},\dots,Z_{n}$ is given, where, for a certain $\theta\in(0,1)$ , $Z_{1},\dots,Z_{\lfloor n\theta\rfloor}$ are generated according to some distribution $\rho_{X}$ and $Z_{{\lfloor n\theta\rfloor}+1},\dots,Z_{n}$ are generated according to some distribution $\rho_{Y}$ . Change-point estimation is about finding the parameter $\theta$ (or, equivalently, the change point ${\lfloor n\theta\rfloor}$ ), knowing that it exists, that is, knowing that $\rho_{Y}\neq\rho_{X}$ . On the other hand, detection problems are concerned with determining whether there is a change point in the first place, that is, finding out whether $\rho_{X}=\rho_{Y}$ . Various formulations exist, mainly focusing on detecting the change quickly after it appears.

Given the results of the preceding sections, it should be clear at this point that if all we know is that all the distributions in question are stationary and ergodic, then it is, in general, not possible to tell whether there is a change point in the sequence or not. Thus, we will be only concerned with change-point estimation problems.

Another point that needs to be clarified is the asymptotic regime that we are using. We are working with a single sample of a fixed size, $n$ ,

[TABLE]

yet the statements will be about what happens when $n$ goes to infinity. In fact, we are talking about two samples whose lengths grow to infinity. If we imagine them being stuck together and each increasing in length to the right, then this would somehow make the change point obvious each time the length of sample to the left of it increments. This is why we are not considering an “online” setting where the samples would grow. Rather, we are considering only an “offline” version, where the sample is fixed. In this setting, saying that, for example, the estimate $\hat{\theta}$ approaches $\theta$ as $n$ grows to infinity simply means that for large enough $n$ , $\hat{\theta}$ is arbitrarily close to $\theta$ , and does not mean that the algorithm is dealing with samples of increasing sizes.

An important constraint, which is present in one way or another in all the change-point models, is on how far a change point can be from the boundaries of the samples. Indeed, if, say, $Z_{1}$ is generated by one distribution but already $Z_{2}$ by another, so that the change point occurs at time step 2, then hardly any algorithm can make any meaningful inference. A common way to tackle this is to require the size of each segment (generated by a single distribution) to be linear in the length of the whole combined sample, $n$. This is made explicit in the formulations we adopt, where we refer to change points as $\theta n$, and the goal is to estimate $\theta$. Moreover, given the fact that there are no speeds of convergence available for (stationary) ergodic distributions, this requirement is essential, since the initial $o(n)$ part of any sample can be effectively arbitrary, whatever function one assumes in that $o(n)$. Thus, we can state the following.

Consistent change-point estimation algorithms for stationary ergodic processes are only possible under the constraint that the length of each segment generated by a single distribution is linear in the total sample size $n$.

In this chapter we treat in detail the case of a single change point. Extensions to multiple change points are given without proofs, referring the interested reader to the corresponding papers. However, it is worth noting that, in spite of the impossibility of discriminating between processes and thus detecting a change point, the case of an unknown number of change points is not entirely hopeless, and in fact in some cases admits a solution that does not require putting further restrictions on the distributions generating the data.

1 Single change point

The sample $Z=(Z_{1},\dots,Z_{n})$ is the concatenation of two parts $X=(X_{1},\dots,X_{\lfloor n\theta\rfloor})$ and $Y=(Y_{1},\dots,Y_{m})$, where $m=n-{\lfloor n\theta\rfloor}$, so that $Z_{i}=X_{i}$ for $1\leq i\leq{\lfloor n\theta\rfloor}$ and $Z_{{\lfloor n\theta\rfloor}+j}=Y_{j}$ for $1\leq j\leq m$. The samples $X$ and $Y$ are generated by two different stationary ergodic processes with alphabet $A$. The distributions of the processes are unknown. The value ${\lfloor n\theta\rfloor}$ is called the change point. Moreover, in this first setting, we assume that $\theta$ is bounded away from 0 and from 1 with known upper and lower bounds: $\alpha n<{\lfloor n\theta\rfloor}<\beta n$ for some known $0<\alpha\leq\beta<1$ (for sufficiently large $n$). In the next setting we shall discuss how to get rid of this assumption.

It is required to estimate $\theta$ (or, equivalently, the change point ${\lfloor n\theta\rfloor}$ ) based on the sample $Z$ .

For each $t$ , $1\leq t\leq n$ , denote $U^{t}$ the sample $(Z_{1},\dots,Z_{t})$ consisting of the first $t$ elements of the sample $Z$ , and denote $V^{t}$ the remainder $(Z_{t+1},\dots,Z_{n})$ .

Definition 2.1 (Change point estimator).

Define the change-point estimate $\hat{\theta}:A^{*}\rightarrow(0,1)$ as follows:

[TABLE]
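The estimator can be sketched in Python as follows, assuming (as the proof below suggests) that $\hat{\theta}$ is $1/n$ times the split point $t\in[\alpha n,\beta n]$ maximizing $\hat{d}(U^{t},V^{t})$. The sketch is not the exact definition: the distance is truncated to words of length at most `max_len`, the weights $1/(m(m+1))$ are an assumed choice of summable weights, a finite alphabet is assumed, and all function names are hypothetical.

```python
from collections import Counter


def word_frequencies(sample, length):
    """Empirical frequencies nu(sample, B) of all words B of the given length."""
    total = len(sample) - length + 1
    if total <= 0:
        return {}
    counts = Counter(tuple(sample[i:i + length]) for i in range(total))
    return {word: c / total for word, c in counts.items()}


def empirical_distance(x, y, max_len=3):
    """Truncated empirical distributional distance between samples x and y."""
    dist = 0.0
    for m in range(1, max_len + 1):
        fx, fy = word_frequencies(x, m), word_frequencies(y, m)
        if not fx or not fy:
            continue  # a sample shorter than m has no words of length m
        weight = 1.0 / (m * (m + 1))  # assumed summable weights
        dist += weight * sum(abs(fx.get(w, 0.0) - fy.get(w, 0.0)) for w in set(fx) | set(fy))
    return dist


def estimate_change_point(z, alpha, beta, max_len=3):
    """Return t*/n, where t* maximizes the distance between z[:t] and z[t:]."""
    n = len(z)
    candidates = range(max(1, int(alpha * n)), min(n - 1, int(beta * n)) + 1)
    best_t = max(candidates, key=lambda t: empirical_distance(z[:t], z[t:], max_len))
    return best_t / n
```

For instance, on the deterministic sample `[0] * 60 + [1] * 40` with `alpha=0.1` and `beta=0.9`, the split at $t=60$ is the only one that separates the two parts completely, so the sketch returns 0.6.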

The following theorem establishes asymptotic consistency of this estimator.

Theorem 2.2.

For the estimate $\hat{\theta}$ of the change point $\theta n$ we have

[TABLE]

*where $n$ is the size of the sample.*

Proof 2.3.

Denote $k:={\lfloor n\theta\rfloor}$. To prove the statement, we will show that, for every $\gamma$, $0<\gamma<1$, with probability 1 the inequality $\hat{d}(U^{t},V^{t})<\hat{d}(X,Y)$ holds for each $t$ such that $\alpha n\leq t<\gamma k$, possibly except for a finite number of times (in $n$). Thus we will show that linear $\gamma$-underestimates occur only a finite number of times; the argument for overestimates is analogous. Fix some $\gamma$, $0<\gamma<1$, and $\varepsilon>0$. Let $J$ be big enough to have $\sum_{i=J}^{\infty}w_{i}<\varepsilon/2$ and also big enough to have an index $j<J$ for which $\rho_{X}(B_{j})\neq\rho_{Y}(B_{j})$. Take $M_{\varepsilon}\in\mathbb{N}$ large enough to have $|\nu(Y,B_{i})-\rho_{Y}(B_{i})|\leq\varepsilon/(2J)$ for all $m>M_{\varepsilon}$ and for each $i$, $1\leq i\leq J$, and also to have $|B_{i}|/m<\varepsilon/J$ for each $i$, $1\leq i\leq J$. This is possible since empirical frequencies converge to the limiting probabilities a.s.; note that $M_{\varepsilon}$ depends on $Y_{1},Y_{2},\dots$ (cf. the proof of Lemma 1.2). Find a $K_{\varepsilon}$ (that depends on $X$) such that for all $k>K_{\varepsilon}$ and for all $i$, $1\leq i\leq J$, we have

[TABLE]

(this is possible simply because $\alpha n\to\infty$ ). Furthermore, we can select $K_{\varepsilon}$ large enough to have

[TABLE]

for each $s\leq\gamma k$ : this follows from (7) and the identity

[TABLE]

So, for each $s\in[\alpha n,\gamma k]$ we have

[TABLE]

for $k>K_{\varepsilon}$ and $m>M_{\varepsilon}$ (from the definitions of $K_{\varepsilon}$ and $M_{\varepsilon}$ ). Hence

[TABLE]

for some $\delta_{j}$ that depends only on $k/m$ and $\gamma$ . Summing over all $B_{i}$ , $i\in\mathbb{N}$ , we get

[TABLE]

*for all $n$ such that $k>K_{\varepsilon}$ and $m>M_{\varepsilon}$, which is positive for small enough $\varepsilon$.*

2 Multiple change points, known number of change points

The following generalization is considered in this section. First, the number of change points is allowed to be arbitrary, though it still has to be known. Second, we get rid of the assumption that there is a known lower bound on the distance between a change point and the sequence boundaries (its start and its end), as well as between change points.

The details of the algorithm and the proof of its consistency are omitted and can be found in Khaleghi:15chp .

The problem is as follows. A sample

[TABLE]

is given, which is formed as the concatenation of $\kappa+1$ non-overlapping segments, where $\kappa\in\mathbb{N}$ and $0<\theta_{1}<\dots<\theta_{\kappa}<1$ . Each segment is generated by some unknown process distribution. The distributions that generate every pair of consecutive segments are different. The parameters $\theta_{k},~{}k=1..\kappa$ specifying the change points $\lfloor n\theta_{k}\rfloor$ are unknown and have to be estimated. The distributions that generate the segments are unknown, but are assumed to be stationary ergodic. A formal probabilistic model for this process is via considering the matrix of random variables (1), where the marginal distribution of each row is stationary ergodic. The sample ${\bf x}$ is then formed by concatenating parts of these rows.

Denote for convenience $\theta_{0}:=0$ and $\theta_{\kappa+1}:=1$ and define the minimal distance between change points as

[TABLE]

Let us first assume that there is a known lower bound $\lambda>0$ on this parameter: $\lambda<\lambda_{\min}$ . Then, knowing this lower bound and the number of change points $\kappa$ , one can construct a consistent algorithm as follows.

Break the whole sample ${\bf x}$ into short consecutive segments, each of which cannot contain more than one change point (the actual algorithm, proposed in Khaleghi:15chp , uses segments of length $n\lambda_{\min}/3$). Find a candidate change point in each of the segments, using the single-change-point algorithm of the previous section. Then, select $\kappa$ of these candidate change points that maximize the following scoring function. The scoring function $\Delta_{{\bf x}}(a,b)$ takes an arbitrary segment $(a,b)$ in the sample and measures how close to each other (in the distributional distance) its first and second halves are:

[TABLE]

The reason the algorithm works is as follows. The single-change point estimates are consistent in the case there is exactly one change point in the segment they are applied to; this can be demonstrated in the same way the single change-point estimator was proven consistent in the previous section. Next, the segments that do not contain any change point will see their score (10) converge to 0, while those that do contain a change point, to a non-zero constant. Since we know how many change points there are, it suffices to select the $\kappa$ highest-scoring ones.
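The outline above can be turned into a rough Python sketch. It is only an illustration under several assumptions that are not taken from Khaleghi:15chp : the distance is truncated to short words with weights $1/(m(m+1))$, segments have length $n\lambda/3$ for the known bound $\lambda$, each candidate is scored by the distance between the stretches of one segment length before and after it, and change points that fall exactly on a segment boundary are not handled. All function names are hypothetical.

```python
from collections import Counter


def word_frequencies(sample, length):
    """Empirical frequencies of all words of the given length in the sample."""
    total = len(sample) - length + 1
    if total <= 0:
        return {}
    counts = Counter(tuple(sample[i:i + length]) for i in range(total))
    return {word: c / total for word, c in counts.items()}


def empirical_distance(x, y, max_len=3):
    """Truncated empirical distributional distance (weights are an assumption)."""
    dist = 0.0
    for m in range(1, max_len + 1):
        fx, fy = word_frequencies(x, m), word_frequencies(y, m)
        if not fx or not fy:
            continue  # a sample shorter than m has no words of length m
        weight = 1.0 / (m * (m + 1))
        dist += weight * sum(abs(fx.get(w, 0.0) - fy.get(w, 0.0)) for w in set(fx) | set(fy))
    return dist


def candidate_in_segment(x, a, b):
    """Single-change-point candidate inside x[a:b]: the split maximizing the distance."""
    return max(range(a + 1, b), key=lambda t: empirical_distance(x[a:t], x[t:b]))


def score(x, c, half_width):
    """Distance between the stretches of length half_width before and after position c."""
    left = x[max(0, c - half_width):c]
    right = x[c:min(len(x), c + half_width)]
    return empirical_distance(left, right)


def estimate_change_points(x, kappa, lam):
    """Return kappa estimated change-point positions, given a lower bound lam on lambda_min."""
    n = len(x)
    seg_len = max(2, int(n * lam / 3))
    starts = range(0, n - 1, seg_len)  # every segment has at least two elements
    candidates = [candidate_in_segment(x, a, min(n, a + seg_len)) for a in starts]
    ranked = sorted(candidates, key=lambda c: score(x, c, seg_len), reverse=True)
    return sorted(ranked[:kappa])
```

On the deterministic sample `[0] * 45 + [1] * 40 + [0] * 65` with `kappa=2` and `lam=0.2`, the two candidates that separate their neighbourhoods completely are positions 45 and 85, and these receive the highest scores.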

The next step is to get rid of the requirement of a known bound $\lambda$ on $\lambda_{\min}$ . This is done by constructing a series of $\kappa$ -tuples of change-point estimators, each for a different value of a candidate $\lambda_{\min}$ , which are then combined with carefully selected weights. This gives the following theorem.

Theorem 2.4.

*There exists an algorithm for finding $\kappa$ change points that is asymptotically consistent provided each segment is generated by a stationary ergodic distribution and $\kappa$ is known.*


For stationary ergodic time series, asymptotically consistent estimation of multiple change points whose number is known is possible without any extra assumptions, besides that the length of each segment is linear in the sample size $n$ .

3 Unknown number of change points

The result on impossibility of process discrimination (Section 4) implies that it is provably impossible to distinguish between the cases of 0 and 1 change point for stationary ergodic samples. Yet, it appears impractical to assume that the exact number of change points is given to an algorithm. Thus, a search for other, more constrained, formulations is warranted. Two such formulations are briefly considered here: providing an exhaustive list of change points, and the case of a known number of different distributions but an unknown number of change points. The details of the algorithms and proofs are left out, and can be found in the corresponding papers Khaleghi:12mchp ; Khaleghi:14 . In both of these formulations we assume a known lower bound on the distance $\lambda_{\min}$ between the change points (9).

It is worth making a distinction with the related problem of clustering. In that problem, if the number of clusters $\kappa$ is unknown, then all we can do is resort to more restrictive assumptions on the process distributions (see Section 3). On the other hand, for the change-point problem, it is still possible to get around the fact that the number of change points $\kappa$ is unknown, while only assuming that the process distributions are stationary ergodic. Specifically, one formulation that allows us to do it is assuming that the total number of distributions is known (Section 3 below). Indeed, the number of distributions defines the number of clusters in the clustering problem, but, in change-point problems, still allows the number of change points to be arbitrary.

Listing change points

Not knowing the number of change points, one could try to provide a ranked list of change points that should include all the “true” change points, and possibly also other, spurious, points. The longest such list has $n$ elements, where $n$ is the size of the sample, or $\lceil 1/\lambda\rceil$ elements if we assume that the minimum distance between the change points is lower-bounded by $\lambda$. Such a ranked list could be useful if we knew that the first $\kappa$ listed change points are the true change points, even if the rest of the listed points are extraneous. It turns out that this is indeed achievable. The algorithm is very similar to the one of the preceding section, with (10) used as a ranking function and the single-change-point algorithm used to find candidate change points in each of the segments. The result one can obtain is thus the following Khaleghi:12mchp .

Theorem 2.5.

*There exists an algorithm that, given a sample ${\bf x}$ (8) generated by stationary ergodic distributions, provides a list of change-point candidates which has the property that, with probability 1 as $n$ goes to infinity, from some $n$ on its first $\kappa$ elements are within $o(n)$ of the change points $\theta_{i}$, $i=1..\kappa$.*

Known number of distributions, unknown number of change points

A sample with $\kappa$ change points can be, in general, generated by $\kappa+1$ different distributions. However, it can be generated by fewer distributions too, for example, by only two distributions irrespective of the value of $\kappa$. This formulation with the total number of distributions $r$ smaller than $\kappa+1$ may make sense in various applications. For example, imagine a text written by two authors, each of whom wrote many different parts of the text. Here the number of distributions is 2 and is known a priori, but the number of change points may be large and unknown.

It turns out that if the number of change points is unknown, it is still possible to locate them, if the total number of distributions generating the segments is known. Here as well we assume a known lower bound $\lambda$ on the minimal distance between change points. The algorithm starts by producing an exhaustive list of $1/\lambda$ change points with the algorithm of the previous section. It then clusters all the resulting segments of the sample into $r$ clusters, where $r$ is the number of different distributions that is assumed given. The clustering algorithm can be chosen to be that of Section 2, with $r$ as the target number of clusters. This results in the following statement.

Theorem 2.6.

There exists an algorithm that, given a sample ${\bf x}$ (8) generated by $r$ different stationary ergodic distributions, the number $r$ and a lower bound $\lambda$ on $\lambda_{\min}$, provides an estimate $\hat{\kappa}\in[1..n]$ and a list of change-point estimates $\theta_{i}$, $i=1..\hat{\kappa}$, that are asymptotically consistent:

[TABLE]

and

[TABLE]

*for $i=1..\kappa$.*

The details of the algorithm and proofs can be found in Khaleghi:14 .

In conclusion, we can formulate the following statement.

For stationary ergodic distributions generating the data between change points whose number is unknown, it is possible to find the correct number and provide consistent estimates of the change points if (and only if) the total number of different distributions is known.

Chapter 4 Hypothesis testing

Given a sample $X_{1},\dots,X_{n}$, we wish to decide whether it was generated by a distribution belonging to a family $H_{0}$ or by a distribution belonging to a family $H_{1}$. As before, the only assumption we are willing to make about the distribution generating the sample is that it is stationary ergodic.

In this chapter we assume that $X_{i}$ are from a finite alphabet $A$. Moreover, unlike in the previous chapters, in this one we shall delve a little deeper into the theory of stationary processes, and use some of its facts other than the simple convergence of frequencies. In particular, it will be of essence that the space $\mathcal{S}$ of stationary processes is compact with the topology of the distributional distance: a fact that holds for finite-alphabet processes with distance (4) but not for real-valued processes with the distance (5).

The material of this chapter mainly follows Ryabko:121c ; Ryabko:141u .

1 Introduction

A test is a function that takes a sample and gives a binary (possibly incorrect) answer: the sample was generated by a distribution from $H_{0}$ or from $H_{1}$. An answer $i\in\{0,1\}$ is correct if the sample is generated by a distribution that belongs to $H_{i}$, and otherwise the test is said to make an error. It often makes sense to distinguish between two types of error, depending on which of the hypotheses holds true. Thus, we say that the test makes a Type I error if $H_{0}$ is true but the test says $H_{1}$ is true, and we say that the test makes a Type II error if the opposite takes place: the test says $H_{0}$ while $H_{1}$ is true. Note that in case neither $H_{0}$ nor $H_{1}$ holds true the output of the test may be arbitrary and we are not speaking about any kind of error; generally, one cannot say anything about the behaviour of the test in such a case.

Here we are concerned with the general question of characterizing those pairs of $H_{0}$ and $H_{1}$ for which consistent tests exist.

Several notions of consistency are considered. For two of these notions of consistency we find some necessary and some sufficient conditions for the existence of a consistent test, expressed in topological terms. The topology is that of distributional distance in the form (4). For one notion of consistency, namely, for asymmetric consistency, the necessary and sufficient conditions coincide when $H_{1}$ is the complement of $H_{0}$ , thereby providing a complete characterization. This suggests that the topology of the distributional distance is indeed the right one to study these problems.

Each of the notions of consistency considered has been studied extensively (sometimes in slightly different formulations) for i.i.d. data. It is thus instructive to provide characterisations of those hypotheses for which consistent tests exist for this more restrictive model and see how it relates to the general case of stationary ergodic time series, which we do in this chapter whenever possible.

In the rest of this section we consider various examples of the problem of hypothesis testing that motivate studying it in the general form; we also introduce various notions of consistency used. In the next section, a simple example of hypothesis testing is considered in some detail, exposing various concepts used, including the notions of consistency, the topological criteria for consistency in simpler spaces and the role of the ergodic decomposition.

1 Motivation and examples

Before introducing the definitions of consistency, let us give some examples motivating the general problem in question. Most of these examples are classical problems studied in mathematical statistics and related fields, mostly for i.i.d. data, with much literature devoted to each of them. The classical Neyman-Pearson formulation of the hypothesis testing problem is testing a simple hypothesis $H_{0}=\{\rho_{0}\}$ versus a simple hypothesis $H_{1}=\{\rho_{1}\}$, where $\rho_{0}$ and $\rho_{1}$ are two distributions that are completely known. A more complex but more realistic problem is when only one of the hypotheses is simple, $H_{0}=\{\rho_{0}\}$, but the alternative is general; for example, in our framework $H_{1}$ could be the set of all stationary ergodic processes that are different from $\rho_{0}$. This is the so-called goodness-of-fit or identity-testing problem. Here $\rho_{0}$ would typically be some specific distribution of interest, such as the Bernoulli i.i.d. distribution with equal probabilities of outcomes.

Generalizing the latter example is the class of hypothesis testing problems that can be described as model verification problems. Suppose we have some relatively simple (possibly parametric) set of assumptions, and we wish to test whether the process generating the given sample satisfies these assumptions. As an example, $H_{0}$ can be the set of all $k$-order Markov processes (fixed $k\in\mathbb{N}$) and $H_{1}$ the set of all stationary ergodic processes that do not belong to $H_{0}$; one may also wish to consider more restrictive alternatives, for example $H_{1}$ being the set of all $k^{\prime}$-order Markov processes for some $k^{\prime}>k$. Of course, instead of Markov processes one can consider other models, e.g. hidden Markov processes. A similar problem is that of testing that the process has entropy less than some given $\varepsilon$ versus its entropy exceeding $\varepsilon$, or versus its entropy being greater than $\varepsilon+\delta$ for some positive $\delta$.

Yet another type of hypothesis testing problem concerns property testing. Suppose we are given two samples, each generated by a stationary ergodic distribution, and we wish to test the hypothesis that they are independent versus the alternative that they are not. Or, that they are generated by the same process versus the alternative that they are generated by different processes.

In all the considered cases, when the hypothesis testing problem turns out to be too difficult (i.e. there is no consistent test for the chosen notion of consistency) for the case of stationary ergodic processes, one may wish to restrict either $H_{0}$ , $H_{1}$ or both $H_{0}$ and $H_{1}$ to some smaller class of processes. Thus, one may wish to test the hypothesis of independence when, for example, both processes are known to have finite memory, but the alternative is allowed to be general: the complement of the set $H_{0}$ to the set $\mathcal{E}$ of stationary ergodic processes (on pairs).

2 Types of consistency

There are different types of consistency of tests, corresponding to how strong a guarantee one wishes to have on the probability of error. Three notions of consistency are considered here: uniform, asymmetric (or $\alpha$ -level), and asymptotic consistency. They represent different trade-offs between the strength of the guarantees one can obtain and the generality of hypotheses pairs for which consistent tests exist.

1 Uniform consistency

We start with what appears to be the strongest notion, uniform consistency. It requires both probabilities of error to be uniformly bounded. More precisely, uniform consistency requires that for each $\alpha$ there exists a sample size $n$ such that the probability of error is upper-bounded by $\alpha$ for samples longer than $n$.

Definition 2.1 (uniform consistency).

*A test $\varphi$ is called uniformly consistent if for every $\alpha$ there is an $n_{\alpha}\in\mathbb{N}$ such that for every $n\geq n_{\alpha}$ the probability of error on a sample of size $n$ is less than $\alpha$: $\rho(X\in A^{n}:\varphi(X)=i)<\alpha$ for every $\rho\in H_{1-i}$ and every $i\in\{0,1\}$.*

This notion of consistency has been extensively studied in the algorithms community for i.i.d. data under a slightly different formulation: the probability of each error is required to be bounded by a fixed number, typically 1/3, and the problem is to find the minimal sample sizes necessary to achieve this error. The interpretation is that if one can get 1/3 probability of error then one can make it arbitrarily small by taking more (independent) samples; see, for example, Goldreich:98 ; Batu:01 ; Batu:04 ; Guha:06 . The definition above is adapted for dependent data.
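The amplification step mentioned above can be made explicit for independent samples. As a sketch under assumptions not spelled out in the text (the test is run on $r$ independent samples, each run errs with probability at most $1/3$, and the final answer is the majority vote; $r$ and $E_{j}$ are notation introduced here), let $E_{j}\in\{0,1\}$ indicate an error on the $j$-th run. Hoeffding's inequality then gives

$$\Pr\Big(\sum_{j=1}^{r}E_{j}\geq\frac{r}{2}\Big)\leq\exp\Big(-2r\big(\tfrac{1}{2}-\tfrac{1}{3}\big)^{2}\Big)=\exp(-r/18),$$

so about $18\ln(1/\alpha)$ runs suffice for an error probability of at most $\alpha$. The argument relies on independence between the runs, which is not available for a single sample of dependent data.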

For i.i.d. samples it is easy to establish a criterion for the existence of a consistent test: there exists a uniformly consistent test if and only if $H_{i},i\in\{0,1\}$ are contained in closed non-overlapping sets. Here the topology is just that of the Euclidean distance on the space of parameters defining the distributions over $A$ . Indeed, to see that the condition is necessary, it is enough to notice that the sets of distributions $\rho$ satisfying $\rho(B)\leq\alpha$ are closed for any fixed $B\in A^{*}$ and $\alpha\in[0,1]$ , in particular for $B:=\{z_{1..n}:\varphi(z_{1..n})=0\}$ . On the other hand, to construct a test it is enough to take a neighbourhood over (say) $H_{0}$ of radius that slowly decreases with $n$ : for large enough $n$ the neighbourhood will not intersect $H_{1}$ (since both sets are closed), and one can use concentration of measure results for i.i.d. distributions to show that if the radius decreases slowly enough then the test is consistent. From this description it is clear that some generalizations to processes with mixing are possible. See also Csiszar:04 for related results (for i.i.d. data).

2 Asymmetric consistency

The next notion of consistency is the classical one used in mathematical statistics (e.g., Lehmann:86 ; Kendall:61 ): the probability of Type I error is fixed at the given level $\alpha$ , and the probability of Type II error goes to 0. This definition is well-suited for pairs of hypotheses that are by nature asymmetric, such as singleton $H_{0}$ , or hypotheses where $H_{1}$ is the complement to $H_{0}$ , for example, “the distribution belongs to a given parametric model” versus “it is stationary ergodic but not in the model,” or the examples considered in this work: “distributions generating a pair of samples are independent” versus they are not, or “distributions are the same” versus they are not. The definition is as follows.

Definition 2.2 (Asymmetric consistency).

Call an $\alpha$ -level test $\psi^{\alpha},\alpha\in(0,1)$ asymmetrically consistent as a test of $H_{0}$ against $H_{1}$ if:

(i)

The probability of Type I error is always bounded by $\alpha$ : $\rho\{X\in A^{n}:\psi^{\alpha}(X)=1\}\leq\alpha$ for every $\rho\in H_{0}$ , every $n\in\mathbb{N}$ and every $\alpha\in(0,1)$ , and

(ii)

Type II error is made not more than a finite number of times with probability 1: $\rho(\lim_{n\rightarrow\infty}\psi^{\alpha}(X_{1..n})=1)=1$ for every $\rho\in H_{1}$ and every $\alpha\in(0,1)$ .

Similar to the case of uniform consistency, here it is easy to see what the criterion is for i.i.d. samples. There exists an asymmetrically consistent test if and only if ${\operatorname{\scriptstyle\texttt{closure}}}(H_{0})$ does not intersect $H_{1}$ ; see the next section for a more detailed explanation, and also Csiszar:04 for this and related results.

3 Asymptotic consistency

Finally, what appears to be the weakest notion of consistency is perhaps the simplest to formulate: the error (of each type) has to be made finitely many times w.p.1.

Definition 2.3 (asymptotic consistency).

*A test $\varphi$ is called asymptotically consistent if for every $\rho\in H_{i}$ , $i=0,1$ we have $\lim_{n\to\infty}\varphi(z_{1..n})=i$ $\rho$ -a.s. *

This weakest notion of consistency gives the strongest negative results, which is why we used it in Section 4 to show that there is no consistent test for homogeneity (process discrimination).

For real-valued i.i.d. samples, where the hypotheses are formulated about the means of the distributions, this notion has been studied in Dembo:94 . For the case of distributions with finite $p$ th moments with $p>1$ the following criterion is obtained: there exists an asymptotically consistent test if and only if $H_{0}$ and $H_{1}$ are contained in disjoint $F_{\sigma}$ sets. (A set is $F_{\sigma}$ if it is a countable union of closed sets.) It can be seen that the same criterion holds in our case (finite-valued distributions) if the samples are i.i.d.

This notion of consistency has been given considerable attention in the time-series literature, perhaps because it is rather weak and thus appears more suited for time-series analysis. In particular, some specific hypotheses have been studied in Ornstein:90 ; Morvai:05 . For the general case of stationary ergodic distributions, Nobel:06 obtains a generalization of the results of Dembo:94 , providing some sufficient conditions for the existence of a consistent test for real-valued processes, in terms of the topology of weak convergence.

4 Other notions of consistency

Many other notions of consistency exist in statistics and related fields. For example, a variation on the notion of asymmetric consistency common in the literature is requiring the probability of Type I error to be bounded by $\alpha$ only asymptotically. Most other notions of consistency are focussed on speeds of convergence and thus are of little interest in our context. For example, one can require the probability of error (of each type) to decrease exponentially fast; see Csiszar:04 for some characterisations.

3 One example that explains hypotheses testing

Let us consider a rather simple example that illustrates various concepts used and difficulties encountered. The example will be that of homogeneity testing (or process discrimination) for binary-valued ( $A=\{0,1\}$ ) processes; we will consider i.i.d. processes and Markov chains, in addition to stationary ergodic distributions. For the i.i.d. case, it is easy to find a topological characterisation of those hypotheses for which consistent tests exist, so we do this for illustrative purposes. The example hypothesis considered here, homogeneity testing, is the problem we have addressed in Section 4 for asymptotic consistency in the general case. Here the main focus is on a stronger notion of consistency, namely asymmetric consistency, and on simpler processes. The goal is to illustrate the topological conditions that characterize the existence of consistent tests. The Markov case already shows why ergodic decomposition plays such an important role in finding the criteria for the existence of tests.

1 Bernoulli i.i.d. processes

Before considering dependent time series, let us see what would be the criterion for the existence of an asymmetrically consistent test for i.i.d. data, and apply it to our example of homogeneity testing.

Thus, we are speaking about Bernoulli distributions. Each such distribution $\rho$ can be identified with the parameter $\rho(X_{1}=0)\in[0,1]$ , and each hypothesis $H_{i}$ with a subset of the parameter space $[0,1]$ . Recall that a test $\varphi_{\alpha}({\bf x})$ , which receives an additional parameter $\alpha\in(0,1)$ , is said to be asymmetrically consistent if, for every sample size $n$ and every $\alpha$ , the probability of Type I error (that is, error under $H_{0}$ ) is upper-bounded by $\alpha$ , while the probability of Type II error (error under $H_{1}$ ) goes to 0. It is easy to see that there exists an asymmetrically consistent test if and only if ${\operatorname{\scriptstyle\texttt{closure}}}(H_{0})$ does not intersect $H_{1}$ . Here the topology is just that of the Euclidean distance on the parameter space. Indeed, to see that the condition is necessary, it is enough to notice that the sets of distributions $\rho$ satisfying $\rho(B)\leq\alpha$ are closed for any fixed $B\in A^{*}$ and $\alpha\in[0,1]$ , and in particular for $B:=\{z_{1..n}:\varphi(z_{1..n})=1\}$ . Thus, if, for the given sample size $n$ , the probability that the test says $H_{1}$ is upper-bounded by $\alpha$ for every $\rho\in H_{0}$ (Type I error) then the same holds for every $\rho\in{\operatorname{\scriptstyle\texttt{closure}}}(H_{0})$ . We have shown that it is necessary for $H_{0}$ to be closed in order for an asymmetrically consistent test against the complement to exist. To show sufficiency, we need to construct a test for an arbitrary closed $H_{0}$ . To do so, consider a closed set $H_{0}$ and the closed set $C\subset[0,1]$ of parameters that defines it. Take a sequence of neighbourhoods $C_{n}$ over $C$ of such radii that, for every $n$ and $\rho\in H_{0}$ , the probability under $\rho$ that the frequency of 0 in a sample of size $n$ falls into $C_{n}$ is at least $1-\alpha$ ; the test says $H_{0}$ if this frequency falls into $C_{n}$ and $H_{1}$ otherwise.
Note that the radius of these neighbourhoods decreases with $n$ (because of the law of large numbers), which means that for every distribution $\rho\in H_{1}$ there is a large enough $n$ such that (the parameter that defines) $\rho$ is outside $C_{n}$ . This implies that the Type II error goes to 0.
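The construction just described can be sketched in code. This is a minimal illustration rather than the test used in the text: it assumes that the closed set $C$ is given as a finite list of closed intervals, and it replaces the smallest admissible radius by the (generally larger) radius obtained from Hoeffding's inequality, $\sqrt{\ln(2/\alpha)/(2n)}$ , which still keeps the Type I error at most $\alpha$ for every $n$ . The names `dist_to_set` and `asymmetric_test` are hypothetical.

```python
import math

def dist_to_set(f, intervals):
    """Euclidean distance from the point f to a union of closed intervals [a, b]."""
    return min(max(a - f, 0.0, f - b) for a, b in intervals)

def asymmetric_test(sample, intervals, alpha):
    """Return 0 (accept H0) or 1 (reject H0) for a binary sample.

    H0 is the set of Bernoulli parameters rho(X_1 = 0) lying in `intervals`.
    The neighbourhood radius comes from Hoeffding's inequality (an assumption
    of this sketch), so P(reject) <= alpha for every parameter in H0.
    """
    n = len(sample)
    freq0 = sample.count(0) / n
    radius = math.sqrt(math.log(2 / alpha) / (2 * n))
    return 1 if dist_to_set(freq0, intervals) > radius else 0

# H0: the parameter equals 1/2. A balanced sample is accepted,
# an all-zero sample of length 1000 is rejected.
print(asymmetric_test([0, 1] * 50, [(0.5, 0.5)], alpha=0.05))
print(asymmetric_test([0] * 1000, [(0.5, 0.5)], alpha=0.05))
```

Since the radius tends to 0 and the frequency converges to the true parameter, any parameter at positive distance from the closed set is eventually rejected, which mirrors the argument for the Type II error.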

The hypothesis of homogeneity is formulated for ${\bf x}_{1},{\bf x}_{2}$ and states that their distributions $\rho_{1},\rho_{2}$ are equal. Thus, we are speaking about distributions on pairs of samples (which, for the sake of simplicity, we consider independent). For Bernoulli distributions, this is a two-parameter space $[0,1]^{2}$ . The hypothesis $H_{0}$ is the diagonal $\{(x,x):x\in[0,1]\}$ , which is of course closed, and so an asymmetrically consistent test exists. Similarly, for uniform consistency the criterion is that ${\operatorname{\scriptstyle\texttt{closure}}}(H_{0})\cap{\operatorname{\scriptstyle\texttt{closure}}}(H_{1})=\varnothing$ . Thus, there is no uniformly consistent test for homogeneity, and, more generally, there is no uniformly consistent test for any $H_{0}$ against its complement. If we want to have a uniformly consistent test for homogeneity, we need to change the alternative hypothesis $H_{1}$ . For example, change $H_{1}$ to “the distributions differ by at least $\varepsilon$ .” This ensures the existence of a uniformly consistent test at the cost of creating an $\varepsilon$ -buffer zone between $H_{0}$ and $H_{1}$ , in which, in general, we cannot say anything about the behaviour of a test.
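To make the $\varepsilon$ -buffer concrete, here is a small sketch for two independent Bernoulli samples of equal length $n$ . It is an illustration under stated assumptions, not a construction from the text: the test rejects when the two frequencies of 0 differ by more than $\varepsilon/2$ , and Hoeffding's inequality with a union bound over the two samples gives an error probability of at most $4\exp(-n\varepsilon^{2}/8)$ under both hypotheses. The function names are hypothetical.

```python
import math

def buffered_homogeneity_test(x, y, eps):
    """Return 0 ("same distribution") or 1 ("parameters differ by at least eps")."""
    f1 = x.count(0) / len(x)
    f2 = y.count(0) / len(y)
    return 1 if abs(f1 - f2) > eps / 2 else 0

def sample_size(eps, alpha):
    """Smallest n for which the bound 4 * exp(-n * eps**2 / 8) is below alpha."""
    return math.floor(8 * math.log(4 / alpha) / eps**2) + 1

print(sample_size(0.1, 0.05))
print(buffered_homogeneity_test([0] * 10, [1] * 10, eps=0.5))
```

The required sample size grows like $1/\varepsilon^{2}$ , which reflects why no single $n_{\alpha}$ works once the buffer is removed.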

2 Markov chains

Moving on to the case of two-state Markov chains, we have now two $[0,1]$ -valued parameters: the probabilities to change the state. As before, the state space is binary: $A=\{0,1\}$ . Let us try to guess that the criterion for the existence of an asymmetrically consistent test is the same as in the i.i.d. case: there exists a consistent test iff ${\operatorname{\scriptstyle\texttt{closure}}}(H_{0})\cap H_{1}=\varnothing$ , with the Euclidean topology of the parameter space, and let us see what it gives for the hypothesis of homogeneity. Consider a specific set of Markov chains, call it $m_{\varepsilon}$ . These are defined so that the probability to change a state (from 0 to 1 as well as from 1 to 0) for the chain $m_{\varepsilon}$ is $\varepsilon$ , and the initial distribution is given by $m_{\varepsilon}(X_{1}=0)=1/2$ . When $\varepsilon$ goes to $0$ , the limit of $m_{\varepsilon}$ (in the space $[0,1]^{2}$ of parameters) is $m_{0}$ . The latter is a stationary distribution which is a mixture of two Dirac distributions $\delta_{0}$ and $\delta_{1}$ : one concentrated on the sequence of 0s and the other on the sequence of 1s. This is the ergodic decomposition of $m_{0}$ : $m_{0}=1/2(\delta_{0}+\delta_{1})$ . Note that $m_{\varepsilon}$ for $\varepsilon>0$ are stationary and ergodic, while $m_{0}$ is stationary but not ergodic. And here lies the source of the trouble. For the hypothesis of homogeneity, consider the pair of distributions $(m_{\varepsilon},m_{\varepsilon})$ . When $\varepsilon\to 0$ , the limit is $(m_{0},m_{0})$ , which is the mixture

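$$\tfrac{1}{4}\big((\delta_{0},\delta_{0})+(\delta_{0},\delta_{1})+(\delta_{1},\delta_{0})+(\delta_{1},\delta_{1})\big).$$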

Call this mixture $W_{0}$ . Note that, under the distribution $(m_{0},m_{0})$ , with probability 1/2 we observe two different sequences, one is all 0s and the other all 1s. In other words, under the ergodic decomposition $W_{0}$ of $(m_{0},m_{0})$ , with probability 1/2 we observe two different distributions, either $(\delta_{0},\delta_{1})$ or $(\delta_{1},\delta_{0})$ , so that $W_{0}(H_{1})=1/2$ . Nonetheless, the distribution $(m_{0},m_{0})$ itself is of course in $H_{0}$ .

Let us now demonstrate that there is no asymmetrically consistent test for $H_{0}$ against its complement in the set of Markov chain distributions. As in the i.i.d. case, the sets of distributions $\rho$ satisfying $\rho(B)\leq\alpha$ are closed for any fixed $B\in A^{*}$ and $\alpha\in[0,1]$ , and in particular for $B:=\{z_{1..n}:\varphi(z_{1..n})=1\}$ . Thus, for any test $\varphi$ and any given sample size $n$ , if the sets $\{(X_{1..n},Y_{1..n})\in(A^{n})^{2}:\varphi_{\alpha}(X_{1..n},Y_{1..n})=1\}$ on which the test says $H_{1}$ (makes Type I error) have probability at most $\alpha$ with respect to every $(m_{\varepsilon},m_{\varepsilon})$ for $\varepsilon>0$ , then they also have probability at most $\alpha$ under the distribution $(m_{0},m_{0})$ . The latter distribution, however, is concentrated on four pairs of $n$ -tuples: $(000..0,111..1)$ , $(111..1,111..1)$ , $(111..1,000..0)$ , $(000..0,000..0)$ . This means that for $\alpha<1/4$ the test must say that the distributions are the same when presented with at least one of the pairs of samples $(000..0,111..1)$ or $(111..1,000..0)$ . Since this happens for every $n$ , we conclude that any such test is inconsistent: its Type II error does not go to 0, since it is at least 1/4 for infinitely many $n$ under at least one of the distributions $(\delta_{0},\delta_{1})$ or $(\delta_{1},\delta_{0})$ .
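The limiting behaviour used in this argument can be checked by a direct calculation. The sketch below assumes, as in the i.i.d. example, that the two samples are independent; it computes the exact probability, under $(m_{\varepsilon},m_{\varepsilon})$ , that the two samples of length $n$ are both constant and differ from each other. The function name is hypothetical.

```python
def prob_constant_and_different(eps: float, n: int) -> float:
    """P(one sample is 00..0 and the other is 11..1) under (m_eps, m_eps).

    Each chain starts in state 0 or 1 with probability 1/2 and stays put
    for n - 1 steps with probability (1 - eps) ** (n - 1).
    """
    stay = (1 - eps) ** (n - 1)
    # Two orderings: (00..0, 11..1) and (11..1, 00..0).
    return 2 * (0.5 * stay) * (0.5 * stay)

n = 100
for eps in (0.1, 0.01, 0.001, 0.0):
    print(eps, prob_constant_and_different(eps, n))
```

For any fixed $n$ the value tends to 1/2 as $\varepsilon\to 0$ , in agreement with $W_{0}(H_{1})=1/2$ : ergodic members of $H_{0}$ can produce, with probability close to 1/2, exactly the samples that are typical for $(\delta_{0},\delta_{1})$ or $(\delta_{1},\delta_{0})$ .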

Thus, we have shown that there is no asymmetrically consistent test for homogeneity for (stationary ergodic) Markov chains. The reason for this is that, while the set $H_{0}$ is closed, it is not closed under ergodic decompositions. Specifically, there exists a distribution $\rho\in H_{0}$ (namely, $\rho=(m_{0},m_{0})$ ), whose ergodic decomposition $W_{0}$ is such that $W_{0}(H_{1})=1/2$ . Ergodic decompositions of the limit points of $H_{0}$ are what we need to take care of in the general case of stationary ergodic distributions.

As the last word about homogeneity testing for Markov chains, let us note that, unlike for stationary ergodic distributions, there exists an asymptotically consistent test for this hypothesis for this set of processes. Indeed, ergodic Markov chains mix exponentially fast (e.g., hernandez:03 ), which is enough to construct a test, considering sets around $H_{0}$ that shrink sufficiently slowly. An example of such an algorithm for the more general problem of clustering distributions with mixing can be found in Khaleghi:15clust .

3 Stationary ergodic processes

Finally, let us pass to the general case of stationary ergodic distributions. The topology of the distributional distance that we work with is a direct generalisation of the Euclidean topology of the parameter spaces on the Bernoulli and Markov distributions that we considered. In fact, the topology induced by the distributional distance on these parameter spaces is exactly the same.

As we have seen in the Markov case, the main problem is with the limit points of $H_{0}$ and their ergodic decompositions. More generally, while the set $\mathcal{S}$ of stationary processes is closed in the topology of the distributional distance, the set $\mathcal{E}$ of stationary ergodic distributions is not (its closure is $\mathcal{S}$ ). This parallels the situation with Markov chains: the closure of the set of stationary ergodic Markov chains is the set of all stationary Markov chains.

For the case of asymmetric consistency for stationary ergodic processes, the pinnacle result presented in this chapter is the following criterion: there exists an asymmetrically consistent test of $H_{0}\subset\mathcal{E}$ against its complement $H_{1}=\mathcal{E}\backslash H_{0}$ if and only if $H_{0}$ has probability 1 with respect to the ergodic decomposition of every process in the closure of ${H_{0}}$ . This is a corollary of the more general result presented in this chapter for the case when $H_{0}$ is not necessarily the complement of $H_{1}$ ; however the condition only becomes “if and only if” in the case of the complement. This result can be directly applied to the hypothesis of homogeneity testing to show that there is no asymmetrically consistent test against its complement: indeed, the proof that $H_{0}$ is not closed under taking ergodic decompositions is by the Markov example of the previous subsection.

4 Topological characterizations

In this section we formulate our criteria for the existence of consistent tests, and give constructions of the tests which are consistent if and only if consistent tests exist.

These constructions are not exactly algorithms, since one can hardly talk about algorithms whose input is an arbitrary set of distributions. However, the tests specify what should be estimated and how the decision should be made. Therefore, we provide procedures that work if anything works at all; turning them into efficient algorithms for specific problems is an interesting direction for further research.

The tests presented below are based on empirical estimates of the distributional distance. We shall first generalize this to measure the distance between a sample and a set of distributions (a hypothesis), rather than a single distribution or another sample.

For a sample $X_{1..n}\in A^{n}$ and a hypothesis $H\subset\mathcal{E}$ define

$$\hat{d}(X_{1..n},H):=\inf_{\rho\in H}\hat{d}(X_{1..n},\rho).$$

For $H\subset\mathcal{S}$ , denote by ${\operatorname{\scriptstyle\texttt{closure}}}(H)$ the closure of $H$ with respect to the topology of $d$ .
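The distance between a sample and a hypothesis can be sketched as follows. Several simplifications here are assumptions of the sketch and not part of the definitions above: the sum over words is truncated at a maximal word length, the weight of words of length $k$ is taken to be $2^{-k}$ (the actual weights $w_{i}$ are fixed elsewhere in the text), the processes are binary i.i.d. with parameter $\rho(X_{1}=0)$ , and the hypothesis is a finite list of such parameters so that the infimum is a minimum. All function names are hypothetical.

```python
from itertools import product

def frequency(x, word):
    """Relative frequency of `word` among the len(x) - len(word) + 1 windows of x."""
    k = len(word)
    windows = len(x) - k + 1
    if windows <= 0:
        return 0.0
    hits = sum(1 for i in range(windows) if tuple(x[i:i + k]) == word)
    return hits / windows

def bernoulli_prob(p0, word):
    """Probability of `word` under a binary i.i.d. process with P(0) = p0."""
    prob = 1.0
    for symbol in word:
        prob *= p0 if symbol == 0 else 1.0 - p0
    return prob

def d_hat(x, p0, max_len=3):
    """Truncated empirical distance between the sample x and Bernoulli(p0)."""
    total = 0.0
    for k in range(1, max_len + 1):
        weight = 2.0 ** (-k)  # assumed weights, chosen only for illustration
        total += weight * sum(abs(frequency(x, w) - bernoulli_prob(p0, w))
                              for w in product((0, 1), repeat=k))
    return total

def d_hat_to_set(x, hypothesis, max_len=3):
    """Distance from the sample to a finite hypothesis: minimum over its members."""
    return min(d_hat(x, p0, max_len) for p0 in hypothesis)

sample = [0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0]
print(d_hat_to_set(sample, [0.2, 0.5, 0.7]))
```

For a general hypothesis the infimum ranges over an arbitrary set of processes, which is why the text speaks of procedures rather than algorithms.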

1 Uniform testing

For $H_{0},H_{1}\subset\mathcal{S}$ , the uniform test $\varphi_{H_{0},H_{1}}$ is constructed as follows. For each $n\in\mathbb{N}$ let

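$$\varphi_{H_{0},H_{1}}(X_{1..n}):=\begin{cases}0&\text{if }\hat{d}(X_{1..n},{\operatorname{\scriptstyle\texttt{closure}}}(H_{0})\cap\mathcal{E})<\hat{d}(X_{1..n},{\operatorname{\scriptstyle\texttt{closure}}}(H_{1})\cap\mathcal{E}),\\ 1&\text{otherwise.}\end{cases}$$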

Since the set $\mathcal{S}$ is a complete separable metric space, it is easy to see that the function $\varphi_{H_{0},H_{1}}(X_{1..n})$ is measurable provided ${\operatorname{\scriptstyle\texttt{closure}}}(H_{0})$ is measurable.

Theorem 4.1 (uniform testing).

*Let $H_{0},H_{1}$ be measurable subsets of $\mathcal{E}$ . If $W_{\rho}(H_{i})=1$ for every $\rho\in{\operatorname{\scriptstyle\texttt{closure}}}(H_{i})$ then the test $\varphi_{H_{0},H_{1}}$ is uniformly consistent. Conversely, if there exists a uniformly consistent test for $H_{0}$ against $H_{1}$ then $W_{\rho}(H_{1-i})=0$ for any $\rho\in{\operatorname{\scriptstyle\texttt{closure}}}(H_{i})$ . *

The proof is deferred to section 5.

The following corollary, which is easy to see already for i.i.d. distributions (see Section 3), for the general case is an immediate consequence of the second statement of the theorem above.

Corollary 4.2.

*There is no uniformly consistent test for any hypothesis $H_{0}$ against its complement $\mathcal{E}\setminus H_{0}$ unless one of these hypotheses is empty. *

2 Asymmetric testing

Construct the asymmetric test $\psi_{H_{0},H_{1}}^{\alpha},\alpha\in(0,1)$ as follows. For each $n\in\mathbb{N}$ , $\delta>0$ and $H\subset\mathcal{E}$ define the neighbourhood $b^{n}_{\delta}(H)$ of $n$ -tuples around $H$ as

$$b^{n}_{\delta}(H):=\{X\in A^{n}:\hat{d}(X,H)\leq\delta\}.$$

Moreover, let

$$\gamma_{n}(H,\theta):=\inf\{\delta:\inf_{\rho\in H}\rho(b^{n}_{\delta}(H))\geq\theta\}$$

be the smallest radius of a neighbourhood around $H$ that has probability not less than $\theta$ with respect to any process in $H$ , and let $C^{n}(H,\theta):=b^{n}_{\gamma_{n}(H,\theta)}(H)$ be the neighbourhood of this radius. Define

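$$\psi^{\alpha}_{H_{0},H_{1}}(X_{1..n}):=\begin{cases}0&\text{if }X_{1..n}\in C^{n}({\operatorname{\scriptstyle\texttt{closure}}}(H_{0})\cap\mathcal{E},1-\alpha),\\ 1&\text{otherwise.}\end{cases}$$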

Again, it is easy to see that the function $\psi^{\alpha}_{H_{0},H_{1}}(X_{1..n})$ is measurable, since the set $\mathcal{S}$ is separable.
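The role of the smallest radius can be illustrated in a much simpler setting than the one of this section. The sketch below is an analogue under explicit assumptions: the hypothesis is a finite set of Bernoulli parameters $\rho(X_{1}=0)$ , the distance from a sample to the hypothesis is replaced by the distance between the frequency of 0 and the nearest parameter, and probabilities are computed exactly from the binomial distribution. The names `gamma` and `psi_test` are hypothetical.

```python
from math import comb

def binom_pmf(n, c, p):
    """Probability of observing c zeros in n i.i.d. draws with P(0) = p."""
    return comb(n, c) * p**c * (1 - p)**(n - c)

def dist(freq, hypothesis):
    """Distance from an observed frequency to the nearest parameter in the hypothesis."""
    return min(abs(freq - p) for p in hypothesis)

def gamma(n, hypothesis, theta):
    """Smallest radius delta such that the delta-neighbourhood of the hypothesis
    has probability at least theta under every member of the hypothesis."""
    dists = [dist(c / n, hypothesis) for c in range(n + 1)]
    for delta in sorted(set(dists)):
        worst = min(sum(binom_pmf(n, c, p) for c in range(n + 1) if dists[c] <= delta)
                    for p in hypothesis)
        if worst >= theta:
            return delta
    return max(dists)  # fallback against rounding when theta is very close to 1

def psi_test(count0, n, hypothesis, alpha):
    """Accept H0 (return 0) iff the sample falls into the (1 - alpha)-neighbourhood."""
    return 0 if dist(count0 / n, hypothesis) <= gamma(n, hypothesis, 1 - alpha) else 1

for n in (25, 100, 400):
    print(n, gamma(n, [0.5], 0.9))
```

In this analogue the radius shrinks as $n$ grows and the Type I error is at most $\alpha$ by construction, which are the two properties the general construction relies on.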

Theorem 4.3.

*Let $H_{0},H_{1}$ be measurable subsets of $\mathcal{E}$ . If $W_{\rho}(H_{0})=1$ for every $\rho\in{\operatorname{\scriptstyle\texttt{closure}}}(H_{0})$ then the test $\psi^{\alpha}_{H_{0},H_{1}}$ is asymmetrically consistent. Conversely, if there is an asymmetrically consistent test for $H_{0}$ against $H_{1}$ then $W_{\rho}(H_{1})=0$ for any $\rho\in{\operatorname{\scriptstyle\texttt{closure}}}(H_{0})$ . *

For the case when $H_{1}$ is the complement of $H_{0}$ the necessary and sufficient conditions of Theorem 4.3 coincide and give the following criterion.

Corollary 4.4.

Let $H_{0}\subset\mathcal{E}$ be measurable and let $H_{1}=\mathcal{E}\backslash H_{0}$ . The following statements are equivalent:

(i)

There exists an asymmetrically consistent test for $H_{0}$ against $H_{1}$ .

(ii)

The test $\psi^{\alpha}_{H_{0},H_{1}}$ is asymmetrically consistent.

(iii)

The set $H_{1}$ has probability 0 with respect to the ergodic decomposition of every $\rho$ in the closure of $H_{0}$ : $W_{\rho}(H_{1})=0$ for each $\rho\in{\operatorname{\scriptstyle\texttt{closure}}}(H_{0})$ .


There exists an asymmetrically ( $\alpha$ -level) consistent test for a hypothesis $H_{0}\subset\mathcal{E}$ against its complement $\mathcal{E}\setminus H_{0}$ if and only if $H_{0}$ is closed and closed under taking ergodic decompositions, in the sense that $W_{\rho}(H_{0})=1$ for every $\rho$ in the closure of ${H_{0}}$ .

5 Proofs

In the proofs, we often omit the subscript $H_{0},H_{1}$ from $\psi^{\alpha}_{H_{0},H_{1}}$ when it can cause no confusion.

The proofs use the following lemmas.

Lemma 5.1 (smooth probabilities of deviation).

Let $m>2k>2$ , $\rho\in\mathcal{S}$ , $H\subset\mathcal{S}$ , and $\varepsilon>0$ . Then

[TABLE]

where $\varepsilon^{\prime}:=\varepsilon-\frac{2k}{m-k+1}-t_{k}$ with $t_{k}$ being the sum of all the weights of tuples longer than $k$ in the definition of $d$ : $t_{k}:=\sum_{i:|B_{i}|>k}w_{i}$ . Further,

[TABLE]

The meaning of this lemma is as follows. For any word $X_{1..m}$ , if it is far away from (or close to) a given distribution $\mu$ (in the empirical distributional distance), then some of its shorter subwords $X_{i..i+k}$ are far from (close to) $\mu$ too. In other words, for a stationary distribution $\mu$ , it cannot happen that a small sample is likely to be close to $\mu$ , but a larger sample is likely to be far.

Proof 5.2.

Let $B$ be a tuple such that $|B|<k$ and $X_{1..m}\in A^{m}$ be any sample of size $m>1$ . The number of occurrences of $B$ in $X$ can be bounded by the number of occurrences of $B$ in subwords of $X$ of length $k$ as follows:

[TABLE]

Indeed, summing over $i=1..m-k$ the number of occurrences of $B$ in all $X_{i..i+k-1}$ we count each occurrence of $B$ exactly $k-|B|+1$ times, except for those that occur in the first and last $k$ symbols. Dividing by $m-|B|+1$ , and using the definition (1), we obtain

[TABLE]

Summing over all $B$ , for any $\mu$ , we get

[TABLE]

where in the right-hand side $t_{k}$ corresponds to all the summands in the left-hand side for which $|B|>k$ , where for the rest of the summands we used $|B|\leq k$ . Since this holds for any $\mu$ , we conclude that

[TABLE]

Note that $\hat{d}(X_{i..i+k-1},H)\in[0,1]$ . Therefore, for the average in the r.h.s. of (6) to be larger than $\varepsilon^{\prime}$ , at least $(\varepsilon^{\prime}/2)(m-k+1)$ summands have to be larger than $\varepsilon^{\prime}/2$ .

Using stationarity, we can conclude

[TABLE]

proving (2). The second statement can be proven similarly; indeed, analogously to (4) we have

[TABLE]

where we have used $|B|\geq 1$ . Summing over different $B$ , we obtain (similar to (5)),

[TABLE]

*(since the frequencies are non-negative, there is no $t_{n}$ term here). For the average in (7) to be smaller than $\varepsilon$ , at least half of the summands must be smaller than $2\varepsilon$ . Using stationarity of $\rho$ , this implies (3). *
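The counting step at the start of this proof can be checked numerically. The sketch below takes all $m-k+1$ windows of length $k$ and verifies, on small examples, that the total number of occurrences of a word $B$ over the windows is at most $(k-|B|+1)$ times its number of occurrences in the whole sample, with a shortfall caused only by occurrences near the two ends. The helper names are hypothetical, and the lower bound used in the check (at most $2(k-|B|)$ affected occurrences) is this sketch's own bookkeeping of the boundary effect.

```python
def occurrences(x: str, word: str) -> int:
    """Number of (possibly overlapping) occurrences of `word` in x."""
    L = len(word)
    return sum(1 for j in range(len(x) - L + 1) if x[j:j + L] == word)

def windowed_occurrences(x: str, word: str, k: int) -> int:
    """Total number of occurrences of `word` over all length-k windows of x."""
    return sum(occurrences(x[i:i + k], word) for i in range(len(x) - k + 1))

x = "0110100110010110"
k = 4
for word in ("0", "1", "01", "10", "011"):
    total = windowed_occurrences(x, word, k)
    per_occurrence = k - len(word) + 1
    upper = per_occurrence * occurrences(x, word)
    lower = per_occurrence * (occurrences(x, word) - 2 * (k - len(word)))
    print(word, lower, total, upper)
```

Dividing such counts by the number of positions is what turns the word-count inequality into an inequality between frequencies in the sample and average frequencies in its subwords.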

Lemma 5.3.

*Let $\rho_{k}\in\mathcal{S}$ , $k\in\mathbb{N}$ be a sequence of processes that converges to a process $\rho_{*}$ . Then, for any $T\in A^{*}$ and $\varepsilon>0$ if $\rho_{k}(T)>\varepsilon$ for infinitely many indices $k$ , then $\rho_{*}(T)\geq\varepsilon$ . *

Proof 5.4.

*The statement follows from the fact that $\rho(T)$ is continuous as a function of $\rho$ . *

Proof 5.5 (of Theorem 4.3.).

To establish the first statement of Theorem 4.3, we have to show that the family of tests $\psi^{\alpha}$ is consistent. By construction, for any $\rho\in{\operatorname{\scriptstyle\texttt{closure}}}(H_{0})\cap\mathcal{E}$ we have $\rho(\psi^{\alpha}(X_{1..n})=1)\leq\alpha$ .

To prove the consistency of $\psi$ , it remains to show that

[TABLE]

for any $\xi\in H_{1}$ and $\alpha>0$ . To do this, fix any $\xi\in H_{1}$ and let

[TABLE]

Since $\xi\notin{\operatorname{\scriptstyle\texttt{closure}}}(H_{0})$ , we have $\Delta>0$ . Suppose that there exists an $\alpha>0$ , such that, for infinitely many $n$ , some samples from the $\Delta/2$ -neighbourhood of $n$ -samples around $\xi$ are sorted as $H_{0}$ by $\psi$ , that is, $C^{n}({\operatorname{\scriptstyle\texttt{closure}}}(H_{0})\cap\mathcal{E},1-\alpha)\cap b_{\Delta/2}^{n}(\xi)\neq\varnothing$ . Then for these $n$ we have $\gamma_{n}({\operatorname{\scriptstyle\texttt{closure}}}(H_{0})\cap\mathcal{E},1-\alpha)\geq\Delta/2$ .

This means that there exists an increasing sequence $n_{m},m\in\mathbb{N}$ , and a sequence $\rho_{m}\in{\operatorname{\scriptstyle\texttt{closure}}}(H_{0})$ , $m\in\mathbb{N}$ , such that

[TABLE]

Using Lemma 5.1, (2) (with $\rho=\rho_{m}$ , $m=n_{m}$ , $k=n_{k}$ , and $H={\operatorname{\scriptstyle\texttt{closure}}}(H_{0})$ ), and taking $k$ large enough to have $t_{n_{k}}<\Delta/4$ , for every $m$ large enough to have $\frac{2n_{k}}{n_{m}-n_{k}+1}<\Delta/4$ , we obtain

[TABLE]

Thus,

[TABLE]

Since the set ${\operatorname{\scriptstyle\texttt{closure}}}(H_{0})$ is compact (as a closed subset of a compact set $\mathcal{S}$ ), we may assume (passing to a subsequence, if necessary) that $\rho_{m}$ converges to a certain $\rho_{*}\in{\operatorname{\scriptstyle\texttt{closure}}}(H_{0})$ . Since (9) holds for infinitely many $m$ , using Lemma 5.3 (with $T=b^{n_{k}}_{\Delta/4}({\operatorname{\scriptstyle\texttt{closure}}}(H_{0})\cap\mathcal{E})$ ) we conclude that

[TABLE]

Since the latter inequality holds for infinitely many indices $k$ we also have

[TABLE]

However, we must have $\rho_{*}(\lim_{n\rightarrow\infty}\hat{d}(X_{1..n},{\operatorname{\scriptstyle\texttt{closure}}}(H_{0})\cap\mathcal{E})=0)=1$ for every $\rho_{*}\in{\operatorname{\scriptstyle\texttt{closure}}}(H_{0})$ : indeed, for $\rho_{*}\in{\operatorname{\scriptstyle\texttt{closure}}}(H_{0})\cap\mathcal{E}$ it follows from Lemma 1.2, and for $\rho_{*}\in{\operatorname{\scriptstyle\texttt{closure}}}(H_{0})\backslash\mathcal{E}$ from Lemma 1.2, ergodic decomposition and the conditions of the theorem ( $W_{\rho}(H_{0})=1$ for $\rho\in{\operatorname{\scriptstyle\texttt{closure}}}(H_{0})$ ).

This contradiction shows that for every $\alpha$ there are not more than finitely many $n$ for which $C^{n}({\operatorname{\scriptstyle\texttt{closure}}}(H_{0})\cap\mathcal{E},1-\alpha)\cap b_{\Delta/2}^{n}(\xi)\neq\varnothing$ . To finish the proof of the first statement, it remains to note that, as follows from Lemma 1.2,

[TABLE]

To establish the second statement of Theorem 4.3 we assume that there exists an asymmetrically consistent test $\psi$ for $H_{0}$ against $H_{1}$ , and we will show that $W_{\rho}(H_{1})=0$ for every $\rho\in{\operatorname{\scriptstyle\texttt{closure}}}(H_{0})$ . Take $\rho\in{\operatorname{\scriptstyle\texttt{closure}}}(H_{0})$ and suppose that

[TABLE]

We have

[TABLE]

*where the inequality follows from Fatou’s lemma (the functions under integral are all bounded by 1), and the equality from the consistency of $\psi$ . Thus, from some $n$ on we will have $\int_{H_{1}}dW_{\rho}\mu(\psi^{\delta/2}_{n}=0)<\delta/4$ . Taking into account (10), we conclude $\rho(\psi^{\delta/2}_{n}=0)<1-3\delta/4$ . For any set $T\in A^{n}$ the function $\mu(T)$ is continuous as a function of $\mu$ . In particular, this holds for the set $T:=\{X_{1..n}:\psi_{n}^{\delta/2}(X_{1..n})=0\}$ . Therefore, since $\rho\in{\operatorname{\scriptstyle\texttt{closure}}}(H_{0})$ , for any $n$ large enough we can find a $\rho^{\prime}\in H_{0}$ such that $\rho^{\prime}(\psi^{\delta/2}_{n}=0)<1-3\delta/4$ , which contradicts the consistency of $\psi$ . Thus, $W_{\rho}(H_{1})=0$ , and Theorem 4.3 is proven. *

Proof 5.6 (of Theorem 4.1.).

To prove the first statement of the theorem, we will show that the test $\varphi_{H_{0},H_{1}}$ is a uniformly consistent test for ${\operatorname{\scriptstyle\texttt{closure}}}(H_{0})\cap\mathcal{E}$ against ${\operatorname{\scriptstyle\texttt{closure}}}(H_{1})\cap\mathcal{E}$ (and hence for $H_{0}$ against $H_{1}$ ), under the conditions of the theorem. Suppose that, on the contrary, for some $\alpha>0$ for every $n^{\prime}\in\mathbb{N}$ there is a process $\rho\in{\operatorname{\scriptstyle\texttt{closure}}}(H_{0})$ such that $\rho(\varphi(X_{1..n})=1)>\alpha$ for some $n>n^{\prime}$ . Define

[TABLE]

which is positive since ${\operatorname{\scriptstyle\texttt{closure}}}(H_{0})$ and ${\operatorname{\scriptstyle\texttt{closure}}}(H_{1})$ are closed and disjoint. We have

[TABLE]

This implies that either

[TABLE]

or

[TABLE]

so that, by assumption, at least one of these inequalities holds for infinitely many $n\in\mathbb{N}$ for some sequence $\rho_{n}\in H_{0}$ . Suppose that it is the first one, that is, there is an increasing sequence $n_{i}$ , $i\in\mathbb{N}$ and a sequence $\rho_{i}\in{\operatorname{\scriptstyle\texttt{closure}}}(H_{0})$ , $i\in\mathbb{N}$ such that

[TABLE]

The set $\mathcal{S}$ is compact, hence so is its closed subset ${\operatorname{\scriptstyle\texttt{closure}}}(H_{0})$ . Therefore, the sequence $\rho_{i}$ , $i\in\mathbb{N}$ must contain a subsequence that converges to a certain process $\rho_{*}\in{\operatorname{\scriptstyle\texttt{closure}}}(H_{0})$ . Passing to a subsequence if necessary, we may assume that this convergent subsequence is the sequence $\rho_{i}$ , $i\in\mathbb{N}$ itself.

Using Lemma 5.1, (2) (with $\rho=\rho_{n_{m}}$ , $m=n_{m}$ , $k=n_{k}$ , and $H={\operatorname{\scriptstyle\texttt{closure}}}(H_{0})$ ), and taking $k$ large enough to have $t_{n_{k}}<\Delta/4$ , for every $m$ large enough to have $\frac{2n_{k}}{n_{m}-n_{k}+1}<\Delta/4$ , we obtain

[TABLE]

That is, we have shown that for any large enough index $n_{k}$ the inequality $\rho_{n_{m}}(\hat{d}(X_{1..n_{k}},{\operatorname{\scriptstyle\texttt{closure}}}(H_{0}))\geq\Delta/4)>\Delta\alpha/16$ holds for infinitely many indices $n_{m}$ . From this and Lemma 5.3 with $T=T_{k}:=\{X:\hat{d}(X_{1..n_{k}},{\operatorname{\scriptstyle\texttt{closure}}}(H_{0}))\geq\Delta/4\}$ we conclude that $\rho_{*}(T_{k})>\Delta\alpha/16$ . The latter holds for infinitely many $k$ ; that is, $\rho_{*}(\hat{d}(X_{1..n_{k}},{\operatorname{\scriptstyle\texttt{closure}}}(H_{0}))\geq\Delta/4)>\Delta\alpha/16$ infinitely often. Therefore,

[TABLE]

However, we must have

[TABLE]

for every $\rho_{*}\in{\operatorname{\scriptstyle\texttt{closure}}}(H_{0})$ : indeed, for $\rho_{*}\in{\operatorname{\scriptstyle\texttt{closure}}}(H_{0})\cap\mathcal{E}$ it follows from Lemma 1.2, and for $\rho_{*}\in{\operatorname{\scriptstyle\texttt{closure}}}(H_{0})\backslash\mathcal{E}$ from Lemma 1.2, ergodic decomposition and the conditions of the theorem.

Thus, we have arrived at a contradiction that shows that $\rho_{n}(\hat{d}(X_{1..n},{\operatorname{\scriptstyle\texttt{closure}}}(H_{0}))>\Delta/2)>\alpha/2$ cannot hold for infinitely many $n\in\mathbb{N}$ for any sequence of $\rho_{n}\in{\operatorname{\scriptstyle\texttt{closure}}}(H_{0})$ . Analogously, we can show that $\rho_{n}(\hat{d}(X_{1..n},{\operatorname{\scriptstyle\texttt{closure}}}(H_{1}))<\Delta/2)>\alpha/2$ cannot hold for infinitely many $n\in\mathbb{N}$ for any sequence of $\rho_{n}\in{\operatorname{\scriptstyle\texttt{closure}}}(H_{0})$ . Indeed, using Lemma 5.1, equation (3), we can show that $\rho_{n_{m}}(\hat{d}(X_{1..n_{m}},{\operatorname{\scriptstyle\texttt{closure}}}(H_{1}))\leq\Delta/2)>\alpha/2$ for a large enough $n_{m}$ implies $\rho_{n_{m}}(\hat{d}(X_{1..n_{k}},{\operatorname{\scriptstyle\texttt{closure}}}(H_{1}))\leq 3\Delta/4)>\alpha/4$ for a smaller $n_{k}$ . Therefore, if we assume that $\rho_{n}(\hat{d}(X_{1..n},{\operatorname{\scriptstyle\texttt{closure}}}(H_{1}))<\Delta/2)>\alpha/4$ for infinitely many $n\in\mathbb{N}$ for some sequence of $\rho_{n}\in{\operatorname{\scriptstyle\texttt{closure}}}(H_{0})$ , then we will also find a $\rho_{*}$ for which $\rho_{*}(\hat{d}(X_{1..n},{\operatorname{\scriptstyle\texttt{closure}}}(H_{1}))\leq 3\Delta/4)>\alpha/4$ for infinitely many $n$ , which, using Lemma 1.2 and ergodic decomposition, can be shown to contradict the fact that $\rho_{*}(\lim_{n\rightarrow\infty}d(X_{1..n},{\operatorname{\scriptstyle\texttt{closure}}}(H_{1}))\geq\Delta)=1$ .

Thus, returning to (11), we have shown that from some $n$ on there is no $\rho\in{\operatorname{\scriptstyle\texttt{closure}}}(H_{0})$ for which $\rho(\varphi=1)>\alpha$ holds true. The statement for $\rho\in{\operatorname{\scriptstyle\texttt{closure}}}(H_{1})$ can be proven analogously, thereby finishing the proof of the first statement.

To prove the second statement of the theorem, we assume that there exists a uniformly consistent test $\varphi$ for $H_{0}$ against $H_{1}$ , and we will show that $W_{\rho}(H_{1-i})=0$ for every $\rho\in{\operatorname{\scriptstyle\texttt{closure}}}(H_{i})$ . Indeed, let $\rho\in{\operatorname{\scriptstyle\texttt{closure}}}(H_{0})$ , that is, suppose that there is a sequence $\xi_{i}\in H_{0},i\in\mathbb{N}$ such that $\xi_{i}\to\rho$ . Assume $W_{\rho}(H_{1})=\delta>0$ and take $\alpha:=\delta/2$ . Since the test $\varphi$ is uniformly consistent, there is an $N\in\mathbb{N}$ such that for every $n>N$ we have

[TABLE]

*Recall that, for $T\in A^{*}$ , $\mu(T)$ is a continuous function in $\mu$ . In particular, this holds for the set $T=\{X\in A^{n}:\varphi(X)=0\}$ , for any given $n\in\mathbb{N}$ . Therefore, for every $n>N$ and for every $i$ large enough, $\rho_{i}(\varphi(X_{1..n})=0)<1-\delta/2$ implies also $\xi_{i}(\varphi(X_{1..n})=0)<1-\delta/2$ which contradicts $\xi_{i}\in H_{0}$ . This contradiction shows $W_{\rho}(H_{1})=0$ for every $\rho\in{\operatorname{\scriptstyle\texttt{closure}}}(H_{0})$ . The case $\rho\in{\operatorname{\scriptstyle\texttt{closure}}}(H_{1})$ is analogous. *

6 Examples

Theorems 4.3 and 4.1 can be used to check whether a consistent test exists for such problems as identity, independence, estimating the order of a (Hidden) Markov model, bounding entropy, bounding distance, uniformity, monotonicity, etc. Some of these examples are considered in this section.

1 Simple hypotheses, identity or goodness-of-fit testing

First of all, it is obvious that sets that consist of just one or finitely many stationary ergodic processes are closed and closed under ergodic decompositions. Thus, they meet the conditions of Theorem 4.1, and so, for any pair of disjoint sets of this type, there exists a uniformly consistent test. (In particular, there is a uniformly consistent test for $H_{0}=\{\rho_{0}\}$ against $H_{1}=\{\rho_{1}\}$ iff $\rho_{0}\neq\rho_{1}$ .)

A more interesting case is identity testing, also known as goodness-of-fit: this problem consists in testing whether the distribution generating the sample obeys a certain given law or does not. Thus, let $\rho\in\mathcal{E}$ , $H_{0}=\{\rho\}$ and $H_{1}=\mathcal{E}\backslash H_{0}$ . In such a case there is an asymmetrically consistent test for $H_{0}$ against $H_{1}$ : indeed, the conditions of Theorem 4.4 are easily verified. It is worth noting that (asymmetric) identity testing is a classical problem of mathematical statistics, with solutions (e.g. based on Pearson’s $\chi^{2}$ statistic) for i.i.d. data (e.g. Lehmann:86 ), and Markov chains Billingsley:61 . For stationary ergodic processes, BRyabko:06b gives an asymmetrically consistent test when $H_{0}$ has a finite and bounded memory, and Ryabko:103s gives one for the general case of stationary ergodic real-valued processes.

As far as uniform testing is concerned, it is, first of all, clear that, just like in the i.i.d. case (cf. Section 3), for any $\rho_{0}$ there is no uniformly consistent test for identity. Indeed, as we have seen (Corollary 4.2), there is no uniformly consistent test for $H_{0}$ against $\mathcal{E}\backslash H_{0}$ provided both hypotheses are non-empty. One might suggest at this point that, as in the i.i.d. case, a uniformly consistent test exists if we restrict $H_{1}$ to those processes that are sufficiently far from $\rho_{0}$ , for example, by introducing some $\varepsilon$ -padding around $H_{0}$ . However, this is not the case. We can prove an even stronger negative result.

Proposition 6.1.

*Let $\rho,\nu\in\mathcal{E}$ , $\rho\neq\nu$ and let $\varepsilon>0$ . There is no uniformly consistent test for $H_{0}=\{\rho\}$ against $H_{1}=\{\nu^{\prime}\in\mathcal{E}:d(\nu^{\prime},\nu)\leq\varepsilon\}$ . *

The following conclusion can be made from this proposition. While distributional distance is well-suited for characterizing those hypotheses for which consistent tests exist, it is not suited for formulating the actual hypotheses.

Apparently, a stronger distance is needed for the latter.

Proof 6.2 (of Proposition 6.1).

*Consider the process $(X_{1},Y_{1}),(X_{2},Y_{2}),\dots$ on pairs $(X_{i},Y_{i})\in A^{2}$ , such that the distribution of $X_{1},X_{2},\dots$ is $\nu$ , the distribution of $Y_{1},Y_{2},\dots$ is $\rho$ and the two components $X_{i}$ and $Y_{i}$ are independent; in other words, the distribution of $(X_{i},Y_{i})_{i\in\mathbb{N}}$ is $\nu\times\rho$ . Consider also a two-state stationary ergodic Markov chain $\mu$ , with two states $1$ and $2$ , whose transition probabilities are $\left(\begin{array}[]{cc}1-p&p\\ q&1-q\end{array}\right)$ , where $0<p,q<1$ . The limiting (and initial) probability of the state $1$ is $q/(p+q)$ and that of the state $2$ is $p/(p+q)$ . Finally, the process $Z_{1},Z_{2},\dots$ is constructed as follows: $Z_{i}=X_{i}$ if $\mu$ is in the state $2$ and $Z_{i}=Y_{i}$ otherwise (here it is assumed that the chain $\mu$ generates a sequence of outcomes independently of $(X_{i},Y_{i})$ ). Clearly, for every $p,q$ satisfying $0<p,q<1$ the process $Z_{1},Z_{2},\dots$ is stationary ergodic. Let $p_{m}:=1/(m+1)$ , $q_{m}:=\delta p_{m}/(1-\delta)$ for all $m\in\mathbb{N}$ , where $\delta$ is a parameter to be defined shortly. Denote $\zeta_{m}$ the distribution of the process $(Z_{i})_{i\in\mathbb{N}}$ with parameters $p_{m},q_{m}$ . With these parameters, $\mu(1)=\delta$ independently of $m$ (i.e., the Markov chain underlying $\zeta_{m}$ spends a fraction $\delta$ of the time in the first state). Find $\delta>0$ sufficiently small so as to have for all $m$ sufficiently large $d(\nu,\zeta_{m})<\varepsilon$ , as is always possible since $\lim_{\delta\to 0}\zeta_{m}=\nu$ uniformly in $m$ . Thus, $\zeta_{m}\in H_{1}$ for all sufficiently large $m$ . However, $\lim_{m\to\infty}\zeta_{m}=\zeta_{\infty}$ where $\zeta_{\infty}$ is the stationary distribution with $W_{\zeta_{\infty}}(\rho)=\delta$ and $W_{\zeta_{\infty}}(\nu)=1-\delta$ . Therefore, $\zeta_{\infty}\in{\operatorname{\scriptstyle\texttt{closure}}}(H_{1})$ and $W_{\zeta_{\infty}}(H_{0})>0$ , so that by Theorem 4.1 there is no uniformly consistent test for $H_{0}$ against $H_{1}$ . *
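The role of the parameters $p_{m},q_{m}$ in this construction can be checked numerically. The sketch below assumes the row-stochastic convention for the transition matrix (row $i$ holds the probabilities of leaving state $i$), under which the stationary probability of state $1$ is $q/(p+q)$; the function names and the value of `delta` are illustrative and not taken from the text.

```python
from fractions import Fraction

def stationary_two_state(p, q):
    """Stationary distribution of the chain with rows (1-p, p) and (q, 1-q)."""
    # Solving pi = pi * P gives pi_1 * p = pi_2 * q, hence:
    return q / (p + q), p / (p + q)

def chain_parameters(m, delta):
    """The proof's choice p_m = 1/(m+1), q_m = delta * p_m / (1 - delta)."""
    p_m = Fraction(1, m + 1)
    q_m = delta * p_m / (1 - delta)
    return p_m, q_m

delta = Fraction(1, 10)  # illustrative value, not from the text
for m in (1, 10, 1000):
    p_m, q_m = chain_parameters(m, delta)
    pi_1, pi_2 = stationary_two_state(p_m, q_m)
    # pi_1 stays equal to delta while both switching probabilities shrink with m
    print(m, p_m, q_m, pi_1, pi_2)
```

Exact rational arithmetic is used so that the identity $\mu(1)=\delta$ holds without rounding; as $m$ grows the chain switches states ever more rarely while the fraction of time spent in each state stays fixed, which is what drives the limit $\zeta_{\infty}$ to be non-ergodic.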

2 Markov and Hidden Markov processes: bounding the order

Let us next consider finite-state Markov and hidden Markov processes.

For any $k$ , there is an asymmetrically consistent test of the hypothesis $\mathcal{M}_{k}$ = “the process is Markov of order not greater than $k$ ” against $\mathcal{E}\backslash\mathcal{M}_{k}$ . For any $k$ , there is an asymmetrically consistent test of $\mathcal{HM}_{k}$ =“the process is given by a Hidden Markov process with not more than $k$ states” against $H_{1}=\mathcal{E}\backslash\mathcal{HM}_{k}$ . Indeed, in both cases ( $k$ -order Markov, Hidden Markov with not more than $k$ states), the hypothesis $H_{0}$ is a parametric family, with a compact set of parameters, and a continuous function mapping parameters to processes (that is, to the space $\mathcal{S}$ ). Since the space $\mathcal{S}$ of stationary processes is compact, Weierstrass theorem then implies that the image of such a compact parameter set is closed (and compact). Moreover, in both cases $H_{0}$ is closed under taking ergodic decompositions. Thus, by Theorem 4.3, there exists an asymmetrically consistent test.

The problem of estimating the order of a (hidden) Markov process based on sampling has been addressed in a number of works. In the context of hypothesis testing, asymmetrically consistent tests for $\mathcal{M}_{k}$ against $\mathcal{M}^{t}$ with $t>k$ were given in Anderson:57 , see also Billingsley:61 . The existence of non-uniformly consistent tests (a notion weaker than that of asymmetric consistency) for $\mathcal{M}_{k}$ against $\mathcal{E}\backslash\mathcal{M}_{k}$ , and of $\mathcal{HM}_{k}$ against $\mathcal{E}\backslash\mathcal{HM}_{k}$ , was established in Kieffer:93 . Asymmetrically consistent tests for $\mathcal{M}_{k}$ against $\mathcal{E}\backslash\mathcal{M}_{k}$ were obtained in BRyabko:06a , while the formulation above, which includes the case of asymmetric testing for $\mathcal{HM}_{k}$ against $\mathcal{E}\backslash\mathcal{HM}_{k}$ , is from Ryabko:121c .

Considering the set $\mathcal{M}_{*}:=\cup_{k\in\mathbb{N}}\mathcal{M}_{k}$ of all finite-memory processes, it is easy to see that there is no asymmetrically consistent test for this set against its complement: indeed, ${\operatorname{\scriptstyle\texttt{closure}}}(\mathcal{M}_{*})=\mathcal{S}$ , so by Corollary 4.3 there is no test. There is also no asymptotically consistent test for this hypothesis, even though it is possible to construct an estimator of the order of a Markov chain that tends to infinity if the process is not Markov; see Morvai:05 and references.

3 Smooth parametric families

From the discussion in the previous example we can see that the following generalization is valid. Let $H_{0}\subset\mathcal{S}$ be a set of processes that is continuously parametrized by a compact set of parameters. If $H_{0}$ is closed under taking ergodic decompositions, then there is an asymmetrically consistent test for $H_{0}$ against $\mathcal{E}\backslash H_{0}$ . In particular, this strengthens the mentioned result of Kieffer:93 , since a stronger notion of consistency is used and a more general class of parametric families is considered.

Clearly, a similar statement can be derived for uniform testing: given two disjoint sets $H_{0}$ and $H_{1}$ each of which is continuously parametrized by a compact set of parameters and is closed under taking ergodic decompositions, there exists a uniformly consistent test of $H_{0}$ against $H_{1}$ .

4 Homogeneity testing or process discrimination

This problem consists in testing, given two samples $X_{1..n}^{1}$ and $X_{1..n}^{2}$ , whether the distributions generating these samples are the same or different. We have considered this problem in detail in Section 4 for the case of asymptotic consistency and stationary ergodic distributions (and $B$ -processes), and in Section 3 for the case of asymmetric and uniform consistency and smaller sets of distributions. The results can be summarized in the following table. Here we omit uniform testing in view of Corollary 4.2.

5 Independence

Again, we are given two samples, $X_{1..n}^{1}$ and $X_{1..n}^{2}$ . The hypothesis of independence is that the first process is independent from the second: $\rho(X^{1}_{1..t}\in T_{1},X^{2}_{1..t}\in T_{2})=\rho(X^{1}_{1..t}\in T_{1})\rho(X^{2}_{1..t}\in T_{2})$ for any $T_{1},T_{2}\subset A^{t}$ and any $t\in\mathbb{N}$ .

Let $\mathcal{I}$ be the set of all stationary ergodic processes (on pairs) satisfying this property.

Proposition 6.3.

*There is no asymmetrically consistent test for independence (for jointly stationary ergodic samples). *

Proof 6.4.

The example is based on the so-called translation process, which is constructed as follows. Fix some irrational $\alpha\in(0,1)$ and select $r_{0}\in[0,1]$ uniformly at random. For each $i=1,2,\dots$ let $r_{i}=(r_{i-1}+\alpha)\mod 1$ (that is, the previous element is shifted by $\alpha$ to the right, considering the $[0,1]$ interval looped). The samples $X_{i}$ are obtained from $r_{i}$ by thresholding at $1/2$ , i.e. $X_{i}:=\mathbb{I}\{r_{i}>0.5\}$ (here $r_{i}$ can be considered hidden states). This process is stationary and ergodic; besides, it has zero entropy rate Shields:98 , and this is not the last of its peculiarities.

Take now two independent copies of this process to obtain a pair $({\bf x}_{1},{\bf x}_{2})=(X_{1}^{1},X_{1}^{2},\dots,X_{n}^{1},X_{n}^{2},\dots)$ . The resulting process on pairs, which we denote $\rho$ , is stationary, but it is not ergodic. To see the latter, observe that the difference between the corresponding hidden states remains constant. In fact, each initial state $(r_{1},r_{2})$ corresponds to an ergodic component of our process on pairs. By the same argument, these ergodic components are not independent. Thus, we have taken two independent copies of a stationary ergodic process, and obtained a stationary process which is not ergodic and whose ergodic components are pairs of processes that are not independent!
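The key observation, that the difference between the two hidden states never changes, is easy to see in simulation. The following is a minimal sketch: the function names, the seed, and the particular value of $\alpha$ are illustrative choices, and floating-point arithmetic only approximates an irrational shift.

```python
import math
import random

def translation_process(alpha, r0, n):
    """Return hidden states r_1..r_n and samples X_i = 1{r_i > 1/2}."""
    states, samples = [], []
    r = r0
    for _ in range(n):
        r = (r + alpha) % 1.0          # shift by alpha on the looped interval
        states.append(r)
        samples.append(1 if r > 0.5 else 0)
    return states, samples

def circular_gap(a, b):
    """Offset from b to a going around the unit circle, in [0, 1)."""
    return (a - b) % 1.0

rng = random.Random(0)
alpha = math.sqrt(2) - 1               # irrational up to floating point
r1, x1 = translation_process(alpha, rng.random(), 1000)
r2, x2 = translation_process(alpha, rng.random(), 1000)

# The offset between the two hidden states is fixed by the initial draw,
# so the pair process never mixes across different offsets.
gaps = [circular_gap(a, b) for a, b in zip(r1, r2)]
print(max(gaps) - min(gaps))           # zero up to rounding error
```

Each value of the offset thus labels a different ergodic component of the pair process, and within a component the second sequence is a deterministic function of the first hidden state, which is why the components are not independent.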

*To apply Corollary 4.4, it remains to show that the process $\rho$ we constructed can be obtained as a limit of stationary ergodic processes on pairs. To see this, consider, for each $\varepsilon$ , a process $\rho_{\varepsilon}$ , whose construction is identical to $\rho$ except that instead of shifting the hidden states by $\alpha$ we shift them by $\alpha+u_{i}^{\varepsilon}$ where $u_{i}^{\varepsilon}$ are i.i.d. uniformly random on $[-\varepsilon,\varepsilon]$ . It is easy to see that $\lim_{\varepsilon\to 0}\rho_{\varepsilon}=\rho$ in distributional distance, and all $\rho_{\varepsilon}$ are stationary ergodic (with independent components). Thus, if $H_{0}=\mathcal{I}$ is the set of all stationary ergodic distributions on pairs whose two components are independent, we have found a distribution $\rho\in{\operatorname{\scriptstyle\texttt{closure}}}(H_{0})$ such that $W_{\rho}(H_{0})=0$ . We can conclude that there is no $\alpha$ -level consistent test for $H_{0}$ against its complement. *

In contrast to the situation with homogeneity testing described in Section 3, testing independence becomes possible if we restrict the processes to be Markov.

Indeed, using the notation of the previous sections, it is easy to see that Theorem 4.3 implies that there exists an asymmetrically consistent test for $\mathcal{I}\cap\mathcal{M}_{k}$ against $\mathcal{E}\backslash\mathcal{I}$ , for any given $k\in\mathbb{N}$ . Analogously, if we confine $H_{0}$ to Hidden Markov processes of a given order, then asymmetric testing is possible. That is, there exists an asymmetrically consistent test for $\mathcal{I}\cap\mathcal{HM}_{k}$ against $\mathcal{E}\backslash\mathcal{I}$ , for any given $k\in\mathbb{N}$ .

7 Open problems

In spite of the rather general results on the existence of tests presented in this chapter, perhaps it would not be an exaggeration to say that the most important questions remain open. This section attempts to state these questions precisely and to summarize them.

1 Relating the notions of consistency

Before delving deeper into problems relating various notions of consistency and generalizing the corresponding results, note that two of the notions of consistency considered, asymmetric ( $\alpha$ -level) consistency and asymptotic consistency, require a certain convergence to hold with probability 1. Naturally, one could replace this convergence with convergence in probability. Let us call the resulting notion weak asymmetric or asymptotic consistency, and those introduced above let us call strong. While weak consistency indeed appears weaker at first sight, it is easy to see, as Nobel Nobel:06 remarks, that weak asymptotic consistency implies strong asymptotic consistency for the case of i.i.d. or strongly mixing processes. It is similarly easy to verify that the same is true for asymmetric consistency. Moreover, for asymmetric consistency, the criterion given in Corollary 4.4 holds equally well for strong and for weak consistency, so in the case $H_{0}=\mathcal{E}\backslash H_{1}$ weak and strong asymmetric consistency are equivalent for stationary ergodic distributions as well. This suggests that these notions may be equivalent in general.

Conjecture 7.1 (weak=strong).

*For stationary ergodic distributions, if there exists a weakly asymmetrically consistent (weakly asymptotically consistent) test, then there exists a strongly asymmetrically consistent (strongly asymptotically consistent) test. *

Passing to the relations between the notions of consistency, it might at first glance seem that asymmetric consistency is rather weak, since one of the errors does not go to zero. However, note that it is fixed at the given level $\alpha$ independently of the sample size, and uniformly over $H_{0}$ , making the resulting notion very strong. In fact, from the discussion on the i.i.d. processes in Section 3, one can see that, for i.i.d. examples, uniform consistency is strictly stronger than asymmetric consistency, and asymmetric consistency is strictly stronger than asymptotic consistency (in terms of the existence of tests). One can conjecture that this is the case for stationary ergodic distributions as well.

Conjecture 7.2 (uniform $\Rightarrow$ asymmetric $\Rightarrow$ asymptotic consistency).

*Let $H_{0},H_{1}\subset\mathcal{E}$ . If there exists a uniformly consistent test for $H_{0}$ against $H_{1}$ , then there exists an asymmetrically consistent test for this pair of hypotheses. If there exists an asymmetrically consistent test for $H_{0}$ against $H_{1}$ , then there exists an asymptotically consistent test for this pair of hypotheses. The opposite implications do not hold. *

Note that the implication “uniform $\Rightarrow$ asymptotic consistency” is rather obvious, and it is also obvious that the opposite does not hold. The question is, therefore, about the place of asymmetric consistency in the middle; more precisely, whether the strict inclusion generalises from the i.i.d. to the stationary ergodic case. It remains open to see whether the relation between the notions of consistency (uniform, asymmetric, asymptotic, weak/strong) that holds for i.i.d. processes carries over to the stationary ergodic case.

2 Characterizing hypotheses for which consistent tests exist

The main open problem that remains is to find necessary and sufficient conditions for the existence of each kind of the tests: uniform, asymmetric, and asymptotic.

Problem 7.3.

*Find necessary and sufficient conditions on hypotheses $H_{0},H_{1}\subset\mathcal{E}$ for the existence of (uniformly, asymmetrically, asymptotically) consistent tests. *

The only case for which the presented necessary and sufficient conditions coincide is the case of asymmetric consistency when $H_{1}=\mathcal{E}\setminus H_{0}$ . It is not known whether the same conditions are necessary and sufficient for general pairs $H_{0},H_{1}$ (i.e., when $H_{1}$ is not necessarily the complement of $H_{0}$ ). However, the fact that for this case we have an “if and only if” criterion, suggests that the topology of the distributional distance is indeed the right one to consider for such characterisations.

Another important problem is to generalize the results of Chapter 4 to real-valued processes.

Problem 7.4.

*Find generalisations of Theorems 4.3, 4.1 to real-valued processes. *

The main difference for the real-valued case is that, in the finite-alphabet case, the distributional distance in the form (4) gives a compact space of distributions. This fact has been relied upon heavily in the proofs of the corresponding theorems. The distributional distance in the form (5) does not result in a compact space of distributions. The general form (2) can give a compact space; indeed, as mentioned in Chapter 1, this is the case if the sets $(B_{i})_{i\in\mathbb{N}}$ form a standard basis. However, as Gray:88 mentions, there is no easy construction of such a basis for the real-valued case, even though such a basis exists. On the other hand, an explicit construction is required in order to speak about distance estimates.

3 Independence testing

Recall the problem of independence from Section 5: given two samples, $X_{1},\dots,X_{n}$ and $Y_{1},\dots,Y_{m}$ , it is required to test whether the process generating the first sample is independent from the one generating the second.

It is interesting to note that for the case of i.i.d. data, the problems of homogeneity testing and independence testing can be reduced to one another. The situation is different for dependent data, as we have seen already for the case of (discrete-state) Markov chains: for these processes, there exists an asymmetric test for independence but not for homogeneity. Moreover, whereas for homogeneity (process discrimination) we have seen in Section 4 that there is no asymptotically consistent test, for independence the question of the existence of such a test remains open.

Thus, we can formulate what is known and what is not known about this problem in the following table, which can be compared to the one about homogeneity testing (Table 1).

Chapter 5 Generalizations

In this chapter we outline a number of generalizations of the results described in this volume. Some of these have already been made, while others present interesting directions for future research.

1 Other distances

The empirical distributional distance on which the results of the previous chapters hinge can be seen as an ordinate way of counting frequencies of everything. One may wonder whether the same theoretical consistency results can be obtained while allowing one to benefit from using some of the more sophisticated tools in the box.

This is, indeed, possible, by considering different distances between processes, and then plugging in their estimates into the same algorithms. Here we try to see what distances can be used and which properties are required. While doing so we are mostly concerned with generalizing the results of Chapters 2 and 3, as the theory of hypothesis testing of Chapter 4 is somewhat more delicate.

Introduce the notation $\rho^{k}$ for the $k$ -dimensional marginal distribution of a time-series distribution $\rho$ .

1 $\operatorname{sum}$ Distances

Observe that the distributional distance $d$ in its more-specified formulations (4) and (5) has the form

$$D(\rho_{1},\rho_{2}):=\sum_{k\in\mathbb{N}}w_{k}\,d_{k}(\rho_{1}^{k},\rho_{2}^{k}),\qquad(1)$$

where $w_{k}$ are summable positive real weights and $d_{k}()$ is a certain distance between $k$ -dimensional marginal distributions.

It is easy to see that distances of this form can be consistently estimated, as long as $d_{k}$ can be consistently estimated for each $k\in\mathbb{N}$ ; this is formalized in the following statement.

Proposition 1.1 (estimating sum-based distances).

*Let $\mathcal{C}$ be a set of process distributions. Let $d_{k},k\in\mathbb{N}$ be a series of distances on the spaces of distributions over $A^{k}$ that are bounded uniformly in $k$ , and such that there exists a series $\hat{d}_{k}(X_{1..n},Y_{1..n}),k\in\mathbb{N}$ of their consistent estimates: $\lim_{n\to\infty}\hat{d}_{k}(X_{1..n},Y_{1..n})=d_{k}(\rho_{1}^{k},\rho_{2}^{k})$ a.s., whenever $\rho_{1},\rho_{2}\in\mathcal{C}$ are chosen to generate the sequences. Then the distance $D$ given by (1) can be consistently estimated using the estimate $\sum_{k\in\mathbb{N}}w_{k}\hat{d}_{k}(X_{1..n},Y_{1..n})$ . *

Clearly, the distributional distance $d$ is an example of a distance in the form (1), and it satisfies the conditions of the proposition with $\mathcal{C}$ being the set of all stationary ergodic processes. Another example is the telescope distance considered in the next subsection.
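The plug-in estimate of Proposition 1.1 can be sketched for finite-alphabet samples as follows. This is an illustration under stated assumptions rather than the construction used in the text: $d_{k}$ is taken to be the $L_1$ distance between empirical frequencies of length-$k$ words, the weights are $w_{k}=1/k(k+1)$, and the sum is truncated at a hypothetical parameter `max_k` (with $d_{k}\leq 2$ the discarded tail is at most $2/(\texttt{max\_k}+1)$).

```python
from collections import Counter

def word_frequencies(sample, k):
    """Empirical frequencies of the length-k words occurring in the sample."""
    total = len(sample) - k + 1
    counts = Counter(tuple(sample[i:i + k]) for i in range(total))
    return {word: c / total for word, c in counts.items()}

def l1_marginal_distance(x, y, k):
    """Plug-in estimate of d_k: L1 distance between empirical k-word frequencies."""
    fx, fy = word_frequencies(x, k), word_frequencies(y, k)
    return sum(abs(fx.get(w, 0.0) - fy.get(w, 0.0)) for w in set(fx) | set(fy))

def sum_distance_estimate(x, y, max_k):
    """Truncated estimate: sum over k <= max_k of w_k * d_k_hat, w_k = 1/(k(k+1))."""
    max_k = min(max_k, len(x), len(y))
    return sum(l1_marginal_distance(x, y, k) / (k * (k + 1))
               for k in range(1, max_k + 1))

# Two periodic binary sequences with the same symbol frequencies
# but different two-letter statistics.
x = [0, 1] * 50
y = [0, 0, 1, 1] * 25
print(l1_marginal_distance(x, y, 1))   # single-symbol frequencies coincide
print(sum_distance_estimate(x, y, 5))  # positive: longer words tell them apart
```

The example illustrates why all marginal orders enter the sum: the two sequences are indistinguishable at $k=1$ and are separated only from $k=2$ on.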

2 Telescope distance

The telescope distance, introduced in Ryabko:13red+ , is, in fact, a scheme for defining distances between processes. In order to define the telescope distance, we first start with a metric on distributions on $A^{k}$ . For two probability distributions $P$ and $Q$ on $(A^{k},{\mathcal{B}}_{k})$ for some $k\in\mathbb{N}$ and a set $\mathcal{H}$ of measurable functions on $A^{k}$ , one can define the distance

$$d_{\mathcal{H}}(P,Q):=\sup_{h\in\mathcal{H}}\left|\mathbb{E}_{P}h-\mathbb{E}_{Q}h\right|.$$

This metric in its general form has been studied since at least Zolotarev:83 and includes Kolmogorov-Smirnov Kolmogorov:33 and Kantorovich-Rubinstein Kantorovich:57 metrics as special cases. It is measurable under mild conditions; in particular, separability of $\mathcal{H}$ is sufficient for this. Moreover, it is easy to check that $d_{\mathcal{H}}$ is a metric on the space of probability distributions over $A^{k}$ if and only if $\mathcal{H}$ generates ${\mathcal{B}}_{k}$ .

Examples of such sets $\mathcal{H}$ are the sets of hyperplanes in $\mathbb{R}^{k}$ , $k\in\mathbb{N}$ .

Based on $d_{\mathcal{H}}$ we can construct a distance between time-series probability distributions. For two time-series distributions $\rho_{1},\rho_{2}$ and sets $\mathcal{H}_{k}$ of functions on $A^{k}$ , $k\in\mathbb{N}$ , we take the $d_{\mathcal{H}_{k}}$ between $k$ -dimensional marginal distributions of $\rho_{1}$ and $\rho_{2}$ for each $k\in\mathbb{N}$ , and sum them all up with decreasing weights.

Definition 1.2 (telescope distance).

For two processes $\rho_{1}$ and $\rho_{2}$ and a sequence of sets of functions $\mathbf{H}=(\mathcal{H}_{1},\mathcal{H}_{2},\dots)$ define the telescope distance

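$$D_{\mathbf{H}}(\rho_{1},\rho_{2}):=\sum_{k=1}^{\infty}w_{k}\,d_{\mathcal{H}_{k}}(\rho_{1}^{k},\rho_{2}^{k}),$$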

where $w_{k}$ , $k\in\mathbb{N}$ is a sequence of positive summable real weights (e.g., the weights we were using before, $w_{k}:=1/k(k+1)$ ).
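As a quick check that these example weights are indeed summable (a worked identity added for illustration), note that the partial sums telescope:

$$\sum_{k=1}^{K}\frac{1}{k(k+1)}=\sum_{k=1}^{K}\left(\frac{1}{k}-\frac{1}{k+1}\right)=1-\frac{1}{K+1}\xrightarrow[K\to\infty]{}1.$$

In particular, the weights sum to $1$, and the terms with index larger than $K$ contribute at most $1/(K+1)$ times a uniform bound on the summands.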

The empirical telescope distance is defined as

[TABLE]

It is shown in Ryabko:13red+ that the empirical telescope distance so defined is a consistent estimate of the telescope distance, if the sets $\mathcal{H}_{k}$ are separable sets of indicator functions of finite VC dimension. The separability condition comes from Adams:12 where the corresponding uniform convergence result is established.

The main appeal of the telescope distance is that it can be estimated using binary classification methods developed for i.i.d. data. Such methods abound in the machine learning literature. Thus, the telescope distance allows one to channel these methods for use in problems involving time series, such as clustering and the three-sample problem considered in Chapters 2, 3.

The details of the algorithms, as well as the proofs and experimental results, can be found in Ryabko:13red+ .
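To make the scheme concrete, here is a rough sketch of a telescope-style estimate. Several things in it are assumptions made for illustration and are not taken from Ryabko:13red+ : each $h$ is averaged over all contiguous length-$k$ windows of a sample, the supremum is taken by brute force over a small finite class rather than by training a classifier, and the class itself (indicators of "window mean at most $t$") is a deliberately crude, hypothetical choice that does not generate ${\mathcal{B}}_{k}$, so the result is only a pseudo-distance.

```python
def windows(sample, k):
    """All contiguous length-k windows of the sample."""
    return [sample[i:i + k] for i in range(len(sample) - k + 1)]

def threshold_class(thresholds):
    """Hypothetical finite class: indicators of {window mean <= t}."""
    return [lambda w, t=t: 1.0 if sum(w) / len(w) <= t else 0.0
            for t in thresholds]

def empirical_d_h(x, y, k, functions):
    """Max over the class of |average of h on x-windows - average of h on y-windows|."""
    wx, wy = windows(x, k), windows(y, k)
    best = 0.0
    for h in functions:
        mean_x = sum(h(w) for w in wx) / len(wx)
        mean_y = sum(h(w) for w in wy) / len(wy)
        best = max(best, abs(mean_x - mean_y))
    return best

def empirical_telescope(x, y, max_k, functions):
    """Weighted sum over k <= max_k with weights w_k = 1/(k(k+1))."""
    max_k = min(max_k, len(x), len(y))
    return sum(empirical_d_h(x, y, k, functions) / (k * (k + 1))
               for k in range(1, max_k + 1))

functions = threshold_class([0.5, 1.5, 2.5])
a = [0, 1] * 50
b = [2, 3] * 50
print(empirical_telescope(a, a, 4, functions))
print(empirical_telescope(a, b, 4, functions))
```

In practice the inner maximization is where a binary classifier would be used: finding the $h$ that best separates windows of one sample from windows of the other is a classification problem with the two samples as classes.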

3 $\operatorname{sup}$ Distances

A different way to construct a distance between time-series distributions based on their finite-dimensional marginals is to use the supremum instead of summation in (1):

$$D(\rho_{1},\rho_{2}):=\sup_{k\in\mathbb{N}}d_{k}(\rho_{1}^{k},\rho_{2}^{k}).\qquad(4)$$

Some commonly used metrics are defined in the form (4) or have natural interpretations in this form, as the following two examples show.

Definition 1.3 (total variation).

*For time-series distributions $\nu,\mu$ the total variation distance between them is defined as $D_{tv}(\mu,\nu):=\sup_{A\in{\mathcal{B}}}|\mu(A)-\nu(A)|$ . *

It is easy to see that $D_{tv}(\mu,\nu)=\sup_{k\in\mathbb{N}}\sup_{A\in{\mathcal{B}}_{k}}|\mu(A)-\nu(A)|$ , so that the total variation distance has the form (4).

For stationary ergodic distributions this distance is not very useful, since it just gives the discrete distance: $D_{tv}(\mu,\nu)=1$ if and only if $\mu\neq\nu$ . This follows from the fact that any two different stationary ergodic distributions are singular with respect to one another.

Another example of a $\operatorname{sup}$ -distance is the $\bar{d}$ distance, defined in Section 4. To see that it is indeed a $\operatorname{sup}$ -distance, consider the following definition of it, which is equivalent to the previous one (see, e.g. Shields:96 ; Ornstein:90 )

[TABLE]

where $P$ is the set of all distributions over $A^{k}\times A^{k}$ generating a pair of samples $X_{1..k},Y_{1..k}$ whose marginal distributions are $\rho_{1}^{k}$ and $\rho_{2}^{k}$ correspondingly.

As explained in Section 4, this distance turns out to be too strong for stationary ergodic processes but still useful for $B$ -processes, since it is only possible to construct its consistent estimates for the latter set.

4 Non-metric distances

So far we have been considering distances that constitute a metric on the space of all process distributions, or on the space of stationary process distributions. In particular, they have the property of exactness, that is $d(\rho_{1},\rho_{2})=0$ if and only if $\rho_{1}=\rho_{2}$ . This allowed us to solve such problems as clustering (with respect to distribution), where we cluster together those and only those samples that were generated by the same distribution.

Sometimes a weaker goal may be appropriate. For example, one may wish to distinguish only between distributions that have different single-dimensional means and variances, or some other characteristics. Depending on the characteristics of the processes studied, it may be more or less straightforward to establish the consistency of their empirical estimates. However, if consistent empirical estimates are available, it should be reasonably straightforward to translate the algorithms and the results on clustering and change-point problems to such distances.
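As an illustration of such a weaker goal, the sketch below defines a pseudo-distance that compares only the single-dimensional mean and variance. The function name and the choice to add the two absolute differences are assumptions made for the example; consistency of the empirical moments for a given class of processes still has to be established separately.

```python
from statistics import fmean, pvariance

def moment_distance(x, y):
    """|mean(x) - mean(y)| + |var(x) - var(y)|.

    Zero whenever the first two single-dimensional moments agree,
    even if the generating distributions differ: not exact, hence not a metric.
    """
    return abs(fmean(x) - fmean(y)) + abs(pvariance(x) - pvariance(y))

x = [0, 1] * 50          # alternating sequence
y = [0, 0, 1, 1] * 25    # same mean and variance, different dynamics
z = [0, 2] * 50          # different mean and variance

print(moment_distance(x, y))
print(moment_distance(x, z))
```

Clustering with such a distance groups `x` and `y` together although they are generated by different processes, which is exactly the loss of exactness described above.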

5 AMS distributions

A particular instance of non-metric distances described in the previous section are distances between the asymptotic-mean distributions of ergodic (non-stationary) or AMS distributions. For non-stationary distributions, in general, one cannot make any inference about the distribution of any initial segment given just one time series sample, which is the case in all the problems we have considered. However, we can make inference about the asymptotic means. We can thus consider the distance between the asymptotic-mean distributions. It is, in fact, the same distributional distance that we have worked with in this volume, only considered as the distance between asymptotic-mean distributions and not the process distributions themselves. Of course, its empirical estimates simply carry over. Note that, considered as a distance between process distributions, it is not a metric, since we can have $d(\rho_{1},\rho_{2})=0$ for $\rho_{1},\rho_{2}$ that are different (but have the same asymptotic mean). With this distinction in mind, all the formulations of basic-inference, clustering and change-point problems translate to this more general setting, with “ergodic” substituted for “stationary ergodic” and “AMS” for “stationary,” and the proofs carry over intact.

2 Piece-wise stationary processes

When dealing with change-point problems (Section 2), we have defined a set of process distributions that can be seen as a generalization of stationary process distributions: piece-wise stationary processes. These are constructed by defining a sequence of integer-valued change points, such that between every two consecutive change points the distribution is stationary (or stationary ergodic).
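A sample from such a process can be generated as in the following sketch. The concrete choices are assumptions made for illustration: each segment is a binary Markov chain started from its stationary law, and the segments are drawn independently of each other, which is a simplification and not a requirement of the definition.

```python
import random

def markov_segment(length, p, q, rng):
    """Binary Markov chain with P(0->1)=p, P(1->0)=q, started from its stationary law."""
    state = 1 if rng.random() < p / (p + q) else 0   # stationary P(state=1) = p/(p+q)
    out = []
    for _ in range(length):
        out.append(state)
        flip = p if state == 0 else q
        if rng.random() < flip:
            state = 1 - state
    return out

def piecewise_stationary(n, change_points, params, rng):
    """Concatenate stationary segments split at the given change points."""
    bounds = [0] + list(change_points) + [n]
    sample = []
    for (start, end), (p, q) in zip(zip(bounds, bounds[1:]), params):
        sample.extend(markov_segment(end - start, p, q, rng))
    return sample

rng = random.Random(0)
sample = piecewise_stationary(
    300, [100, 200], [(0.1, 0.1), (0.5, 0.5), (0.9, 0.2)], rng)
print(len(sample), sum(sample[:100]), sum(sample[100:200]), sum(sample[200:]))
```

Note that the first two segments have the same single-symbol frequencies and differ only in how often the chain switches, so a change of this kind cannot be detected from single-dimensional marginals alone.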

This kind of construction has been widely studied for more restrictive sets of processes, and mainly for i.i.d. processes, resulting in piece-wise i.i.d. models; see, for example Willems:96 ; Gyorgy:12 and references.

For the stationary ergodic case, we have seen that meaningful inference is possible for finitely many change points and linear-sized (in the total sample size $n$ ) segments between change points. While, constrained by the nature of the change-points problems we have considered, we have only dealt with fixed sample size and offline formulations, the distributions can be defined in a similar fashion on infinite sequences. A piece-wise stationary distribution is thus identified with a sequence of stationary distributions and a sequence of change points. A number of inference problems can be formulated about these processes, including versions of the clustering and hypotheses-testing problems considered in this volume. Offline clustering and identity testing appear to be the first interesting problems to explore in this regard.

3 Beyond time series

1 Processes over multiple dimensions

Time series, or discrete-time process distributions that are the subject of this volume, can be seen as discrete-coordinate stochastic processes extending to infinity in one dimension. One can also consider discrete-coordinate multi-dimensional stochastic processes. The concepts of stationarity and ergodicity can be defined similarly to the single-dimensional case. Thus, for a dimension $d\in\mathbb{N}$ , one can consider a process $(X_{u})$ indexed by $u\in\mathbb{N}^{d}$ , over the space $((A^{\infty})^{d},\Omega)$ where $\Omega$ is the Borel sigma-algebra. Such processes are simply probability measures over $((A^{\infty})^{d},\Omega)$ . Stationarity can be defined using shifts $T_{i}$ along each coordinate $i\in\{1..d\}$ . A process measure $\rho$ is called stationary if it is preserved under shifts, that is $\rho(X_{[0,v)}\in B)=\rho(T_{u}X_{[0,v)}\in B)$ for all $u,v\in\mathbb{N}^{d}$ and all Borel $B$ . Ergodic theorems can be established for such processes, see, for example, Krengel:85 . This is all one needs to use empirical estimates of the distributional distance, and thus formulate and solve basic-inference as well as clustering problems, similar to how it is done in Sections 3, 2. The construct of the distributional distance appears to be general enough even for some results on hypothesis testing of Chapter 4 to be generalizable to this setting.

Change-point problems morph into something much more complex, as change points become change boundaries. It thus appears interesting to explore which change-point-like problems admit solutions in this more general setting.

2 Infinite random graphs

Another way to generalize time series is to consider infinite random graphs. The necessary probability-theoretic foundations have been laid out in Aldous:07; lyons2016probability, while the work Benjamini:12 uses these to introduce the notions and establish some basic facts of the ergodic theory on these spaces. It turns out that the distributional distance is a general enough construction to be ported directly to this more general case, and some of the results of this volume, including Theorem 4.3, can be generalized with little extra work. This is done in the work Ryabko:17gratest, which also outlines a number of interesting research directions that emerge in this area.

Bibliography


[1] Terrence M. Adams and Andrew B. Nobel. Uniform approximation of Vapnik-Chervonenkis classes. Bernoulli, 18(4):1310–1319, 2012.
[2] David Aldous and Russell Lyons. Processes on unimodular random networks. Electron. J. Probab., 12:no. 54, 1454–1508, 2007.
[3] P.H. Algoet. Universal schemes for prediction, gambling and portfolio selection. The Annals of Probability, 20(2):901–941, 1992.
[4] T. Anderson and L. Goodman. Statistical inference about Markov chains. Ann. Math. Stat., 28(1):89–110, 1957.
[5] M. Basseville and I.V. Nikiforov. Detection of abrupt changes: theory and application. Prentice Hall information and system sciences series. Prentice Hall, 1993.
[6] Tugkan Batu, Eldar Fischer, Lance Fortnow, Ravi Kumar, Ronitt Rubinfeld, and Patrick White. Testing random variables for independence and identity. In Foundations of Computer Science, 2001. Proceedings. 42nd IEEE Symposium on, pages 442–451. IEEE, 2001.
[7] Tugkan Batu, Ravi Kumar, and Ronitt Rubinfeld. Sublinear algorithms for testing monotone and unimodal distributions. In STOC, volume 4, pages 381–390, 2004.
[8] Itai Benjamini and Nicolas Curien. Ergodic theory on stationary random graphs. Electron. J. Probab., 17:no. 93, 1–20, 2012.
