Spectral Simplicity of Apparent Complexity, Part I: The Nondiagonalizable Metadynamics of Prediction

Paul M. Riechers; James P. Crutchfield

arXiv:1705.08042·nlin.CD·April 18, 2018

TL;DR

This paper develops a spectral analysis framework for complex stochastic processes using meromorphic functional calculus to handle nonnormal, nondiagonalizable operators, enabling new insights into system organization and complexity measures.

Contribution

It introduces the application of meromorphic functional calculus to analyze nonnormal operators in stochastic processes, providing a foundation for explicit complexity calculations.

Findings

01

Spectral decomposition of nonnormal operators is achieved using meromorphic calculus.

02

Special properties of projection operators reveal subprocess organization.

03

Circumvents infinities in traditional spectral analysis methods.

Tables

Table 1: Having identified the hidden linear dynamic, either a discrete-time operator $T$ or continuous-time operator $G$, quantitative questions tend to be either cascading or accumulating type. What changes between distinct questions are the dot products with the initial setup $\bra{\cdot}$ and the final observations $\ket{\cdot}$.

Linear Algebra Underlying Complexity
Question type	Discrete time	Continuous time
Cascading	$\langle \cdot | T^{L} | \cdot \rangle$	$\langle \cdot | e^{tG} | \cdot \rangle$
Accumulating	$\langle \cdot | \bigl( \sum_{L} T^{L} \bigr) | \cdot \rangle$	$\langle \cdot | \bigl( \int e^{tG} \, dt \bigr) | \cdot \rangle$

Table 2: Question genres (leftmost column) about process complexity listed with increasing sophistication. Each genre implies a different linear transition dynamic (rightmost column). Observational questions concern the superficial, given dynamic. Predictability questions are about the observation-induced dynamic over distributions; that is, over states used to generate the superficial dynamic. Prediction questions address the dynamic over distributions over a process’ causally-equivalent histories. Generation questions concern the dynamic over any nonunifilar presentation $\mathcal{M}$.

Questions and Their Linear Dynamics
Genre	Measures	Symbols	Hidden dynamic
Observation	Correlations; power spectra	$\gamma(L)$; $P(\omega)$	HMM matrix $T$
Predictability	Myopic entropy; excess entropy	$h_{\mu}(L)$; $\mathbf{E}$, $\mathbf{E}(\omega)$	HMM MSP matrix $W$
Prediction	Causal synchrony	$C_{\mu}$, $\mathcal{H}^{+}(L)$; $\mathbf{S}$, $\mathbf{S}(\omega)$	$\epsilon$-machine MSP matrix $\mathcal{W}$
Generation	State synchrony	$C_{\mu}(\mathcal{M})$, $\mathcal{H}(L)$, $\mathbf{S}'$	Generator MSP matrix

Table 3: Once we identify the hidden linear dynamic behind our questions, most are either of the cascading or accumulating type. Moreover, if a complexity measure accumulates transients, the Drazin inverse is likely to appear. Interspersed accumulation can be a helpful theoretical tool, since all derivatives and integrals of cascading type can be calculated if we know the modulated accumulation with $z\in\mathbb{C}$. With $z\in\mathbb{C}$, modulated accumulation involves an operator-valued $z$-transform. However, with $z=e^{i\omega}$ and $\omega\in\mathbb{R}$, modulated accumulation involves an operator-valued Fourier transform.

Side annotation of the table: derivatives of cascading (↑), integrals of cascading (↓).

	Discrete time	Continuous time
Cascading	$\langle \cdot | A^{L} | \cdot \rangle$	$\langle \cdot | e^{tG} | \cdot \rangle$
Accumulated transients	$\langle \cdot | \bigl( \sum_{L} (A - A_{1})^{L} \bigr) | \cdot \rangle$	$\langle \cdot | \bigl( \int (e^{tG} - G_{0}) \, dt \bigr) | \cdot \rangle$
Modulated accumulation	$\langle \cdot | \bigl( \sum_{L} (zA)^{L} \bigr) | \cdot \rangle$	$\langle \cdot | \bigl( \int (z e^{G})^{t} \, dt \bigr) | \cdot \rangle$

Equations

\gamma(L) = \bigl\langle \overline{X}_{t} X_{t+L} \bigr\rangle_{t}

P(\omega) = \lim_{N\to\infty} \frac{1}{N} \left\langle \biggl| \sum_{L=1}^{N} X_{L} e^{-i\omega L} \biggr|^{2} \right\rangle ,

H(L) = -\sum_{w \in \mathcal{A}^{L}} \Pr(w) \log_{2} \Pr(w) .

h_{\mu} = \lim_{L \to \infty} H(L) / L .

h_{\mu}(L) = \operatorname{H}[X_{0} \mid X_{1-L} \ldots X_{-1}] .

\mathbf{E} = \operatorname{I}[X_{-\infty:0} ; X_{0:\infty}] .

\mathbf{T} = \sum_{L=0}^{\infty} \bigl[ \mathbf{E} + h_{\mu} L - H(L) \bigr] .

C_{\mu} = \operatorname{H}[\mathcal{S}_{0}^{+}] .

\mathcal{H}^{+}(L) = \operatorname{H}[\mathcal{S}_{0}^{+} \mid X_{-L} \ldots X_{-1}] .

\mathbf{S} = \sum_{L=0}^{\infty} \mathcal{H}^{+}(L) .

\mathcal{H}(L) = \operatorname{H}[\mathcal{R}_{0} \mid X_{-L:0}] ,

\mathcal{H}(0) = C(\mathcal{M}) .

\mathbf{S}' = \sum_{L=0}^{\infty} \bigl[ \mathcal{H}(L) - \mathcal{H} \bigr] .

\mathbf{E} \leq C_{\text{g}} \leq C_{\mu} .

T | 1 \rangle = | 1 \rangle .

\langle \pi | T = \langle \pi | ,

\Pr(w) = \langle \pi | T^{(w)} | 1 \rangle ,

\Pr(X_{t:t+L} = w \mid \mathcal{R}_{t} \sim \eta) = \langle \eta | T^{(w)} | 1 \rangle ,

\Pr(X = w \mid R = \rho_{k}) \neq = \Pr(X = w \mid R = \rho_{j}) .

T_{t_{0} \to t_{0}+t} = e^{tG} .

\Gamma_{x} = \sum_{\rho \in \bm{\mathcal{R}}} \delta_{x, f(\rho)} | \delta_{\rho} \rangle \langle \delta_{\rho} | ,

\bm{\mathcal{R}}_{\pi} = \bigcup_{w \in \mathcal{L}} \frac{\langle \pi | T^{(w)}}{\langle \pi | T^{(w)} | 1 \rangle} .

\langle \eta^{010} |

\langle \eta | T^{(x)} | 1 \rangle > 0

\langle \eta' | = \frac{\langle \eta | T^{(x)}}{\langle \eta | T^{(x)} | 1 \rangle} ,

\Pr(\eta', x \mid \eta) = \langle \eta | T^{(x)} | 1 \rangle .

\mathscr{S}\text{-MSP}(\mathcal{M}) = \bigl( \bm{\mathcal{R}}_{\pi}, \mathcal{A}, \{ W^{(x)} \}_{x\in\mathcal{A}}, \delta_{\pi} \bigr) ,

\bigl\langle \overline{X}_{t} X_{t} \bigr\rangle_{t} = \sum_{x \in \mathcal{A}} |x|^{2} \langle \pi | T^{(x)} | 1 \rangle .

Full text

Spectral Simplicity of Apparent Complexity, Part I:

The Nondiagonalizable Metadynamics of Prediction

Paul M. Riechers

James P. Crutchfield

Complexity Sciences Center

Department of Physics

University of California at Davis

One Shields Avenue, Davis, CA 95616

Abstract

Virtually all questions that one can ask about the behavioral and structural complexity of a stochastic process reduce to a linear algebraic framing of a time evolution governed by an appropriate hidden-Markov process generator. Each type of question—correlation, predictability, predictive cost, observer synchronization, and the like—induces a distinct generator class. Answers are then functions of the class-appropriate transition dynamic. Unfortunately, these dynamics are generically nonnormal, nondiagonalizable, singular, and so on. Tractably analyzing these dynamics relies on adapting the recently introduced meromorphic functional calculus, which specifies the spectral decomposition of functions of nondiagonalizable linear operators, even when the function poles and zeros coincide with the operator’s spectrum. Along the way, we establish special properties of the projection operators that demonstrate how they capture the organization of subprocesses within a complex system. Circumventing the spurious infinities of alternative calculi, this leads in the sequel, Part II, to the first closed-form expressions for complexity measures, couched either in terms of the Drazin inverse (negative-one power of a singular operator) or the eigenvalues and projection operators of the appropriate transition dynamic.

Keywords: hidden Markov model, entropy rate, excess entropy, predictable information, statistical complexity, projection operator, complex analysis, resolvent, Drazin inverse

PACS: 02.50.-r 89.70.+c 05.45.Tp 02.50.Ey 02.50.Ga

Preprint: Santa Fe Institute Working Paper 17-05-XXX; arxiv.org:1705.XXXX [nlin.CD]

I Introduction
II Structured Processes and their Complexities
II.1 Directly observable organization
II.2 Intrinsic predictability
II.3 Prediction overhead
II.4 Generative complexities
III Hidden Markov Models
III.1 Unifilar HMMs
III.2 Minimal unifilar HMMs
III.3 Finitary stochastic process hierarchy
III.4 Continuous-time HMMs
IV Mixed-State Presentations
V Identifying the Hidden Linear Dynamic
V.1 Simple complexity from any presentation
V.2 Predictability from a presentation MSP
V.3 Continuous time?
V.4 Synchronization from generator MSP
V.5 Optimal prediction from $\epsilon$-machine MSP
V.6 Beyond the MSP
V.7 The end?
VI Spectral Theory beyond the Spectral Theorem
VI.1 Spectral primer
VI.2 Eigenprojectors: Left, right, generalized
VI.3 Companion operators and resolvent decomposition
VI.4 Functions of nondiagonalizable operators
VI.5 Evaluating residues
VI.6 Decomposing $A^{L}$
VI.7 Drazin inverse
VII Projection Operators for Stochastic Dynamics
VII.1 Row sums
VII.2 Expected stationary distribution
VIII Spectra by inspection
VIII.1 Eigenvalues
VIII.2 Eigenprojectors from graph structure
IX Conclusion

I Introduction

Complex systems—that is, many-body systems with strong interactions—are usually observed through low-resolution feature detectors. The consequence is that their hidden structure is, at best, only revealed over time. Since individual observations cannot capture the full resolution of each degree of freedom, let alone a sufficiently full set of them, measurement time series often appear stochastic and non-Markovian, exhibiting long-range correlations. Empirical challenges aside, restricting to the purely theoretical domain, even finite systems can appear quite complicated. Despite admitting finite descriptions, stochastic processes with sofic support, to take one example, exhibit infinite-range dependencies among the chain of random variables they generate [1]. While such infinite-correlation processes are legion in complex physical and biological systems, even approximately analyzing them is generally appreciated as difficult, if not impossible. Generically, even finite systems lead to uncountably infinite sets of predictive features [2]. These facts seem to put physical sciences’ most basic goal—prediction—out of reach.

We aim to show that this direct, but sobering conclusion is too bleak. Rather, there is a collection of constructive methods that address hidden structure and the challenges associated with predicting complex systems. This follows up on our recent introduction of a functional calculus that uncovered new relationships among supposedly different complexity measures [3] and that demonstrated the need for a generalized spectral theory to answer such questions [4]. Those efforts yielded elegant, closed-form solutions for complexity measures that, when compared, offered insight into the overall theory of complexity measures. Here, providing the necessary background for and greatly expanding those results, we show that different questions regarding correlation, predictability, and prediction each require their own analytical structures, expressed as various kinds of hidden transition dynamic. The resulting transition dynamic among hidden variables summarizes symmetry breaking, synchronization, and information processing, for example. Each of these metadynamics, though, is built up from the original given system.

The shift in perspective that allows the new level of tractability begins by recognizing that—beyond their ability to generate many sophisticated processes of interest—hidden Markov models can be treated as exact mathematical objects when analyzing the processes they generate. Crucially, and especially when addressing nonlinear processes, most questions that we ask imply a linear transition dynamic over some hidden state space. Speaking simply, something happens, then it evolves linearly in time, then we snapshot a selected characteristic. This broad type of sequential questioning cascades, in the sense that the influence of the initial preparation cascades through state space as time evolves, affecting the final measurement. Alternatively, other, complementary kinds of questioning involve accumulating such cascades. The linear algebra underlying either kind is highlighted in Table 1 in terms of an appropriate discrete-time transition operator $T$ or a continuous-time generator $G$ of time evolution.
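To make the cascading versus accumulating distinction concrete, here is a minimal sketch in plain Python. The two-state chain, its parameters `a` and `b`, and the helper names are assumptions chosen for illustration; they do not come from the paper. The accumulating example sums $T^{L}-T_{1}$, with $T_{1}=|1\rangle\langle\pi|$ the asymptotic part of $T$, which is one simple way to accumulate only transient behavior.

```python
# Illustrative two-state Markov chain (an assumption, not taken from the paper).
a, b = 0.3, 0.1
T = [[1 - a, a],
     [b, 1 - b]]                      # row-stochastic transition matrix
pi = [b / (a + b), a / (a + b)]       # stationary distribution: pi T = pi


def matmul(A, B):
    """Product of two square matrices stored as nested lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def bracket(bra, M, ket):
    """Scalar <bra| M |ket>."""
    n = len(M)
    return sum(bra[i] * M[i][j] * ket[j] for i in range(n) for j in range(n))


def cascade(bra, ket, L):
    """Cascading question: <bra| T^L |ket>."""
    P = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(L):
        P = matmul(P, T)
    return bracket(bra, P, ket)


def accumulated_transient(bra, ket, n_terms=200):
    """Accumulating question: sum over L >= 0 of <bra| (T^L - T_1) |ket>,
    where T_1 = |1><pi| is the eigenvalue-1 part of T."""
    T1 = [[pi[j] for j in range(2)] for _ in range(2)]
    P = [[1.0, 0.0], [0.0, 1.0]]
    total = 0.0
    for _ in range(n_terms):
        D = [[P[i][j] - T1[i][j] for j in range(2)] for i in range(2)]
        total += bracket(bra, D, ket)
        P = matmul(P, T)
    return total


delta0 = [1.0, 0.0]
p_return_2 = cascade(delta0, delta0, 2)          # Pr(state 0 at time 2 | state 0 at time 0)
excess = accumulated_transient(delta0, delta0)   # summed deviation from pi[0]
```

For this chain the transient decays as $(1-a-b)^{L}$, so the truncated sum should approach $(1-\pi_{0})/(a+b)$. The same two functions answer different questions simply by swapping the `bra` and `ket` vectors.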

In this way, deploying linear algebra to analyze complex systems turns on identifying an appropriate hidden state space. And, in turn, the latter depends on the genre of the question. Here, we focus on closed-form expressions for a process’ complexity measures. This determines what the internal system setup $\bra{\cdot}$ and the final detection $\ket{\cdot}$ should be. We show that complexity questions fall into three subgenres and, for each of these, we identify the appropriate linear dynamic and closed-form expressions for several of the key questions in each genre. See Table 2. The burden of the following is to explain the table in detail. We return to a much-elaborated version at the end.

Associating observables ${x}\in\mathcal{A}$ with transitions between hidden states $s\in\bm{\mathcal{S}}$ gives a hidden Markov model (HMM) with observation-labeled transition matrices $\bigl\{T^{({x})}:T^{({x})}_{i,j}=\Pr({x},s_{j}|s_{i})\bigr\}_{{x}\in\mathcal{A}}$. They sum to the row-stochastic state-to-state transition matrix $T=\sum_{{x}\in\mathcal{A}}T^{({x})}$. (The continuous-time versions are defined similarly later on.) Adding measurement symbols ${x}\in\mathcal{A}$ this way, to transitions, can be considered a model of measurement itself. (Footnote 1: While we follow Shannon [12] in this, it differs from the more widely used state-labeled HMMs.) The efficacy of our choice will become clear.

It is important to note that HMMs, in continuous and discrete time, arise broadly in the sciences, from quantum mechanics [6, 7], statistical mechanics [8], and stochastic thermodynamics [9, 10, 11] to communication theory [12, 13], information processing [14, 15, 16], computer design [17], population and evolutionary dynamics [18, 19], and economics. Thus, HMMs appear in the most fundamental physics and in the most applied engineering and social sciences. The breadth suggests that the thorough-going HMM analysis developed here is worth the required effort.

Since complex processes have highly structured, directional transition dynamics— $T$ or $G$ —we encounter the full richness of matrix algebra in analyzing HMMs. We explain how analyzing complex systems induces a nondiagonalizable metadynamics, even if the original dynamic is diagonalizable in its underlying state-space. Normal and diagonalizable restrictions, so familiar in mathematical physics, simply fail us here.

The diversity of nondiagonalizable dynamics presents a technical challenge, though. A new calculus for functions of nondiagonalizable operators—e.g., $T^{L}$ or $e^{tG}$ —becomes a necessity if one’s goal is an exact analysis of complex processes. Moreover, complexity measures naively and easily lead one to consider illegal operations. Taking the inverse of a singular operator is a particularly central, useful, and fraught example. Fortunately, such illegal operations can be skirted since the complexity measures only extract the excess transient behavior of an infinitely complicated orbit space.

To explain how this arises—how certain modes of behavior, such as excess transients, are selected as relevant, while others are ignored—Ref. [4] recently developed a meromorphic functional calculus for analyzing complex processes generated by HMMs. The following shows that this leads to a general spectral theory of weighted directed graphs and that, more specifically, the techniques can be applied to the challenges of prediction. The results developed here greatly extend and (finally) explain those announced in Ref. [3]. The latter introduced the basic methods and results by narrowly focusing on closed-form expressions for several measures of intrinsic computation, applying them to prototype complex systems.

The meromorphic functional calculus, summarized in detail later, concerns functions of nondiagonalizable operators when poles (or zeros) of the function of interest coincide with poles of the operator’s resolvent—poles that appear precisely at the eigenvalues of the transition dynamics. Pole–pole and pole–zero interactions transform the complex-analysis residues within the functional calculus. One notable result is that the negative-one power of a singular operator exists in the meromorphic functional calculus. We derive its form, note that it is the Drazin inverse, and show how widely useful and common it is.

For example, the following gives the first closed-form expressions for many complexity measures in wide use—many of which turn out to be expressed most concisely in terms of a Drazin inverse. Furthermore, spectral decomposition gives insight into subprocesses of a complex system in terms of the projection operators of the appropriate transition dynamic.

To get started, sections §II through §III briefly review relevant background in stochastic processes, the HMMs that generate them, and complexity measures. Several classes of HMMs are discussed in §III. Mixed-state presentations (MSPs)—HMM generators of a process that also track distributions induced by observation—are reviewed in §IV. They are key to complexity measures within an information-theoretic framing. Section §V then shows how each complexity measure reduces to the linear algebra of an appropriate HMM adapted to the question genre.

To make progress at this point, we summarize the meromorphic functional calculus in §VI. Several of its mathematical implications are discussed in relation to projection operators in §VII and a spectral weighted directed graph theory is presented in §VIII.

With this all set out, the sequel Part II finally derives the promised closed-form complexities of a process and outlines common simplifications for special cases. Leveraging the functional calculus, it introduces a novel extension, the complexity measure frequency spectrum, and shows how to calculate it in closed form. It provides a suite of examples to ground the theoretical developments and works through a pedagogical example in depth.

II Structured Processes and their Complexities

We first describe a system of interest in terms of its observed behavior, following the approach of computational mechanics, as reviewed in Ref. [20]. Again, a process is the collection of behaviors that the system produces and their probabilities of occurring. A process’s behaviors are described via a bi-infinite chain of random variables, denoted by capital letters $\ldots\,{X}_{t-2}\,{X}_{t-1}\,{X}_{t}\,{X}_{t+1}\,{X}_{t+2}\ldots$. A realization is indicated by lowercase letters $\ldots\,{x}_{t-2}\,{x}_{t-1}\,{x}_{t}\,{x}_{t+1}\,{x}_{t+2}\ldots$. We assume values ${x}_{t}$ belong to a discrete alphabet $\mathcal{A}$. We work with blocks ${X}_{t:t^{\prime}}$, where the first index is inclusive and the second exclusive: ${X}_{t:t^{\prime}}={X}_{t}\ldots{X}_{t^{\prime}-1}$. We often refer to block realizations ${x}_{t:t^{\prime}}$ as words $w$. At each time $t$, we can speak of the past ${X}_{-\infty:t}=\ldots{X}_{t-2}{X}_{t-1}$ and the future ${X}_{t:\infty}={X}_{t}{X}_{t+1}\ldots$.
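The block convention (first index inclusive, second exclusive) coincides with half-open slicing in many programming languages. The following small Python sketch illustrates it for nonnegative indices only; the realization `x` and the helper `block` are hypothetical and serve purely as an illustration.

```python
# A short hypothetical realization; the values are made up for illustration.
x = [0, 1, 1, 0, 1, 0, 0, 1]


def block(x, t, t_prime):
    """Return the word x_{t:t'} = x_t ... x_{t'-1} (start inclusive, end exclusive).

    Only nonnegative t and t' are meaningful here: Python's negative indices
    count from the end of the list and do not model times before t = 0.
    """
    return tuple(x[t:t_prime])


w = block(x, 2, 5)      # the word x_2 x_3 x_4
length = len(w)         # a block x_{t:t+L} always has length L
```

With this convention, adjacent blocks concatenate without overlap: `block(x, 0, 3) + block(x, 3, 6)` equals `block(x, 0, 6)`.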

A process’s probabilistic specification is a density over these chains: $\mathbb{P}({X}_{-\infty:\infty})$ . Practically, we work with finite blocks and their probability distributions $\Pr({X}_{t:t^{\prime}})$ . To simplify the development, we primarily analyze stationary, ergodic processes: those for which $\Pr({X}_{t:t+L})=\Pr({X}_{0:L})$ for all $t\in\mathbb{Z}$ , $L\in\mathbb{Z}^{+}$ , and all realizations. In such cases, we only need to consider a process’s length- $L$ word distributions $\Pr({X}_{0:L})$ .

II.1 Directly observable organization

A common first step to understand how processes express themselves is to analyze correlations among observables. Pairwise correlation in a sequence of observables is often summarized by the autocorrelation function:

$$\gamma(L) = \bigl\langle \overline{X}_{t} X_{t+L} \bigr\rangle_{t} ,$$

where the bar above ${X}_{t}$ denotes its complex conjugate, and the angled brackets denote an average over all times $t\in\mathbb{Z}$. Alternatively, structure in a stochastic process is often summarized by the power spectral density, also referred to more simply as the power spectrum:

$$P(\omega) = \lim_{N\to\infty} \frac{1}{N} \left\langle \biggl| \sum_{L=1}^{N} X_{L} e^{-i\omega L} \biggr|^{2} \right\rangle ,$$

where $\omega\in\mathbb{R}$ is the angular frequency [21]. Though a basic fact, it is not always sufficiently emphasized in applications that power spectra capture only pairwise correlation. Indeed, it is straightforward to show that the power spectrum $P(\omega)$ is the windowed Fourier transform of the autocorrelation function $\gamma(L)$. That is, power spectra describe how pairwise correlations are distributed across frequencies. Power spectra are common in signal processing, both in technological settings and physical experiments [22]. As a physical example, diffraction patterns are the power spectra of a sequence of structure factors [23].
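As a rough numerical illustration of these two summaries, the sketch below estimates the autocorrelation and a finite-$N$ version of the power spectrum from a single sample sequence. The function names and the period-2 test signal are assumptions made for the example; a single finite sample stands in for the ensemble average and the $N\to\infty$ limit.

```python
import cmath
import math


def autocorrelation(x, L):
    """Time-averaged estimate of gamma(L) = < conj(X_t) X_{t+L} >_t from a finite sample."""
    n = len(x) - L
    return sum(complex(x[t]).conjugate() * x[t + L] for t in range(n)) / n


def periodogram(x, omega):
    """Finite-N estimate (1/N) |sum_{L=1}^{N} X_L exp(-i omega L)|^2 of P(omega)."""
    N = len(x)
    s = sum(x[L - 1] * cmath.exp(-1j * omega * L) for L in range(1, N + 1))
    return abs(s) ** 2 / N


# Hypothetical period-2 signal X_L = (-1)^L for L = 1..N.
N = 64
x = [(-1) ** L for L in range(1, N + 1)]

gamma_0 = autocorrelation(x, 0).real    # perfectly correlated with itself
gamma_1 = autocorrelation(x, 1).real    # perfectly anticorrelated at lag 1
p_zero = periodogram(x, 0.0)            # no power at zero frequency
p_pi = periodogram(x, math.pi)          # all power sits at omega = pi
```

For this signal the alternating autocorrelation shows up in the frequency domain as a single peak at $\omega=\pi$ whose height grows with $N$, which is how a strictly periodic component appears in a power spectrum.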

To monitor transport properties in near-equilibrium thermodynamic systems, the Green–Kubo coefficients are another important measure of observable organization, though a more application-specific one [24, 25]. These coefficients reflect the idea that dissipation depends on correlation structure. They usually appear in the form of integrating the autocorrelation of derivatives of observables. A change of observables, however, turns this into an integration of a standard autocorrelation function. Green–Kubo transport coefficients then involve the limit $\lim_{\omega\to 0}P(\omega)$ for the process of appropriate observables.

One theme in the following is that, though widely used, correlation functions and power spectra give an impoverished view of a process’s structural complexity, since they only consider ensemble averages over pairwise events. Moreover, creating a list of higher-order correlations is an impractical way to summarize complexity, as seen in the connected correlation functions of statistical mechanics [26].

II.2 Intrinsic predictability

Information measures, in contrast, can involve all orders of correlation and thus help to go beyond pairwise correlation in understanding, for example, how a process’ past behavior affects predicting it at later times. Information theory, as developed for general complex processes [1], provides a suite of quantities that capture prediction properties using variants of Shannon’s entropy $\operatorname{H}[\cdot]$ and mutual information $\operatorname{I}[\,\cdot\,;\cdot\,]$ [13] applied to sequences. Each measure answers a specific question about a process’ predictability. For example:

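• How much information is contained in the words generated? The block entropy [1]:

$$H(L) = -\sum_{w \in \mathcal{A}^{L}} \Pr(w) \log_{2} \Pr(w) .$$

• How random is a process? Its entropy rate [27]:

$$h_{\mu} = \lim_{L \to \infty} H(L) / L .$$

• How is the irreducible randomness $h_{\mu}$ approached? Via the myopic entropy rates [1]:

$$h_{\mu}(L) = \operatorname{H}[X_{0} \mid X_{1-L} \ldots X_{-1}] .$$

• How much of the future can be predicted? Its excess entropy [1]:

$$\mathbf{E} = \operatorname{I}[X_{-\infty:0} ; X_{0:\infty}] .$$

• How much information must be extracted to know its predictability and so see its intrinsic randomness $h_{\mu}$? Its transient information [1]:

$$\mathbf{T} = \sum_{L=0}^{\infty} \bigl[ \mathbf{E} + h_{\mu} L - H(L) \bigr] .$$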
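The first three measures in this list can be evaluated by brute force for a small example. The sketch below assumes a hypothetical two-state HMM (often called the Golden Mean process, which never emits two 1s in a row) and computes word probabilities as $\langle\pi|T^{(w)}|1\rangle$. It also uses the standard stationarity identity $h_{\mu}(L)=H(L)-H(L-1)$. Enumerating all words is exponential in $L$, so this is only a sanity check, not the closed-form approach the paper develops.

```python
import itertools
import math

# Hypothetical two-state HMM (the "Golden Mean" process), chosen only for illustration.
# T[x][i][j] = Pr(emit x, go to state j | state i).
T = {
    0: [[0.5, 0.0], [1.0, 0.0]],
    1: [[0.0, 0.5], [0.0, 0.0]],
}
pi = [2 / 3, 1 / 3]   # stationary distribution of T[0] + T[1]


def word_probability(w):
    """Pr(w) = <pi| T^(x_0) ... T^(x_{L-1}) |1>."""
    v = pi[:]
    for x in w:
        v = [sum(v[i] * T[x][i][j] for i in range(2)) for j in range(2)]
    return sum(v)


def block_entropy(L):
    """H(L) = -sum_w Pr(w) log2 Pr(w) over all length-L words."""
    H = 0.0
    for w in itertools.product((0, 1), repeat=L):
        p = word_probability(w)
        if p > 0:
            H -= p * math.log2(p)
    return H


def myopic_entropy_rate(L):
    """h_mu(L) = H[X_0 | previous L-1 symbols] = H(L) - H(L-1), for L >= 1."""
    return block_entropy(L) - block_entropy(L - 1)


H1 = block_entropy(1)
h2 = myopic_entropy_rate(2)
h_mu_estimate = block_entropy(12) / 12   # H(L)/L approaches h_mu only slowly
```

Because this example process is first-order Markov, $h_{\mu}(L)$ already equals the entropy rate of $2/3$ bit per symbol at $L=2$, whereas the ratio $H(L)/L$ still overestimates it at $L=12$.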

The spectral approach, our subject, naturally leads to allied, but new information measures. To give a sense, later we introduce the excess entropy spectrum ${\bf E}(\omega)$ . It completely, yet concisely, summarizes the structure of myopic entropy reduction, in a way similar to how the power spectrum completely describes autocorrelation. However, while the power spectrum summarizes only pairwise linear correlation, the excess entropy spectrum captures all orders of nonlinear dependency between random variables, making it an incisive probe of hidden structure.

Before leaving the measures related to predictability, we must also point out that they have important refinements—measures that lend a particularly useful, even functional, interpretation. These include the bound, ephemeral, elusive, and related informations [28, 29]. Though amenable to the spectral methods of the following, we leave their discussion for another venue. Fortunately, their spectral development is straightforward, but would take us beyond the minimum necessary presentation to make good on the overall discussion of spectral decomposition.

II.3 Prediction overhead

Process predictability measures, as just enumerated, certainly say much about a process’ intrinsic information processing. They leave open, though, the question of the structural complexity associated with implementing prediction. This challenge entails a complementary set of measures that directly address the inherent complexity of actually predicting what is predictable. For that matter, how cryptic is a process?

Computational mechanics describes optimal prediction via a process’ hidden, effective or causal states and transitions, as summarized by the process’s $\epsilon$ -machine [20]. A causal state $\sigma\in\bm{\mathcal{S}}^{+}$ is an equivalence class of histories ${X}_{-\infty:0}$ that all yield the same probability distribution over observable futures ${X}_{0:\infty}$ . Therefore, knowing a process’s current causal state—that $\mathcal{S}_{0}^{+}=\sigma$ , say—is sufficient for optimal prediction.

Computational mechanics provides an additional suite of quantities that capture the overhead of prediction, again using variants of Shannon’s entropy and mutual information applied to the $\epsilon$ -machine. Each also answers a specific question about an observer’s burden of prediction. For example:

• How much historical information must be stored for optimal prediction? The statistical complexity [30]:

$$C_{\mu} = \operatorname{H}[\mathcal{S}_{0}^{+}] .$$

• How unpredictable is a causal state upon observing a process for duration $L$? The myopic causal-state uncertainty [1]:

$$\mathcal{H}^{+}(L) = \operatorname{H}[\mathcal{S}_{0}^{+} \mid X_{-L} \ldots X_{-1}] .$$

• How much information must an observer extract to synchronize to (that is, to know with certainty) the causal state? The optimal predictor’s synchronization information [1]:

$$\mathbf{S} = \sum_{L=0}^{\infty} \mathcal{H}^{+}(L) .$$
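These three quantities can also be checked numerically on a small example. The sketch below assumes a hypothetical two-state unifilar HMM (commonly called the Even Process) as a stand-in for an $\epsilon$-machine. It evaluates $\mathcal{H}^{+}(L)$ by enumerating all length-$L$ words and averaging the entropy of the state distribution each word induces, starting from the stationary distribution. The truncation at $L<8$ is an arbitrary choice for illustration.

```python
import itertools
import math

# Hypothetical two-state unifilar HMM (the "Even Process"), used only as an example.
# T[x][i][j] = Pr(emit x, go to state j | state i); states are (A, B).
T = {
    0: [[0.5, 0.0], [0.0, 0.0]],
    1: [[0.0, 0.5], [1.0, 0.0]],
}
pi = [2 / 3, 1 / 3]   # stationary distribution over the two states


def entropy(dist):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in dist if p > 0)


def state_uncertainty(L):
    """H^+(L): average uncertainty in the current state after observing L symbols."""
    total = 0.0
    for w in itertools.product((0, 1), repeat=L):
        v = pi[:]
        for x in w:
            v = [sum(v[i] * T[x][i][j] for i in range(2)) for j in range(2)]
        p_w = sum(v)                      # Pr(w) = <pi| T^(w) |1>
        if p_w > 0:
            eta = [c / p_w for c in v]    # state distribution induced by w
            total += p_w * entropy(eta)
    return total


C_mu = entropy(pi)                                  # statistical complexity H[S_0]
H_plus = [state_uncertainty(L) for L in range(8)]   # H^+(0), ..., H^+(7)
S_partial = sum(H_plus)                             # truncated synchronization information
```

In this example any observed 0 pins down the state exactly, so only the all-1s words contribute to $\mathcal{H}^{+}(L)$. The uncertainty therefore starts at $C_{\mu}$ and shrinks as longer words are observed.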

Paralleling the purely informational suite of the previous section, we later introduce the optimal synchronization spectrum ${\bf S}(\omega)$ . It completely and concisely summarizes the frequency distribution of state-uncertainty reduction, similar to how the power spectrum $P(\omega)$ completely describes autocorrelation and the excess entropy spectrum ${\bf E}(\omega)$ the myopic entropy reduction. Helpfully, the above optimal prediction measures can be found from the optimal synchronization spectrum.

The structural complexities monitor an observer’s burden in optimally predicting a process. And so, they have practical relevance when an intelligent artificial or biological agent must take advantage of a structured stochastic environment. Examples that come to mind include a Maxwellian Demon taking advantage of correlated environmental fluctuations [31], prey avoiding easy prediction, and profiting from stock market volatility.

Prediction has many natural generalizations. For example, since optimal prediction often requires infinite resources, suboptimal prediction is of practical interest. Fortunately, there are principled ways to investigate the tradeoffs between predictive accuracy and computational burden [32, 33, 34, 2]. As another example, optimal prediction in the presence of noisy or irregular observations can be investigated with a properly generalized framework; see Ref. [35]. Blending the existing tools, resource-limited prediction under such observational constraints can also be investigated. In all of these settings, information measures similar to those listed above are key to understanding and quantifying the tradeoffs arising in prediction.

Having highlighted the difference between prediction and predictability, we can appreciate that some processes hide more internal information (are more cryptic) than others. It turns out that this can be quantified. The crypticity $\chi=C_{\mu}-{\bf E}$ is the difference between a process’s stored information $C_{\mu}$ and the mutual information ${\bf E}$ shared between past and future observables [36]. Operationally, crypticity contrasts predictable information content ${\bf E}$ with an observer’s minimal stored-memory overhead $C_{\mu}$ required to make predictions. To predict what is predictable, therefore, an optimal predictor must account for a process’s crypticity.

II.4 Generative complexities

How does a physical system produce its output process? This depends on many details. Some systems employ vast internal mechanistic redundancy, while others under constraints have optimized internal resources down to a minimally necessary generative structure. Different pressures give rise to different kinds of optimality. For example, minimal state-entropy generators turn out to be distinct from minimal state-set generators [37, 38, 39]. The challenge then is to develop ways to monitor differences in generative mechanism.

Any generative model [40, 1] $\mathcal{M}$ with state-set $\bm{\mathcal{R}}$ has a statistical complexity (state entropy): $C(\mathcal{M})=\operatorname{H}[\mathcal{R}]$. Consider the corresponding myopic state-uncertainty given $L$ sequential observations:

$$\mathcal{H}(L) = \operatorname{H}[\mathcal{R}_{0} \mid X_{-L:0}] ,$$

and so:

$$\mathcal{H}(0) = C(\mathcal{M}) .$$

We also have the asymptotic uncertainty $\mathcal{H}\equiv\lim_{L\to\infty}\mathcal{H}(L)$. Relatedly, there is the excess synchronization information:

$$\mathbf{S}' = \sum_{L=0}^{\infty} \bigl[ \mathcal{H}(L) - \mathcal{H} \bigr] .$$

Such quantities are relevant even when an observer never fully synchronizes to a generative state; i.e., even when $\mathcal{H}>0$ . Finite-state $\epsilon$ -machines always synchronize [41, 42] and so their $\mathcal{H}$ vanishes.

Since many different mechanisms can generate a given process, we need useful bounds on the statistical complexity of possible process generators. For example, the minimal generative complexity $C_{\text{g}}=\min_{\{\bm{\mathcal{R}}\}}C(\mathcal{M})$ is the minimal state-information a physical system must store to generate its future [39]. The predictability and the statistical complexities bound each other:

$$\mathbf{E} \leq C_{\text{g}} \leq C_{\mu} .$$

That is, the predictable future information ${\bf E}$ is less than or equal to the information $C_{\text{g}}$ necessary to produce the future which, in turn, is less than or equal to the information $C_{\mu}$ necessary to predict the future [37, 38, 1, 39]. Such relationships have been explored even for quantum generators of (classical) stochastic processes [43, and references therein].

III Hidden Markov Models

Up to this point, the development focused on introducing and interpreting various information and complexity measures. It was not constructive in that there was no specification of how to calculate these quantities for a given process. To do so requires models or, in the vernacular, a presentation of a process. Fortunately, a common mathematical representation describes a wide class of process generators: the edge-labeled hidden Markov models (HMMs), also known as Mealy HMMs [40]. (Contrast this with the class-equivalent state-labeled HMMs, also known as Moore HMMs [63, 38, 64]. In automata theory, a finite-state HMM is called a probabilistic nondeterministic finite automaton [65]. Information theory [13] refers to them as finite-state information sources. And, stochastic process theory defines them as functions of a Markov chain [50, 66, 46, 67].) Using these as our preferred presentations, we will first classify them and then describe how to calculate the information measures of the processes they generate.

Definition 1.

A finite-state, edge-labeled hidden Markov model $\mathcal{M}=\bigl{\{}\bm{\mathcal{R}},\mathcal{A},\{T^{({x})}\}_{{x}\in\mathcal{A}},\eta_{0}\bigr{\}}$ consists of:

• A finite set of hidden states $\bm{\mathcal{R}}=\left\{\rho_{1},\ldots,\rho_{M}\right\}$. $\mathcal{R}_{t}$ is the random variable for the hidden state at time $t$.

• A finite output alphabet $\mathcal{A}$.

• A set of $M\times M$ symbol-labeled transition matrices $\bigl{\{}T^{({x})}\bigr{\}}_{{x}\in\mathcal{A}}$, where $T^{({x})}_{i,j}=\Pr({x},\rho_{j}|\rho_{i})$ is the probability of transitioning from state $\rho_{i}$ to state $\rho_{j}$ and emitting symbol ${x}$. The corresponding overall state-to-state transition matrix is the row-stochastic matrix $T=\sum_{{x}\in\mathcal{A}}T^{({x})}$.

• An initial distribution over hidden states: $\eta_{0}=\bigl{(}\Pr(\mathcal{R}_{0}=\rho_{1}),\Pr(\mathcal{R}_{0}=\rho_{2}),...,\Pr(\mathcal{R}_{0}=\rho_{M})\bigr{)}$.

The dynamics of such finite-state models are governed by transition matrices amenable to the linear algebra of vector spaces. As a result, bra-ket notation is useful [45]. Bras $\bra{\cdot}$ are row vectors and kets $\ket{\cdot}$ are column vectors. One benefit of the notation is immediately recognizing mathematical object type. For example, on the one hand, any expression that forms a closed bra-ket pair—either $\braket{\cdot}{\cdot}$ or $\bra{\cdot}\cdot\ket{\cdot}$ —is a scalar quantity and commutes as a unit with anything. On the other hand, when useful, an expression of the ket-bra form $\ket{\cdot}\bra{\cdot}$ can be interpreted as a matrix.

$T$’s row-stochasticity means that each of its rows sums to unity. Introducing $\ket{\mathbf{1}}$ as the column vector of all 1s, this can be restated as:

$$T\ket{\mathbf{1}}=\ket{\mathbf{1}}~.$$

This is readily recognized as an eigenequation: $T\ket{\eta}=\lambda\ket{\eta}$ . That is, the all-ones vector $\ket{\mathbf{1}}$ is always a right eigenvector of $T$ associated with the eigenvalue $\lambda$ of unity.

When the internal Markov transition matrix $T$ is irreducible, the Perron-Frobenius theorem guarantees that there is a unique asymptotic state distribution $\pi$ determined by:

$$\bra{\pi}=\bra{\pi}T~,$$

with the further condition that $\pi$ is normalized in probability: $\braket{\pi}{\mathbf{1}}=1$ . This again is recognized as an eigenequation: the asymptotic distribution $\pi$ over the hidden states is $T$ ’s left eigenvector associated with the eigenvalue of unity.
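The two eigenequations above can be checked numerically. The following is a minimal sketch, assuming a hypothetical two-state example (a presentation often called the Even process) that is not taken from this section; the helper names `net_transition` and `stationary` are illustrative only. It sums the symbol-labeled matrices into $T$, confirms that each row sums to one, and solves for $\pi$ with exact rational arithmetic.

```python
from fractions import Fraction as F

# Hypothetical example (not from the text): a two-state presentation often
# called the Even process. T_x[x][i][j] = Pr(x, rho_j | rho_i).
T_x = {
    "0": [[F(1, 2), F(0)], [F(0), F(0)]],
    "1": [[F(0), F(1, 2)], [F(1), F(0)]],
}

def net_transition(T_x):
    """Sum the symbol-labeled matrices into the state-to-state matrix T."""
    n = len(next(iter(T_x.values())))
    return [[sum(Tx[i][j] for Tx in T_x.values()) for j in range(n)]
            for i in range(n)]

def stationary(T):
    """Solve <pi| T = <pi| with <pi|1> = 1 by exact Gauss-Jordan elimination.

    Assumes T is irreducible, so that the solution is unique.
    """
    n = len(T)
    # Row i encodes sum_j pi_j T[j][i] - pi_i = 0 (augmented with right side 0).
    A = [[T[j][i] - (1 if i == j else 0) for j in range(n)] + [F(0)]
         for i in range(n)]
    # One balance equation is redundant; replace it by the normalization.
    A[-1] = [F(1)] * n + [F(1)]
    for c in range(n):
        p = next(r for r in range(c, n) if A[r][c] != 0)
        A[c], A[p] = A[p], A[c]
        pivot = A[c][c]
        A[c] = [v / pivot for v in A[c]]
        for r in range(n):
            if r != c:
                factor = A[r][c]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[c])]
    return [A[i][n] for i in range(n)]

T = net_transition(T_x)
pi = stationary(T)
print([sum(row) for row in T])  # row sums of T, i.e. the entries of T|1>
print(pi)                       # left eigenvector of T for eigenvalue 1
```

Exact fractions are used here only so that the eigenequations hold without rounding; floating-point linear algebra would serve equally well for larger models.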

To describe a stationary process, as done often in the following, the initial hidden-state distribution $\eta_{0}$ is set to the asymptotic one: $\eta_{0}=\pi$ . The resulting process generated is then stationary. Choosing an alternative $\eta_{0}$ is useful in many contexts, but yields a nonstationary process. We avoid this for now for simplicity.

An HMM $\mathcal{M}$ describes a process’ behaviors as a formal language $\mathcal{L}\subseteq\bigcup_{\ell=1}^{\infty}\mathcal{A}^{\ell}$ of allowed realizations. Moreover, $\mathcal{M}$ succinctly describes a process’s word distribution $\Pr(w)$ over all words $w\in\mathcal{L}$ . (Appropriately, $\mathcal{M}$ also assigns zero probability to words outside of the process’ language: $\Pr(w)=0$ for all $w\in\mathcal{L}^{\text{c}}$ , $\mathcal{L}$ ’s complement.) Specifically, the stationary probability of observing a particular length- $L$ word $w={x}_{0}{x}_{1}\ldots{x}_{L-1}$ is given by:

$$\Pr(w)=\bra{\pi}T^{(w)}\ket{\mathbf{1}}~,\tag{1}$$

where $T^{(w)}\equiv T^{({x}_{0})}T^{({x}_{1})}\dotsm T^{({x}_{L-1})}$ .
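As a concrete illustration of the word-probability calculation, here is a small sketch that multiplies a state distribution through the symbol-labeled matrices of a word and then sums the entries (the contraction with $\ket{\mathbf{1}}$). The two-state model, its stationary distribution, and the function names are assumptions made for the example, not part of the text.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical two-state example (the Even process), for illustration only.
T_x = {
    "0": [[F(1, 2), F(0)], [F(0), F(0)]],
    "1": [[F(0), F(1, 2)], [F(1), F(0)]],
}
pi = [F(2, 3), F(1, 3)]  # stationary distribution of T = T^(0) + T^(1)

def vec_mat(v, A):
    """Row vector times matrix."""
    return [sum(v[i] * A[i][j] for i in range(len(v))) for j in range(len(A[0]))]

def word_probability(word, eta, T_x):
    """<eta| T^(x_0) T^(x_1) ... T^(x_{L-1}) |1> for word = x_0 x_1 ... x_{L-1}."""
    v = list(eta)
    for x in word:
        v = vec_mat(v, T_x[x])
    return sum(v)

print(word_probability("11", pi, T_x))   # a word inside the language
print(word_probability("010", pi, T_x))  # zero: this word is outside the language
```

Passing a distribution other than `pi` as `eta` gives the word probability conditioned on a nonstationary state distribution.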

More generally, given a nonstationary state distribution $\eta$ , the subsequent probability of a word is:

$$\Pr(X_{t:t+L}=w|\mathcal{R}_{t}\sim\eta)=\bra{\eta}T^{(w)}\ket{\mathbf{1}}~,\tag{2}$$

where $\mathcal{R}_{t}\sim\eta$ means that the random variable $\mathcal{R}_{t}$ is distributed as $\eta$ [13]. This conditional word probability is used often since, for example, most observations induce a nonstationary distribution over hidden states. Tracking such observation-induced distributions is the role of a related model class, the mixed-state presentation, introduced shortly. To get there, we must first introduce several prerequisite HMM classes. See Fig. 1. The general HMM just discussed is shown in Fig. 1a.

III.1 Unifilar HMMs

An important class of HMMs consists of those that are unifilar. Unifilarity guarantees that, given a start state and a sequence of observations, there is a unique path through the internal states [46]. This, in turn, allows one to directly translate properties of the internal Markov chain into properties of the observed behavior generated from the sequence of edges traversed. Unifilar HMMs are a process’ optimal predictors [47].

In contrast, general (that is, nonunifilar) HMMs have an exponentially growing number of possible state paths as a function of observed word length. Thus, nonunifilar process presentations break almost all quantitative connections between internal dynamics and observations, rendering them markedly less useful process presentations. While they can be used to generate realizations of a given process, they cannot be used to predict a process. Unifilarity is required.

Definition 2.

A finite-state, edge-labeled, unifilar HMM (uHMM) is a finite-state, edge-labeled HMM with the following property:

• Unifilarity: For each state $\rho\in\bm{\mathcal{R}}$ and each symbol ${x}\in\mathcal{A}$ there is at most one outgoing edge from state $\rho$ that emits symbol ${x}$.

(Automata theory would refer to a uHMM as a probabilistic deterministic finite automaton [65]. The awkward terminology does not recommend itself.)

An example is shown in Fig. 1b.
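In matrix terms, the unifilarity property says that every row of every symbol-labeled matrix has at most one nonzero entry. A minimal sketch of that check follows; the function name `is_unifilar` and both example models are hypothetical.

```python
def is_unifilar(T_x):
    """True if each (state, symbol) pair has at most one outgoing edge,
    i.e. each row of each symbol-labeled matrix has at most one nonzero entry."""
    return all(sum(1 for p in row if p != 0) <= 1
               for Tx in T_x.values()
               for row in Tx)

# Hypothetical unifilar example (the Even process).
even = {
    "0": [[0.5, 0.0], [0.0, 0.0]],
    "1": [[0.0, 0.5], [1.0, 0.0]],
}

# Hypothetical nonunifilar example: from state 0, symbol "1" can lead to
# either state, so the state path is not determined by the observations.
nonunifilar = {
    "0": [[0.0, 0.0], [0.5, 0.0]],
    "1": [[0.5, 0.5], [0.0, 0.5]],
}

print(is_unifilar(even), is_unifilar(nonunifilar))
```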

III.2 Minimal unifilar HMMs

Minimal models are not only convenient to use, but very often allow for determining essential informational properties, such as a process’ memory $C_{\mu}$ . A process’ minimal state-entropy uHMM is the same as its minimal-state uHMM. And, the latter turns out to be the process’ $\epsilon$ -machine in computational mechanics [20]. Computational mechanics shows how to calculate a process’ $\epsilon$ -machine from the process’ conditional word distributions. Specifically, $\epsilon$ -machine states, the process’ causal states $\sigma\in\bm{\mathcal{S}}$ , are equivalence classes of histories that yield the same predictions for the future. Explicitly, two histories $\smash{\overleftarrow{{x}}}$ and ${\smash{\overleftarrow{{x}}}}^{\prime}$ map to the same causal state $\epsilon(\smash{\overleftarrow{{x}}})=\epsilon({\smash{\overleftarrow{{x}}}}^{\prime})=\sigma$ if and only if $\Pr(\smash{\overrightarrow{{X}}}|\smash{\overleftarrow{{x}}})=\Pr(\smash{\overrightarrow{{X}}}|{\smash{\overleftarrow{{x}}}}^{\prime})$ . Thus, each causal state comes with a prediction of the future $\Pr(\smash{\overrightarrow{{X}}}|\sigma)$ —its future morph. In short, a process’ $\epsilon$ -machine is its minimal size, optimal predictor.

Converting a given uHMM to its corresponding $\epsilon$ -machine employs probabilistic variants of well-known state-minimization algorithms in automata theory [49]. One can also verify that a given uHMM is minimal by checking that all its states are probabilistically distinct [41, 42].

Definition 3.

A uHMM’s states are probabilistically distinct if for each pair of distinct states $\rho_{k},\rho_{j}\in\bm{\mathcal{R}}$ there exists some finite word $w={x}_{0}{x}_{1}\ldots{x}_{L-1}$ such that:

$$\Pr(X_{0:L}=w|\mathcal{R}_{0}=\rho_{k})\neq\Pr(X_{0:L}=w|\mathcal{R}_{0}=\rho_{j})~.$$

If this is the case, then the process’ uHMM is its $\epsilon$ -machine.

An example is shown in Fig. 1c.

III.3 Finitary stochastic process hierarchy

The finite-state presentations in these classes form a hierarchy in terms of the processes they can finitely generate [37]: Processes( $\epsilon$ -machines) $=$ Processes(uHMMs) $\subset$ Processes(HMMs). That is, finite HMMs generate a strictly larger class of stochastic processes than finite uHMMs. The class of processes generated by finite uHMMs, though, is the same as generated by finite $\epsilon$ -machines.

III.4 Continuous-time HMMs

Though we concentrate on discrete-time processes, many of the process classifications, properties, and calculational methods carry over easily to continuous time. In this setting transition rates are more appropriate than transition probabilities. Continuous-time HMMs can often be obtained as a discrete-time limit $\Delta t\to 0$ of an edge-labeled HMM whose edges operate for a time $\Delta t$ . The most natural continuous-time HMM presentation, though, has a continuous-time generator $G$ of time evolution over hidden states, with observables emitted as deterministic functions of an internal Markov chain: $f:\bm{\mathcal{S}}\to\mathcal{A}$ .

Respecting the continuous-time analogue of probability conservation, each row of $G$ sums to zero. Over a finite time interval $t$ , marginalizing over all possible observations, the row-stochastic state-to-state transition dynamic is:

$$T_{t_{0}\to t_{0}+t}=e^{tG}~.$$

The generated process, a function of the internal continuous-time Markov chain, can also be specified by a set of transition matrices. For this purpose we introduce the continuous-time observation matrices:

[TABLE]

where $\delta_{x,f(\rho)}$ is a Kronecker delta, $\ket{\delta_{\rho}}$ the column vector of all zeros except for a one at the position for state $\rho$ , and $\bra{\delta_{\rho}}$ its transpose $\bigl{(}\ket{\delta_{\rho}}\bigr{)}^{\top}$ . These “projectors” sum to the identity: $\sum_{x\in\mathcal{A}}\Gamma_{x}=I$ .

An example is shown in Fig. 1d.
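The relation between a zero-row-sum generator and a row-stochastic transition matrix can be illustrated numerically. The sketch below assumes a hypothetical two-state rate matrix `G` and observation function `f`, and approximates the matrix exponential with a truncated Taylor series plus scaling and squaring (a generic numerical technique, not something prescribed by the text). It also builds the observation projectors, which by the description above are diagonal indicator matrices.

```python
def mat_mul(A, B):
    """Plain dense matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def expm(A, terms=30, squarings=10):
    """Approximate exp(A) by a truncated Taylor series with scaling and squaring."""
    n = len(A)
    scaled = [[a / 2 ** squarings for a in row] for row in A]
    result, term = identity(n), identity(n)
    for k in range(1, terms):
        # term becomes scaled^k / k!
        term = [[v / k for v in row] for row in mat_mul(term, scaled)]
        result = [[r + t for r, t in zip(rr, tr)] for rr, tr in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

G = [[-1.0, 1.0], [2.0, -2.0]]  # hypothetical rate matrix; each row sums to zero
t = 10.0
P = expm([[t * g for g in row] for row in G])
print([sum(row) for row in P])  # each row of exp(tG) sums to (approximately) one

f = ["a", "b"]  # hypothetical observation function: state index -> symbol
Gamma = {x: [[1.0 if (i == j and f[i] == x) else 0.0 for j in range(2)]
             for i in range(2)] for x in sorted(set(f))}
total = [[sum(Gamma[x][i][j] for x in Gamma) for j in range(2)] for i in range(2)]
print(total)  # the projectors sum to the identity
```

For long times the rows of `P` approach the stationary distribution of the continuous-time chain, here $(2/3, 1/3)$.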

IV Mixed-State Presentations

A given process can be generated by nonunifilar, unifilar, and $\epsilon$ -machine HMM presentations. Within either the unifilar or nonunifilar HMM classes, there can be an unbounded number of presentations that generate the process. A process’ $\epsilon$ -machine is unique, however.

This flexibility suggests that we can create an HMM process generator to answer more refined questions than information generation ( $h_{\mu}$ ) and memory ( $C_{\mu}$ ) calculated from the $\epsilon$ -machine. To this end, we introduce the mixed-state presentation (MSP). An MSP tracks important supplementary information in the hidden states and, through well-crafted dynamics, over the hidden states. In particular, an MSP generates a process while tracking the observation-induced distribution over the states of an alternative process generator. Here, we review only that subset of mixed-state theory required by the following.

Consider an HMM presentation $\mathcal{M}=\bigl{(}\bm{\mathcal{R}},\mathcal{A},\{T^{(x)}\}_{x\in\mathcal{A}},\pi\bigr{)}$ of some process in statistical equilibrium. A mixed state $\eta$ can be any state distribution over $\bm{\mathcal{R}}$, but the uncountable set of points in the most general state-distribution simplex is infinitely more than needed to calculate many complexity measures. How can we monitor the way in which an observer comes to know the HMM state as it sees successive symbols from the process? This is the problem of observer-state synchronization. To analyze this evolution of the observer’s knowledge, we use the set $\bm{\mathcal{R}}_{\pi}$ of mixed states that are induced by all allowed words $w\in\mathcal{L}$ from initial mixed state $\eta_{0}=\pi$:

[TABLE]

The cardinality of $\bm{\mathcal{R}}_{\pi}$ is finite when there are only a finite number of distinct probability distributions over $\mathcal{M}$ ’s states that can be induced by observed sequences, if starting from the stationary distribution $\pi$ .

If $w$ is the first (in lexicographic order) word that induces a particular distribution over $\bm{\mathcal{R}}$ , then we denote this distribution as $\eta^{w}$ . For example, if the two words $010$ and $110110$ both induce the same distribution $\eta$ over $\bm{\mathcal{R}}$ and no word shorter than $010$ induces that distribution, then the mixed state is denoted $\eta^{010}$ . It corresponds to the distribution:

[TABLE]

Since a given observed symbol induces a unique updated distribution from a previous distribution, the dynamic over mixed states is unifilar. Transition probabilities among mixed states can be obtained via Eq. (2). So, if:

[TABLE]

and:

[TABLE]

then:

[TABLE]

These transition probabilities over the mixed states in $\bm{\mathcal{R}}_{\pi}$ are the matrix elements for the observation-labeled transition matrices $\{W^{({x})}\}_{{x}\in\mathcal{A}}$ of $\mathcal{M}$ ’s synchronizing MSP ( $\mathscr{S}$ -MSP):

[TABLE]

where $\delta_{\pi}$ is the distribution over $\bm{\mathcal{R}}_{\pi}$ peaked at the unique start-(mixed)-state $\pi$ . The row-stochastic net mixed-state-to-state transition matrix of $\mathscr{S}$ -MSP( $\mathcal{M}$ ) is $W=\sum_{{x}\in\mathcal{A}}W^{({x})}$ . If irreducible, then there is a unique stationary probability distribution $\bra{\pi_{W}}$ over $\mathscr{S}$ -MSP( $\mathcal{M}$ )’s states obtained by solving $\bra{\pi_{W}}=\bra{\pi_{W}}W$ . We use $\mathcal{R}_{t}$ to denote the random variable for the MSP’s state at time $t$ .
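The construction of the synchronizing MSP can be sketched as a breadth-first search over observation-induced distributions: start from $\pi$, apply each symbol-labeled matrix, normalize, and record the normalization constant as the transition probability. The code below is one possible sketch under stated assumptions: the two-state example model, the function name `build_msp`, and the `max_states` guard are all hypothetical, and exact fractions are used so that equal mixed states are recognized as equal.

```python
from collections import deque
from fractions import Fraction as F

def build_msp(T_x, pi, max_states=1000):
    """Enumerate mixed states reachable from pi and their labeled transitions.

    Returns (states, W_x): the list of mixed states (start state first) and
    the symbol-labeled transition matrices over them. Raises if more than
    max_states mixed states are found, since the set need not be finite.
    """
    start = tuple(pi)
    index = {start: 0}
    edges = []  # (from_index, symbol, to_index, probability)
    queue = deque([start])
    n = len(start)
    while queue:
        eta = queue.popleft()
        for x in sorted(T_x):
            v = [sum(eta[i] * T_x[x][i][j] for i in range(n)) for j in range(n)]
            p = sum(v)  # probability of seeing x from mixed state eta
            if p == 0:
                continue
            nxt = tuple(c / p for c in v)  # updated, normalized distribution
            if nxt not in index:
                if len(index) >= max_states:
                    raise RuntimeError("too many mixed states")
                index[nxt] = len(index)
                queue.append(nxt)
            edges.append((index[eta], x, index[nxt], p))
    m = len(index)
    W_x = {x: [[F(0)] * m for _ in range(m)] for x in T_x}
    for i, x, j, p in edges:
        W_x[x][i][j] = p
    return list(index), W_x

# Hypothetical two-state example (the Even process) with its stationary pi.
T_x = {
    "0": [[F(1, 2), F(0)], [F(0), F(0)]],
    "1": [[F(0), F(1, 2)], [F(1), F(0)]],
}
states, W_x = build_msp(T_x, [F(2, 3), F(1, 3)])
W = [[sum(W_x[x][i][j] for x in W_x) for j in range(len(states))]
     for i in range(len(states))]
print(states)
print([sum(row) for row in W])  # W is row-stochastic
```

Each symbol leads from a mixed state to exactly one successor, so every row of every `W_x[x]` has at most one nonzero entry, which is the unifilarity noted above.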

More generally, we must consider a mixed-state dynamic that starts from a nonpeaked distribution over the hidden-state distribution simplex. This may be counterintuitive, since a distribution over distributions should correspond to a single distribution. However, general MSP theory with a nonpeaked starting distribution over the simplex allows us to consider a weighted average of behaviors originating from disparate histories. And, this is distinct from considering the behavior originating from a weighted average of histories. This more general MSP formalism arises in the closed-form solutions for more sophisticated complexity measures, such as the bound information. This appears in a sequel.

With this brief overview of mixed states, we can now turn to use them. Section V shows that tracking distributions over the states of another generator makes the MSP an ideal algebraic object for closed-form complexity expressions involving conditional entropies, which are measures that require conditional probabilities. Sections II.2 and II.3 showed that many of the complexity measures for predictability and predictive burden are indeed framed as conditional entropies. And so, MSPs are central to their closed-form expressions.

Historically, mixed states were already implicit in Ref. [50], introduced in their modern form by Refs. [37, 38], and have been used recently; e.g., in Refs. [51, 52]. Most of these efforts, however, used mixed states in the specific context of the synchronizing MSP ( $\mathscr{S}$ -MSP). A greatly extended development of mixed-state dynamics appears in Ref. [35]. Different information-theoretic questions require different mixed-state dynamics, each of which is a unifilar presentation. Employing the mathematical methods developed here, we find that desired closed-form solutions are often simple functions of the transition dynamic of an appropriate MSP. The spectral character of the relevant MSP controls the behavior of information-theoretic quantities.

Finally, we emphasize that similar linear algebraic constructions—where hidden states track relevant information—that are nevertheless not MSPs are just as important for answering a different set of questions about a process. Since the other constructions are not directly about predictability and prediction, we report on these findings elsewhere.

V Identifying the Hidden Linear Dynamic

We are now in a position to identify the hidden linear dynamic appropriate to many of the questions that arise in complex systems (their observation, predictability, prediction, and generation), as outlined in Table 2. In part, this section addresses a very practical need for specific calculations. In part, it also lays the foundations for further generalizations, to be discussed at the end. Identifying the linear dynamic means identifying the linear operator $A$ such that a question of interest can be reformulated as either being of the cascading form $\bra{\cdot}A^{n}\ket{\cdot}$ or as an accumulation of such cascading events via $\bra{\cdot}\left(\sum_{n}A^{n}\right)\ket{\cdot}$; recall Table 1. Helpfully, many well-known questions of complexity can be mapped to these archetypal forms. And so, we now proceed to uncover the hidden linear dynamics of the cascading questions approximately in the order they were introduced in § II.

V.1 Simple complexity from any presentation

For observable correlation, any HMM transition operator will do as the linear dynamic. We simply observe, let time (or space) evolve forward, and observe again. Let’s be concrete.

Recall the familiar autocorrelation function. For a discrete-domain process it is [53]:

[TABLE]

where $L\in\mathbb{Z}$ and the bar denotes the complex conjugate. The autocorrelation function is symmetric about $L=0$ , so we can focus on $L\geq 0$ . For $L=0$ , we simply have:

[TABLE]

For $L>0$ , we have:

[TABLE]

Each ‘$*$’ above is a wildcard symbol denoting indifference to the particular symbol observed in its place. That is, the $*$s denote marginalizing over the intervening random variables. (Averaging over $t$ invokes unconditioned word probabilities that must be calculated using the stationary probability $\pi$ over the recurrent states. Effectively this ignores any transient nonstationarity that may exist in a process, since only the recurrent part of the HMM presentation plays a role in the autocorrelation function. One practical lesson is that if $T$ has transient states, they might as well be trimmed prior to such a calculation.) We develop the consequence of this, explicitly calculating and finding:

[TABLE]

The result is the autocorrelation in cascading form $\bra{\cdot}T^{t}\ket{\cdot}$, which can be made particularly transparent by subsuming time-independent factors on the left and right into the bras and kets. Let’s introduce the new row vector:

[TABLE]

and column vector:

[TABLE]

Then, the autocorrelation function for nonzero integer $L$ is simply:

[TABLE]

Clearly, the autocorrelation function is a direct, albeit filtered, signature of iterates of the transition dynamic of any process presentation.
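To make the cascading form concrete, the sketch below computes the autocorrelation for a lag $L\geq 1$ in two ways: by brute-force summation over all words of length $L+1$, and by absorbing the symbol values into a single matrix $A=\sum_{x}x\,T^{(x)}$ so that only $L-1$ powers of $T$ intervene. This is an illustration under assumptions: the two-state model, the real-valued alphabet $\{0,1\}$ (so complex conjugation plays no role), and the function names are all hypothetical, and the matrix expression is simply what marginalizing the word probabilities over the wildcard positions yields.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical two-state example (the Even process); keys are symbol values.
T_x = {
    0: [[F(1, 2), F(0)], [F(0), F(0)]],
    1: [[F(0), F(1, 2)], [F(1), F(0)]],
}
pi = [F(2, 3), F(1, 3)]

def vec_mat(v, A):
    return [sum(v[i] * A[i][j] for i in range(len(v))) for j in range(len(A[0]))]

def autocorrelation(L, T_x, pi):
    """Lag-L correlation (L >= 1) as <pi| A T^(L-1) A |1>, A = sum_x x T^(x)."""
    n = len(pi)
    T = [[sum(Tx[i][j] for Tx in T_x.values()) for j in range(n)] for i in range(n)]
    A = [[sum(x * Tx[i][j] for x, Tx in T_x.items()) for j in range(n)]
         for i in range(n)]
    v = vec_mat(pi, A)          # time-independent factor on the left
    for _ in range(L - 1):
        v = vec_mat(v, T)       # intervening symbols are marginalized by T
    return sum(vec_mat(v, A))   # time-independent factor on the right, then |1>

def autocorrelation_brute(L, T_x, pi):
    """Same quantity by summing x_0 * x_L * Pr(word) over all length-(L+1) words."""
    total = F(0)
    for word in product(T_x, repeat=L + 1):
        v = list(pi)
        for x in word:
            v = vec_mat(v, T_x[x])
        total += word[0] * word[-1] * sum(v)
    return total

print([autocorrelation(L, T_x, pi) for L in range(1, 5)])
```

The brute-force version costs exponentially many word evaluations in $L$, while the matrix version needs only $L-1$ vector-matrix products.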

This result can easily be translated to the continuous-time setting. If the process is represented as a function of a Markov chain and we make the translation that:

[TABLE]

then the autocorrelation function for any $\tau\in\mathbb{R}$ is simply:

[TABLE]

where $G$ is determined from $T$ following §III.4. Again, the autocorrelation function is a direct fingerprint of the transition dynamic over the hidden states.

The power spectrum is a modulated accumulation of the autocorrelation function. With some algebra, one can show that it is:

[TABLE]

Reference [53] showed that for discrete-domain processes the continuous part of the power spectrum is simply:

[TABLE]

where Re $(\cdot)$ denotes the real part of its argument and $I$ is the identity matrix. Similarly, for continuous-domain processes one has:

[TABLE]

Although useful, these signatures of pairwise correlation are only first-order complexity measures. Common measures of complexity that include higher orders of correlation can also be written in the simple cascading form, but require a more careful choice of representation.

V.2 Predictability from a presentation MSP

For example, any HMM presentation allows us to calculate using Eq. (1) a process’s block entropy:

[TABLE]

but at a computational cost $\mathcal{O}\left(|\bm{\mathcal{S}}|^{3}L|\mathcal{A}|^{L}\right)$ exponential in $L$ , due to the exponentially growing number of words in $\mathcal{L}\cap\mathcal{A}^{L}$ . Consequently, using a general HMM one can neither directly nor efficiently calculate many key complexity measures, including a process’s entropy rate and excess entropy.
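The exponential cost is easy to see in a direct implementation. The following sketch enumerates every candidate word of length $L$, evaluates its stationary probability with the word-probability formula, and accumulates the Shannon entropy in bits. The two-state model and the function names are assumptions made for illustration.

```python
from itertools import product
from math import log2

# Hypothetical two-state example (the Even process) and its stationary pi.
T_x = {"0": [[0.5, 0.0], [0.0, 0.0]], "1": [[0.0, 0.5], [1.0, 0.0]]}
pi = [2 / 3, 1 / 3]

def word_probability(word, eta, T_x):
    """<eta| T^(w) |1> by successive vector-matrix products."""
    v = list(eta)
    for x in word:
        v = [sum(v[i] * T_x[x][i][j] for i in range(len(v)))
             for j in range(len(v))]
    return sum(v)

def block_entropy(L, T_x, pi):
    """H(L) in bits, by brute-force enumeration of all |A|^L candidate words."""
    total = 0.0
    for word in product(T_x, repeat=L):
        p = word_probability(word, pi, T_x)
        if p > 0:  # words outside the language contribute nothing
            total -= p * log2(p)
    return total

for L in range(1, 6):
    print(L, block_entropy(L, T_x, pi))
```

By the chain rule, successive differences `block_entropy(L) - block_entropy(L - 1)` give the entropy of the $L^{\text{th}}$ symbol conditioned on the preceding $L-1$ symbols, which is why a cheaper route to those differences is valuable.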

These limitations motivate using more specialized HMM classes. To take one example, it has been known for some time that a process’ entropy rate $h_{\mu}$ can be calculated directly from any of its unifilar presentations [46]. Another is that we can calculate the excess entropy directly from a process’s uHMM forward and reverse states [51, 52]: ${\bf E}=\operatorname{I}[\smash{\overleftarrow{{X}}};\smash{\overrightarrow{{X}}}]=\operatorname{I}[{\mathcal{S}}^{+};{\mathcal{S}}^{-}]$ .

However, efficient computation of myopic entropy rates $h_{\mu}(L)$ remained elusive for some time, and we only recently found their closed-form expression [3]. The myopic entropy rates are important because they represent the apparent entropy rate of a process if it is modeled as a finite Markov order-$(L-1)$ process, which is a very common approximation. Crucially, the difference $h_{\mu}(L)-h_{\mu}$ from the process’ true entropy rate is the surplus entropy rate incurred by using an order-$(L-1)$ Markov approximation. These surplus entropy rates lead directly not only to an apparent loss of predictability, but also to errors in inferred physical properties. These include overestimates of dissipation associated with the surplus entropy rate assigned to a physical thermodynamic system [31].

Unifilarity, it turns out, is not enough to calculate a process’ $h_{\mu}(L)$ directly. Rather, the $\mathscr{S}$ -MSP of any process presentation is what is required. Let’s now develop the closed-form expression for the myopic entropy rates, following Ref. [35].

The length- $L$ myopic entropy rate is the expected uncertainty in the $L^{\text{th}}$ random variable $X_{L-1}$ , given the preceding $L-1$ random variables $X_{0:L-1}$ :

[TABLE]

where, in the second line, we explicitly give the condition $\eta_{0}=\pi$ specifying our ignorance of the initial state. That is, without making any observations we can only assume that the initial distribution $\eta_{0}$ over $\mathcal{M}$’s states is the expected asymptotic distribution $\pi$. For a mixing ergodic process, for example, even if another distribution $\eta_{-N}=\alpha$ was known in the distant past, we still have $\bra{\eta_{0}}=\bra{\eta_{-N}}T^{N}\to\bra{\pi}$, as $N\to\infty$.

Assuming an initial probability distribution over $\mathcal{M}$ ’s states, a given observation sequence induces a particular sequence of updated state distributions. That is, the $\mathscr{S}$ -MSP( $\mathcal{M}$ ) is unifilar regardless of whether $\mathcal{M}$ is unifilar or not. Or, in other words, given the $\mathscr{S}$ -MSP’s unique start state— $\mathcal{R}_{0}=\pi$ —and a particular realization $X_{0:L-1}=w^{L-1}$ of the last $L-1$ random variables, we end up at the particular mixed state $\mathcal{R}_{L-1}=\eta_{w^{L-1}}\in\bm{\mathcal{R}}_{\pi}$ . Moreover, the entropy of the next observation is uniquely determined by $\mathcal{M}$ ’s state distribution, suggesting that Eq. (7) becomes:

[TABLE]

as proven elsewhere [35]. Intuitively, conditioning on all of the past observation random variables is equivalent to conditioning on the random variable for the state distribution induced by particular observation sequences.

We can now recast Eq. (7) in terms of the $\mathscr{S}$ -MSP, finding:

[TABLE]

Here:

[TABLE]

is simply the column vector whose $i^{\text{th}}$ entry is the entropy of transitioning from the $i^{\text{th}}$ state of $\mathscr{S}$ -MSP. Critically, $\ket{\operatorname{H}(W^{\mathcal{A}})}$ is independent of $L$ .

Notice that taking the logarithm of the sum of the entries of the row vector $\bra{\delta_{\eta}}W^{({x})}$ via $\bra{\delta_{\eta}}W^{({x})}\ket{\mathbf{1}}$ is only permissible since $\mathscr{S}$ -MSP’s unifilarity guarantees that $W^{({x})}$ has at most one nonzero entry per row. (We also use the familiar convention that $0\log_{2}0=0$ [13].)

The result is a particularly compact and efficient expression for the length- $L$ myopic entropy rates:

[TABLE]

Thus, all that is required is computing powers of the MSP transition dynamic. The computational cost $\mathcal{O}(L|\bm{\mathcal{R}}_{\pi}|^{3})$ is now only linear in $L$ . Moreover, $W$ is very sparse, especially so with a small alphabet $\mathcal{A}$ . And, this means that the computational cost can be reduced even further via numerical optimization.
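A short numerical sketch may help here. It reads the compact expression as $h_{\mu}(L)=\bra{\delta_{\pi}}W^{L-1}\ket{\operatorname{H}(W^{\mathcal{A}})}$ for $L\geq 1$, which is consistent with the description above but is stated as an assumption since the displayed equation is not reproduced in this copy. The hard-coded matrices are the synchronizing MSP of a hypothetical two-state example (the Even process), with mixed states ordered as $[\pi,\ (1/2,1/2),\ A,\ B]$; the function names are illustrative.

```python
from math import log2

# S-MSP of a hypothetical example (the Even process).
# State order: [pi, (1/2, 1/2), A, B]; index 0 is the unique start state.
W_x = {
    "0": [[0, 0, 1 / 3, 0],
          [0, 0, 1 / 4, 0],
          [0, 0, 1 / 2, 0],
          [0, 0, 0, 0]],
    "1": [[0, 2 / 3, 0, 0],
          [3 / 4, 0, 0, 0],
          [0, 0, 0, 1 / 2],
          [0, 0, 1, 0]],
}

def transition_entropies(W_x):
    """Entry i is the entropy (bits) of the next symbol from mixed state i.

    Row sums of W^(x) give Pr(x | state) because of the MSP's unifilarity.
    """
    n = len(next(iter(W_x.values())))
    out = []
    for i in range(n):
        probs = [sum(Wx[i]) for Wx in W_x.values()]
        out.append(-sum(p * log2(p) for p in probs if p > 0))
    return out

def myopic_entropy_rate(L, W_x):
    """Assumed form: <delta_pi| W^(L-1) |H(W^A)>, for L >= 1."""
    n = len(next(iter(W_x.values())))
    W = [[sum(Wx[i][j] for Wx in W_x.values()) for j in range(n)]
         for i in range(n)]
    v = [1.0] + [0.0] * (n - 1)  # distribution peaked on the start state
    for _ in range(L - 1):
        v = [sum(v[i] * W[i][j] for i in range(n)) for j in range(n)]
    return sum(vi * hi for vi, hi in zip(v, transition_entropies(W_x)))

for L in (1, 2, 3, 10, 60):
    print(L, myopic_entropy_rate(L, W_x))
```

Each additional value of $L$ costs one more vector-matrix product, in contrast to enumerating $|\mathcal{A}|^{L}$ words. For this example the values decrease toward $2/3$ bit per symbol.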

With $h_{\mu}(L)$ in hand, the hierarchy of complexity measures that derive from it immediately follows, including the entropy rate $h_{\mu}$, the excess entropy ${\bf E}$, and the transient information ${\bf T}$ [1]. Specifically, we have:

[TABLE]

The sequel, Part II, discusses these in more detail, introducing their closed-form expressions. To prepare for this, we must first review the meromorphic functional calculus, which is needed for working with the above operators.

V.3 Continuous time?

We saw that correlation measures are easily extended to the continuous-time domain via continuous-time HMMs. Information measures, though, are awkward in continuous time, although progress has been made recently towards understanding their structure [55, 56].

V.4 Synchronization from generator MSP

If a process’ state-space is known, then the $\mathscr{S}$ -MSP of the generating model allows one to track the observation-induced distributions over its states. This naturally leads to closed-form solutions to informational questions about how an observer comes to know, or how it synchronizes to, the system’s states.

To monitor how an observer’s knowledge of a process’ internal state changes with increasing measurements we use the myopic state uncertainty $\mathcal{H}(L)=\operatorname{H}[\mathcal{S}_{0}|{X}_{-L:0}]$ [1]. Expressing it in terms of the $\mathscr{S}$ -MSP, one finds [35]:

[TABLE]

Here, $\operatorname{H}[\eta]$ is the presentation-state uncertainty specified by the mixed state $\eta$ :

[TABLE]

where $\ket{\delta_{s}}$ is the length- $|\bm{\mathcal{S}}|$ column vector of all zeros except for a $1$ at the appropriate index of the presentation-state $s$ .

Continuing, we re-express $\mathcal{H}(L)$ in terms of powers of the $\mathscr{S}$ -MSP transition dynamic:

[TABLE]

Here, we defined:

[TABLE]

which is the $L$ -independent length- $|\bm{\mathcal{R}}_{\pi}|$ column vector whose entries are the appropriately indexed entropies of each mixed state.
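The same machinery gives the myopic state uncertainty once the ket is swapped for the vector of mixed-state entropies. The sketch below assumes the form $\mathcal{H}(L)=\bra{\delta_{\pi}}W^{L}\ket{\operatorname{H}[\eta]}$, which matches the description here but is an assumption since the displayed equations are not reproduced in this copy. It uses the hard-coded synchronizing MSP of a hypothetical two-state example (the Even process, whose presentation is an $\epsilon$-machine), and also accumulates a truncated version of the sum $\sum_{L=0}^{\infty}\mathcal{H}(L)$; all names are illustrative.

```python
from math import log2

# Mixed states of a hypothetical example (the Even process), start state first.
mixed_states = [(2 / 3, 1 / 3), (1 / 2, 1 / 2), (1.0, 0.0), (0.0, 1.0)]

# Net mixed-state-to-mixed-state transition matrix W, same state order.
W = [[0, 2 / 3, 1 / 3, 0],
     [3 / 4, 0, 1 / 4, 0],
     [0, 0, 1 / 2, 1 / 2],
     [0, 0, 1, 0]]

def entropy(dist):
    """Shannon entropy in bits, with the convention 0 log 0 = 0."""
    return -sum(p * log2(p) for p in dist if p > 0)

# L-independent column vector of mixed-state entropies.
H_eta = [entropy(eta) for eta in mixed_states]

def state_uncertainty(L):
    """Assumed form: <delta_pi| W^L |H[eta]>."""
    n = len(W)
    v = [1.0] + [0.0] * (n - 1)
    for _ in range(L):
        v = [sum(v[i] * W[i][j] for i in range(n)) for j in range(n)]
    return sum(vi * hi for vi, hi in zip(v, H_eta))

def synchronization_information(n_terms=200):
    """Truncated sum of H(L) over L = 0 .. n_terms - 1."""
    return sum(state_uncertainty(L) for L in range(n_terms))

print([state_uncertainty(L) for L in range(6)])
print(synchronization_information())
```

In this example the uncertainty never reaches zero at finite $L$ but decays geometrically, so the truncated sum converges quickly.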

The forms of Eqs. (8) and (10) demonstrate that $h_{\mu}(L+1)$ and $\mathcal{H}(L)$ differ only in the type of information being extracted after being evolved by the operator: observable entropy $\ket{\operatorname{H}(W^{\mathcal{A}})}$ or state entropy $\ket{\operatorname{H}[\eta]}$, as implicated by their respective kets. Each of these entropies decreases as the distributions induced by longer observation sequences converge to their asymptotic form. If synchronization is achieved, the latter become delta functions on a single state and the associated entropies vanish.

Paralleling $h_{\mu}(L)$ , there is a complementary hierarchy of complexity measures that are built from functionals of $\mathcal{H}(L)$ . These include the asymptotic state uncertainty $\mathcal{H}$ and excess synchronization information ${\bf S}^{\prime}$ , to mention only two:

[TABLE]

Compared to the $h_{\mu}(L)$ family of measures, $\mathcal{H}$ and ${\bf S}^{\prime}$ mirror the roles of $h_{\mu}$ and ${\bf E}$ , respectively.

The model state-complexity:

[TABLE]

also has an analog in the $h_{\mu}(L)$ hierarchy—the process’ alphabet complexity:

[TABLE]

V.5 Optimal prediction from $\epsilon$ -machine MSP

We just reviewed the linear underpinnings of synchronizing to any model of a process. However, the myopic state uncertainty of the $\epsilon$ -machine has a distinguished role in determining the synchronization cost for optimally predicting a process, regardless of the presentation that generated it. Using the $\epsilon$ -machine’s $\mathscr{S}$ -MSP, the $\epsilon$ -machine myopic state uncertainty can be written in direct parallel to the myopic state uncertainty of any model:

[TABLE]

The script $\mathcal{W}$ emphasizes that we are now specifically working with the state-to-state transition dynamic of the $\epsilon$ -machine’s MSP.

Paralleling $\mathcal{H}(L)$ , an obvious hierarchy of complexity measures is built from functionals of $\mathcal{H}^{+}(L)$ . For example, the $\epsilon$ -machine’s state-complexity is the statistical complexity $C_{\mu}=\mathcal{H}^{+}(0)$ . The information that must be obtained to synchronize to the causal state and thus optimally predict—the causal synchronization information—is given in terms of the $\epsilon$ -machine’s $\mathscr{S}$ -MSP by ${\bf S}=\sum_{L=0}^{\infty}\mathcal{H}^{+}(L)$ .

An important difference when using $\epsilon$ -machine presentations is that they have zero asymptotic state uncertainty:

[TABLE]

Therefore, ${\bf S}={\bf S}^{\prime}(\mbox{$ \epsilon $-machine})$ . Moreover, we conjecture that ${\bf S}=\min_{\mathcal{M}}\sum_{L=0}^{\infty}\mathcal{H}(L)$ for any presentation $\mathcal{M}$ that generates the process, even if $C_{\mu}\geq C_{g}$ .

V.6 Beyond the MSP

Many of the complexity measures use a mixed-state presentation as the appropriate linear dynamic, with particular focus on the $\mathscr{S}$ -MSP. However, we want to emphasize that this is more a reflection of questions that have become common. It does not indicate the general answer that one expects in the broader approach to finding the hidden linear dynamic. Here, we give a brief overview of how other linear dynamics can appear for different types of complexity questions. These have been uncovered recently and will be reported on in more detail in sequels.

First, we found the reverse-time mixed-functional presentation (MFP) of any forward-time generator. The MFP tracks the reverse-time dynamic over linear functionals $\ket{\eta}$ of state distributions induced by reverse-time observations:

[TABLE]

The MFP allows direct calculation of the convergence of the preparation uncertainty $\reflectbox{$ \mathcal{H} $}(L)\equiv\operatorname{H}(\mathcal{S}_{0}|X_{0:L})$ via powers of the linear MFP transition dynamic. The preparation uncertainty in turn gives a new perspective on the transient information since:

[TABLE]

can be interpreted as the predictive advantage of hindsight. Related, the myopic process crypticity $\chi(L)=\reflectbox{$ \mathcal{H} $}^{+}(L)-\mathcal{H}^{+}(L)$ had been previously introduced [36]. Since $\lim_{L\to\infty}\mathcal{H}^{+}(L)=\mathcal{H}^{+}=0$, the asymptotic crypticity is $\chi=\reflectbox{$ \mathcal{H} $}^{+}-\mathcal{H}^{+}=\reflectbox{$ \mathcal{H} $}^{+}$. And, this reveals a refined partitioning underlying the sum:

[TABLE]

Crypticity $\chi=\operatorname{H}(\mathcal{S}_{0}^{+}|X_{0:\infty})$ itself is positive only if the process’ cryptic order:

[TABLE]

is positive. The cryptic order is always less than or equal to its better known cousin, the Markov order $R$ :

[TABLE]

since conditioning can never increase entropy. In the case of cryptic order, we condition on future observations $X_{0:\infty}$ .

The forward-time cryptic operator presentation gives the forward-time observation-induced dynamic over the operators:

[TABLE]

Since the reverse causal state $\mathcal{S}_{0}^{-}$ at time 0 is a linear combination of forward causal states [57, 58], this presentation allows new calculations of the convergence to crypticity that implicate $\Pr(\mathcal{S}_{0}^{+}|X_{-L:\infty})$ .

In fact, the cryptic operator presentation is a special case of the more general myopic bidirectional dynamic over operators:

[TABLE]

induced by new observations of either the future or the past. This is key to understanding the interplay between forgetfulness and shortsightedness: $\Pr(\mathcal{S}_{0}|X_{-M:0},X_{0:N})$ .

The list of these extensions continues. Detailed bounds on entropy-rate convergence are obtained from the transition dynamic of the so-called possibility machine, beyond the asymptotic result obtained in Ref. [42]. And, the importance of post-synchronized monitoring, as quantified by the information lost due to negligence over a duration $\ell$ :

[TABLE]

can be determined using yet another type of modified MSP.

These examples all find an exact solution via a theory parallel to that outlined in the following, but applied to the linear dynamic appropriate for the corresponding complexity question. Furthermore, they highlight the opportunity, enabled by the full meromorphic functional calculus [4], to ask and answer more nuanced and, thus, more probing questions about structure, predictability, and prediction.

V.7 The end?

It would seem that we achieved our goal. We identified the appropriate transition dynamic for common complexity questions and, by some standard, gave formulae for their exact solution. In point of fact, the effort so far has all been in preparation. Although we set the framework up appropriately for linear analysis, closed-form expressions for the complexity measures still await the mathematical developments of the following sections. At the same time, at the level of qualitative understanding and scientific interpretation, we have so far failed to answer the simple question:

• What range of possible behaviors do these complexity measures exhibit?

and the natural follow-up question:

• What mechanisms produce qualitatively different informational signatures?

The following section reviews the recently developed functional calculus that allows us to actually decompose arbitrary functions of the nondiagonalizable hidden dynamic to give conclusive answers to these fundamental questions [4]. We then analyze the range of possible behaviors and identify the internal mechanisms that give rise to qualitatively different contributions to complexity.

The investment in this and the succeeding sections allows Part II to express new closed-form solutions for many complexity measures beyond those achieved to date. In addition to obvious calculational advantages, this also gives new insights into possible behaviors of the complexity measures and, moreover, their unexpected similarities with each other. In many ways, the results shed new light on what we were (implicitly) probing with already-familiar complexity measures. Constructively, this suggests extending complexity magnitudes to complexity functions that succinctly capture the organization to all orders of correlation. Just as our intuition for pairwise correlation grows out of power spectra, so too these extensions unveil the workings of both a process’ predictability and the burden of prediction for an observer.

VI Spectral Theory beyond the Spectral Theorem

Here, we briefly review the spectral decomposition theory from Ref. [4] needed for working with linear operators. As will become clear, it goes significantly beyond the spectral theorem for normal operators.

VI.1 Spectral primer

We restrict our attention to operators that have at most a countably infinite spectrum. Such operators share many features with finite-dimensional square matrices. And so, we review several elementary but essential facts that are used extensively in the following.

Recall that if $A$ is a finite-dimensional square matrix, then $A$ ’s spectrum is simply its set of eigenvalues:

[TABLE]

where det $(\cdot)$ is the determinant of its argument.

For reference later, recall that the algebraic multiplicity $a_{\lambda}$ of eigenvalue $\lambda$ is the power of the term $(z-\lambda)$ in the characteristic polynomial det $(zI-A)$ . In contrast, the geometric multiplicity $g_{\lambda}$ is the dimension of the kernel of the transformation $A-\lambda I$ or the number of linearly independent eigenvectors for the eigenvalue. The algebraic and geometric multiplicities are all equal when the matrix is diagonalizable.

Since there can be multiple subspaces associated with a single eigenvalue, corresponding to different Jordan blocks in the Jordan canonical form, it is structurally important to introduce the index of the eigenvalue to describe the size of its largest-dimension associated subspace.

Definition 4.

The index $\nu_{\lambda}$ of eigenvalue $\lambda$ is the size of the largest Jordan block associated with $\lambda$ .

The index gives information beyond what the algebraic and geometric multiplicities themselves reveal. Nevertheless, for $\lambda\in\Lambda_{A}$ , it is always true that $\nu_{\lambda}-1\leq a_{\lambda}-g_{\lambda}\leq a_{\lambda}-1$ . In the diagonalizable case, $a_{\lambda}=g_{\lambda}$ and $\nu_{\lambda}=1$ for all $\lambda\in\Lambda_{A}$ .
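As a minimal numerical sketch of these three quantities (the helper name `multiplicities_and_index` and the example matrix are our own illustrative choices, not taken from Ref. [4]), one can read $g_{\lambda}$, $a_{\lambda}$, and $\nu_{\lambda}$ off the nullities of successive powers of $A-\lambda I$: the nullity of the first power is $g_{\lambda}$, the nullity stabilizes at $a_{\lambda}$, and the first power at which it stabilizes is $\nu_{\lambda}$. Rank decisions depend on a numerical tolerance, so this is only reliable for small, well-conditioned examples.

```python
import numpy as np

def multiplicities_and_index(A, lam, tol=1e-9):
    """Return (a, g, nu) for eigenvalue lam of the square matrix A.

    Hypothetical helper: uses nullities of (A - lam I)^k, k = 0..n.
    """
    n = A.shape[0]
    N = A - lam * np.eye(n)
    nullities = [
        int(n - np.linalg.matrix_rank(np.linalg.matrix_power(N, k), tol=tol))
        for k in range(n + 1)
    ]
    g = nullities[1]   # dimension of the kernel of A - lam I
    a = nullities[n]   # dimension of the generalized eigenspace
    nu = next(k for k in range(n + 1) if nullities[k] == a)  # largest Jordan block
    return a, g, nu

# lambda = 3 has Jordan blocks of sizes 2 and 1; lambda = 5 is simple.
A = np.array([[3.0, 1.0, 0.0, 0.0],
              [0.0, 3.0, 0.0, 0.0],
              [0.0, 0.0, 3.0, 0.0],
              [0.0, 0.0, 0.0, 5.0]])

a, g, nu = multiplicities_and_index(A, 3.0)
print(a, g, nu)
```

For this matrix the inequality $\nu_{\lambda}-1\leq a_{\lambda}-g_{\lambda}\leq a_{\lambda}-1$ can be checked directly from the returned triple.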

The resolvent:

[TABLE]

defined with the help of the continuous complex variable $z\in\mathbb{C}$ , captures all of the spectral information about $A$ through the poles of the resolvent’s matrix elements. In fact, the resolvent contains more than just the spectrum: the order of each pole gives the index of the corresponding eigenvalue.

Each eigenvalue $\lambda$ of $A$ has an associated projection operator $A_{\lambda}$ , which is the residue of the resolvent as $z\to\lambda$ :

[TABLE]

The residue of the matrix can be calculated elementwise.

The projection operators are orthonormal:

[TABLE]

and sum to the identity:

[TABLE]

For cases where $\nu_{\lambda}=1$ , we found that the projection operator associated with $\lambda$ can be calculated as [4]:

[TABLE]

Not all projection operators of a nondiagonalizable operator can be found directly from Eq. (14), since some have index larger than one. However, if there is only one eigenvalue that has index larger than one—the almost diagonalizable case treated in Part II—then Eq. (14), together with the fact that the projection operators must sum to the identity, does give a full solution to the set of projection operators. Next, we consider the general case, with no restriction on $\nu_{\lambda}$ .
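The almost diagonalizable case can be sketched numerically as follows. The sketch assumes that the index-one projector takes the product form $A_{\lambda}=\prod_{\zeta\neq\lambda}\left[(A-\zeta I)/(\lambda-\zeta)\right]^{\nu_{\zeta}}$; this identity holds for index-one eigenvalues, but that it matches Eq. (14) term by term is our assumption. The matrix, the `index` dictionary, and the function name are illustrative only.

```python
import numpy as np

# Eigenvalue 0.5 has index 2 (one 2x2 Jordan block); eigenvalue 1 has index 1.
A = np.array([[0.5, 1.0, 2.0],
              [0.0, 0.5, 3.0],
              [0.0, 0.0, 1.0]])
I = np.eye(3)
index = {0.5: 2, 1.0: 1}  # assumed known: eigenvalue -> index

def index_one_projector(A, lam, index):
    """Projector for an index-one eigenvalue lam (assumed product form)."""
    n = A.shape[0]
    P = np.eye(n)
    for zeta, nu in index.items():
        if zeta != lam:
            factor = (A - zeta * np.eye(n)) / (lam - zeta)
            P = P @ np.linalg.matrix_power(factor, nu)
    return P

A_1 = index_one_projector(A, 1.0, index)
# The remaining projector follows from completeness: projectors sum to I.
A_half = I - A_1

print(A_1)
print(A_half)
```

Here `A_half` is obtained without ever solving for generalized eigenvectors, which is the practical appeal of the completeness argument.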

VI.2 Eigenprojectors: Left, right, generalized

In general, as we now discuss, an operator’s eigenprojectors can be obtained from all left and right eigenvectors and generalized eigenvectors associated with the eigenvalue. Given the $n$ -tuple of possibly-degenerate eigenvalues $(\Lambda_{A})=(\lambda_{1},\,\lambda_{2},\,\dots\,,\,\lambda_{n})$ , there is a corresponding $n$ -tuple of $m_{k}$ -tuples of linearly-independent generalized right-eigenvectors:

[TABLE]

where:

[TABLE]

and a corresponding $n$ -tuple of $m_{k}$ -tuples of linearly-independent generalized left-eigenvectors:

[TABLE]

where:

[TABLE]

such that:

[TABLE]

and:

[TABLE]

for $0\leq m\leq m_{k}-1$ , where $\ket{\lambda_{j}^{(0)}}=\vec{0}$ and $\bra{\lambda_{j}^{(0)}}=\vec{0}$ . Specifically, $\ket{\lambda_{k}^{(1)}}$ and $\bra{\lambda_{k}^{(1)}}$ are conventional right and left eigenvectors, respectively.

Recall that eigenvalue $\lambda\in\Lambda_{A}$ corresponds to $g_{\lambda}$ different Jordan blocks, where $g_{\lambda}$ is $\lambda$ ’s geometric multiplicity. In fact:

[TABLE]

Moreover, $\lambda$ ’s index $\nu_{\lambda}$ is the size of the largest Jordan block corresponding to $\lambda$ :

[TABLE]

Most directly, the generalized right and left eigenvectors can be found as the nontrivial solutions to:

[TABLE]

and:

[TABLE]

respectively. Imposing appropriate normalization, we find that:

[TABLE]

Crucially, right and left eigenvectors are no longer simply related by complex-conjugate transposition and right eigenvectors are not necessarily orthogonal to each other. Rather, left eigenvectors and generalized eigenvectors form a dual basis to the right eigenvectors and generalized eigenvectors. Somewhat surprisingly, the most generalized left eigenvector $\bra{\lambda_{k}^{(m_{k})}}$ associated with $\lambda_{k}$ is dual to the least generalized right eigenvector $\ket{\lambda_{k}^{(1)}}$ associated with $\lambda_{k}$ :

[TABLE]

Explicitly, we find that the projection operators for a nondiagonalizable matrix can be written as:

[TABLE]

VI.3 Companion operators and resolvent decomposition

It is useful to introduce the generalized set of companion operators:

[TABLE]

for $\lambda\in\Lambda_{A}$ and $m\in\{0,1,2,\dots\}$ . These operators satisfy the following semigroup relation:

[TABLE]

$A_{\lambda,m}$ reduces to the eigenprojector for $m=0$ :

[TABLE]

and it exactly reduces to the zero-matrix for $m\geq\nu_{\lambda}$ :

[TABLE]

Crucially, we can rewrite the resolvent as a weighted sum of the companion matrices $\{A_{\lambda,m}\}$ , with complex coefficients that have poles at each eigenvalue $\lambda$ up to the eigenvalue’s index $\nu_{\lambda}$ :

[TABLE]

Ultimately these results allow us to evaluate arbitrary functions of nondiagonalizable operators, to which we now turn. (Reference [4] gives more background.)
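A quick numerical check of the resolvent decomposition, under two assumptions of ours about the elided equations: that the companion operators are $A_{\lambda,m}=A_{\lambda}(A-\lambda I)^{m}$ and that the resolvent expands as $\sum_{\lambda}\sum_{m=0}^{\nu_{\lambda}-1}A_{\lambda,m}/(z-\lambda)^{m+1}$. The test matrix and the evaluation point `z` are arbitrary illustrative choices.

```python
import numpy as np

A = np.array([[0.5, 1.0, 2.0],
              [0.0, 0.5, 3.0],
              [0.0, 0.0, 1.0]])
I = np.eye(3)

# Projectors: eigenvalue 1 (index 1) via a polynomial in A, eigenvalue 0.5
# (index 2) via completeness.
A_1 = np.linalg.matrix_power((A - 0.5 * I) / (1.0 - 0.5), 2)
A_half = I - A_1

def companion(P, lam, m):
    """Assumed form of the companion operator: P (A - lam I)^m."""
    return P @ np.linalg.matrix_power(A - lam * I, m)

def resolvent_from_companions(z):
    total = np.zeros((3, 3), dtype=complex)
    for P, lam, nu in [(A_1, 1.0, 1), (A_half, 0.5, 2)]:
        for m in range(nu):
            total += companion(P, lam, m) / (z - lam) ** (m + 1)
    return total

z = 2.0 + 1.0j                      # any point off the spectrum
direct = np.linalg.inv(z * I - A)   # resolvent computed directly
print(np.allclose(resolvent_from_companions(z), direct))
```

The second-order pole at $z=0.5$ appears only through the $m=1$ companion term, consistent with the pole order encoding the eigenvalue's index.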

VI.4 Functions of nondiagonalizable operators

The meromorphic functional calculus [4] gives meaning to arbitrary functions $f(\cdot)$ of any linear operator $A$ . Its starting point is the Cauchy-integral-like formula:

[TABLE]

where $C_{\lambda}$ denotes a sufficiently small counterclockwise contour around $\lambda$ in the complex plane such that no singularity of the integrand besides the possible pole at $z=\lambda$ is enclosed by the contour.

Invoking Eq. (23) yields the desired formulation:

[TABLE]

Hence, with the eigenprojectors $\{A_{\lambda}\}_{\lambda\in\Lambda_{A}}$ in hand, evaluating an arbitrary function of the nondiagonalizable operator $A$ comes down to the evaluation of several residues.

Typically, evaluating Eq. (25) requires less work than one might expect when looking at the equation in its full generality. For example, whenever $f(z)$ is holomorphic (i.e., well behaved) at $z=\lambda$ , the residue simplifies to:

[TABLE]

where $f^{(m)}(\lambda)$ is the $m^{\text{th}}$ derivative of $f(z)$ evaluated at $z=\lambda$ . However, if $f(z)$ has a pole or zero at $z=\lambda$ , then it substantially changes the complex contour integration. In the simplest case, when $A$ is diagonalizable and $f(z)$ is holomorphic at $\Lambda_{A}$ , the matrix-valued function reduces to the simple form:

[TABLE]

Moreover, if $\lambda$ is nondegenerate, then:

[TABLE]

although $\bra{\lambda}$ here should be interpreted as the solution to the left eigenequation $\bra{\lambda}A=\lambda\bra{\lambda}$ and, in general, $\bra{\lambda}\neq(\ket{\lambda})^{\dagger}$ .
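For a diagonalizable but nonnormal matrix with nondegenerate eigenvalues, the simple form $f(A)=\sum_{\lambda}f(\lambda)\ket{\lambda}\bra{\lambda}$ can be sketched numerically. In this illustration (our own example matrix), the left eigenvectors are taken as the rows of the inverse of the right-eigenvector matrix, which enforces the biorthonormality $\braket{\lambda_i|\lambda_j}=\delta_{i,j}$ rather than any conjugate-transpose relation.

```python
import numpy as np

# A nonnormal, diagonalizable matrix with eigenvalues 1 and 0.5.
A = np.array([[0.9, 0.1],
              [0.4, 0.6]])

eigvals, right = np.linalg.eig(A)   # columns are right eigenvectors
left = np.linalg.inv(right)         # rows are the dual left eigenvectors

# Eigenprojectors |lambda><lambda| built from the dual pair.
projectors = [np.outer(right[:, i], left[i, :]) for i in range(2)]

def apply_function(f):
    """f(A) = sum_lambda f(lambda) A_lambda for diagonalizable A."""
    return sum(f(lam) * P for lam, P in zip(eigvals, projectors))

A_to_5 = apply_function(lambda z: z ** 5)
A_inverse = apply_function(lambda z: 1.0 / z)

# Right eigenvectors of a nonnormal matrix need not be orthogonal.
overlap = abs(right[:, 0] @ right[:, 1])
print(overlap)
```

The nonzero `overlap` is the numerical signature of nonnormality: using $(\ket{\lambda})^{\dagger}$ in place of the true left eigenvector would give the wrong projectors here.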

The meromorphic functional calculus agrees with the Taylor-series approach whenever the series converges and agrees with the holomorphic functional calculus of Ref. [59] whenever $f(z)$ is holomorphic at $\Lambda_{A}$ . However, when both these functional calculi fail, the meromorphic functional calculus extends the domain of $f(A)$ in a way that is key to the following analysis. We show, for example, that within the meromorphic functional calculus, the negative-one power of a singular operator is the Drazin inverse. The Drazin inverse effectively inverts everything that is invertible. Notably, it appears ubiquitously in the new-found solutions to many complexity measures.

VI.5 Evaluating residues

How does one use Eq. (25)? It says that the spectral decomposition of $f(A)$ reduces to the evaluation of several residues, where:

[TABLE]

So, to make progress with Eq. (25), we must evaluate functional-dependent residues of the form $\text{Res}\left(f(z)/(z-\lambda)^{m+1},\,z\to\lambda\right)$ . This is basic complex analysis. Recall that the residue of a complex-valued function $g(z)$ around its isolated pole $\lambda$ of order $n+1$ can be calculated from:

[TABLE]

VI.6 Decomposing $A^{L}$

Equation (25) allows us to explicitly derive the spectral decomposition of powers of an operator. For $f(A)=A^{L}\to f(z)=z^{L}$ , $z=0$ can be either a zero or a pole of $f(z)$ , depending on the value of $L$ . In either case, an eigenvalue of $\lambda=0$ will distinguish itself in the residue calculation of $A^{L}$ via its unique ability to change the order of the pole (or zero) at $z=0$ .

For example, at this special value of $\lambda$ and for integer $L>0$ , $\lambda=0$ induces poles that cancel with the zeros of $f(z)=z^{L}$ , since $z^{L}$ has a zero at $z=0$ of order $L$ . For integer $L<0$ , an eigenvalue of $\lambda=0$ increases the order of the $z=0$ pole of $f(z)=z^{L}$ . For all other eigenvalues, the residues will be as expected.

Hence, for any $L\in\mathbb{C}$ :

[TABLE]

where $\binom{L}{m}$ is the generalized binomial coefficient:

[TABLE]

with $\binom{L}{0}=1$ and where $[0\in\Lambda_{A}]$ is the Iverson bracket. The latter takes value $1$ if $0$ is an eigenvalue of $A$ and value $0$ if not. Equation (26) applies to any linear operator with only isolated singularities in its resolvent.

If $L$ is a nonnegative integer such that $L\geq\nu_{\lambda}-1$ for all $\lambda\in\Lambda_{A}$ , then:

[TABLE]

where $\binom{L}{m}$ is now reduced to the traditional binomial coefficient $L!/[m!\,(L-m)!]$.
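The role of the zero eigenvalue in $A^{L}$ can be illustrated numerically for nonnegative integer $L$. The sketch below assumes the nonzero eigenvalues contribute $\binom{L}{m}\lambda^{L-m}A_{\lambda,m}$ with $A_{\lambda,m}=A_{\lambda}(A-\lambda I)^{m}$, and that the zero eigenvalue contributes $A_{0}A^{L}$ only while $L<\nu_{0}$; this is our reading of the elided equations, restricted to integer powers. The matrices `J` and `S` are arbitrary illustrative choices that fix the Jordan structure by construction.

```python
import numpy as np
from math import comb

# Jordan form: a 2x2 block for eigenvalue 0.5 and a 2x2 nilpotent block for 0.
J = np.array([[0.5, 1.0, 0.0, 0.0],
              [0.0, 0.5, 0.0, 0.0],
              [0.0, 0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0, 0.0]])
S = np.array([[1.0, 2.0, 0.0, 1.0],
              [1.0, 3.0, 1.0, 1.0],
              [0.0, 0.0, 1.0, 3.0],
              [0.0, 0.0, 1.0, 4.0]])   # invertible (determinant 1)
S_inv = np.linalg.inv(S)
A = S @ J @ S_inv
I = np.eye(4)

# Projectors follow directly from the known similarity transform.
A_half = S @ np.diag([1.0, 1.0, 0.0, 0.0]) @ S_inv
A_zero = S @ np.diag([0.0, 0.0, 1.0, 1.0]) @ S_inv
NU_HALF, NU_ZERO = 2, 2

def power_from_spectrum(L):
    """A^L for integer L >= 0 from projectors and companion operators."""
    total = np.zeros((4, 4))
    for m in range(NU_HALF):
        companion = A_half @ np.linalg.matrix_power(A - 0.5 * I, m)
        total += comb(L, m) * 0.5 ** (L - m) * companion
    if L < NU_ZERO:  # the zero eigenvalue is ephemeral: it vanishes for L >= nu_0
        total += A_zero @ np.linalg.matrix_power(A, L)
    return total

print(all(np.allclose(power_from_spectrum(L), np.linalg.matrix_power(A, L))
          for L in range(8)))
```

After $\nu_{0}$ steps the zero-eigenvalue subspace has been completely annihilated, so only the $\lambda=0.5$ terms remain.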

VI.7 Drazin inverse

The negative-one power of a linear operator is in general not the same as the inverse $\text{inv}(\cdot)$ , since $\text{inv}(A)$ need not exist. However, the negative-one power of a linear operator is always defined via Eq. (26):

[TABLE]

Notably, when the operator is singular, we find that:

[TABLE]

This is the Drazin inverse $A^{\mathcal{D}}$ of $A$, also known as the $\{1^{\nu_{0}},2,5\}$-inverse [60]. (Note that it is not the same as the Moore–Penrose pseudo-inverse.) Although the Drazin inverse is usually defined axiomatically to satisfy certain criteria, here it is naturally derived as the negative-one power of a singular operator in the meromorphic functional calculus.

Whenever $A$ is invertible, however, $A^{-1}=\text{inv}(A)$ . That said, we should not confuse this coincidence with equivalence. More to the point, there is no reason other than accidents of historic notation that the negative-one power should in general be equivalent to the inverse—especially if an operator is not invertible. To avoid confusing $A^{-1}$ with $\text{inv}(A)$ , we use the notation $A^{\mathcal{D}}$ for the Drazin inverse of $A$ . Still, $A^{\mathcal{D}}=\text{inv}(A)$ whenever $0\notin\Lambda_{A}$ .

Although Eq. (29) is a constructive way to build the Drazin inverse, it suggests more work than is actually necessary. We derived several simple constructions for it that require only the original operator and the eigenvalue-$0$ projector. For example, Ref. [4] found that, for any $c\in\mathbb{C}\setminus\{0\}$:

[TABLE]
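A minimal numerical sketch of such a construction follows. It assumes the formula in question has the form $A^{\mathcal{D}}=(I-A_{0})(A+cA_{0})^{-1}$, which is our reconstruction and should be checked against Ref. [4]. The matrices `J` and `S` are illustrative; they give a singular $A$ whose zero eigenvalue has index 2, so that the Drazin inverse should invert only the $\lambda=2$ part.

```python
import numpy as np

# Jordan form: nilpotent 2x2 block for eigenvalue 0, simple eigenvalue 2.
J = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0],
              [0.0, 0.0, 2.0]])
S = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0]])      # invertible (determinant 2)
S_inv = np.linalg.inv(S)
A = S @ J @ S_inv
A_0 = S @ np.diag([1.0, 1.0, 0.0]) @ S_inv   # eigenvalue-0 projector

def drazin(A, A_0, c=1.0):
    """Assumed construction: (I - A_0)(A + c A_0)^{-1} for any nonzero c."""
    n = A.shape[0]
    return (np.eye(n) - A_0) @ np.linalg.inv(A + c * A_0)

A_D = drazin(A, A_0)
# "Inverts everything that is invertible": 2 -> 1/2, zero block -> zero.
expected = S @ np.diag([0.0, 0.0, 0.5]) @ S_inv

print(np.allclose(A_D, expected))
print(np.allclose(A_D, np.linalg.pinv(A)))  # Moore-Penrose differs here
```

The result does not depend on the choice of nonzero `c`, since $cA_{0}$ only regularizes the zero-eigenvalue subspace, which $(I-A_{0})$ subsequently removes.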

Later, we will also need the decomposition of $(I-W)^{\mathcal{D}}$ , as it enters into many closed-form complexity expressions. Reference [4] showed that:

[TABLE]

for any stochastic matrix $T$ . If $T$ is the state-transition matrix of an ergodic process, then the RHS of Eq. (31) becomes especially simple to evaluate since then $T_{1}=\ket{\mathbf{1}}\bra{\pi}$ .

Somewhat tangentially, this connects to the fundamental matrix $Z=(I-T+T_{1})^{-1}$ used by Kemeny and Snell [61] in their analysis of Markovian dynamics. More immediately, Eq. (31) plays a prominent role when deriving excess entropy and synchronization information. The explicit spectral decomposition is also useful:

[TABLE]
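For an ergodic chain, the connection to the Kemeny and Snell fundamental matrix can be checked numerically. The sketch assumes that Eq. (31) reads $(I-T)^{\mathcal{D}}=[I-(T-T_{1})]^{-1}-T_{1}$, i.e., $Z-T_{1}$ with $T_{1}=\ket{\mathbf{1}}\bra{\pi}$; the transition matrix is an arbitrary illustrative example.

```python
import numpy as np

# An ergodic (irreducible, aperiodic) row-stochastic matrix.
T = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.5, 0.5],
              [0.3, 0.0, 0.7]])
I = np.eye(3)

# Stationary distribution: left eigenvector of T for eigenvalue 1.
w, v = np.linalg.eig(T.T)
k = int(np.argmin(np.abs(w - 1.0)))
pi = np.real(v[:, k])
pi = pi / pi.sum()

T_1 = np.outer(np.ones(3), pi)        # |1><pi|
Z = np.linalg.inv(I - T + T_1)        # fundamental matrix
D = Z - T_1                           # candidate Drazin inverse of I - T
G = I - T

# Drazin (here: group-inverse) axioms for the index-one operator G.
print(np.allclose(G @ D @ G, G), np.allclose(D @ G @ D, D), np.allclose(G @ D, D @ G))
```

Only an ordinary matrix inverse and the stationary distribution are needed, which is why the expression is convenient in practice.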

VII Projection Operators for Stochastic Dynamics

The preceding employed the notation that $A$ is a general linear operator. In the following, we reserve $T$ for the operator of a stochastic transition dynamic, as in the state-to-state transition dynamic of an HMM: $T=\sum_{x\in\mathcal{A}}T^{(x)}$ . If the state space is finite and has a stationary distribution, then $T$ has a representation that is a nonnegative row-stochastic—all rows sum to unity—transition matrix.

We are now in a position to summarize several useful properties for the projection operators of any row-stochastic matrix $T$ . Naturally, if one uses column-stochastic instead of row-stochastic matrices, all results can be translated by simply taking the transpose of every line in the derivations. (Recall that $(ABC)^{\top}=C^{\top}B^{\top}A^{\top}$ .)

The transition matrix’s nonnegativity guarantees that for each $\lambda\in\Lambda_{T}$ its complex conjugate $\overline{\lambda}$ is also in $\Lambda_{T}$ . Moreover, the projection operator associated with the complex conjugate of $\lambda$ is the complex conjugate of $T_{\lambda}$ :

[TABLE]

If the dynamic induced by $T$ has a stationary distribution over the state space, then $T$ ’s spectral radius is unity and all its eigenvalues lie on or within the unit circle in the complex plane. The maximal eigenvalues have unity magnitude and $1\in\Lambda_{T}$ . Moreover, an extension of the Perron–Frobenius theorem guarantees that eigenvalues on the unit circle have algebraic multiplicity equal to their geometric multiplicity. And, so, $\nu_{\zeta}=1$ for all $\zeta\in\{\lambda\in\Lambda_{T}:|\lambda|=1\}$ .

$T$ ’s index-one eigenvalue $\lambda=1$ is associated with stationarity of the hidden Markov chain. $T$ ’s other eigenvalues on the unit circle are roots of unity and correspond to deterministic periodicities within the process.

VII.1 Row sums

If $T$ is row-stochastic, then by definition:

[TABLE]

Hence, via the general eigenprojector construction Eq. (18) and the general orthogonality condition Eq. (17), we find that:

[TABLE]

This shows that $T$ ’s projection operator $T_{1}$ is row-stochastic, whereas each row of every other projection operator must sum to zero. This can also be viewed as a consequence of conservation of probability for dynamics over Markov chains.
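The row-sum property, together with the complex-conjugate pairing of projectors noted earlier in this section, can be verified on a small example. The matrix below is an illustrative choice with one eigenvalue at unity and a complex-conjugate pair inside the unit circle; projectors are built from right eigenvectors and their dual left eigenvectors, which is valid here because the eigenvalues are distinct.

```python
import numpy as np

T = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.5, 0.5],
              [0.3, 0.0, 0.7]])
ones = np.ones(3)

eigvals, right = np.linalg.eig(T)   # columns: right eigenvectors
left = np.linalg.inv(right)         # rows: dual left eigenvectors
projectors = [np.outer(right[:, i], left[i, :]) for i in range(3)]

# Row sums: T_lambda |1> should be |1> for lambda = 1 and 0 otherwise.
row_sums = [P @ ones for P in projectors]
for lam, s in zip(eigvals, row_sums):
    print(np.round(lam, 4), np.round(s, 10))
```

The printed row sums vanish for every eigenvalue other than unity, as conservation of probability requires.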

VII.2 Expected stationary distribution

If unity is the only eigenvalue of $\Lambda_{T}$ on the unit circle, then the process has no deterministic periodicities. In this case, every initial condition leads to a stationary asymptotic distribution. The expected stationary distribution $\pi_{\alpha}$ from any initial distribution $\alpha$ is:

[TABLE]

An attractive feature of Eq. (34) is that it holds even for nonergodic processes—those with multiple stationary components.
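A small nonergodic example makes this concrete. The chain below (our own illustration) has two absorbing states and one transient state, so $\lambda=1$ is doubly degenerate but still index one. The sketch takes the expected stationary distribution to be $\bra{\pi_{\alpha}}=\bra{\alpha}T_{1}$ and obtains $T_{1}$ from the assumed index-one product formula, which for the spectrum $\{1,0.5\}$ reduces to $(T-0.5I)/(1-0.5)$.

```python
import numpy as np

# States 0 and 1 are absorbing; state 2 is transient with a self-loop.
T = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.2, 0.3, 0.5]])
I = np.eye(3)

# Eigenvalues are 1 (twice, index one) and 0.5 (index one).
T_1 = (T - 0.5 * I) / (1.0 - 0.5)

alpha = np.array([0.1, 0.2, 0.7])     # arbitrary initial distribution
pi_alpha = alpha @ T_1                # expected stationary distribution

# Cross-check against brute-force iteration of the dynamic.
long_run = alpha @ np.linalg.matrix_power(T, 200)
print(pi_alpha, long_run)
```

Unlike the ergodic case, `pi_alpha` depends on `alpha`: the projector records how the transient probability mass is split between the two stationary components.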

When the stochastic process is ergodic (one stationary component), then $a_{1}=1$ and there is only one stationary distribution $\pi$ . The $T_{1}$ projection operator becomes:

[TABLE]

even if there are deterministic periodicities. Deterministic periodicities imply that different initial conditions may still induce different asymptotic oscillations, according to $\{T_{\lambda}:|\lambda|=1\}$ . In the case of ergodic processes without deterministic periodicities, every initial condition relaxes to the same steady-state distribution over the hidden states: $\bra{\pi_{\alpha}}=\bra{\alpha}T_{1}=\bra{\pi}$ regardless of $\alpha$ , so long as $\alpha$ is a properly normalized probability distribution.

VIII Spectra by inspection

As suggested in Ref. [4], the new results above extend spectral theory to arbitrary functions of nondiagonalizable operators in a way that gives a spectral weighted digraph theory beyond the purview of spectral graph theory proper [62]. Moreover, this enables new analyses. The next sections show how spectra and eigenprojectors can be intuited, computed, and applied in the analysis of complex systems.

VIII.1 Eigenvalues

Consider a directed graph structure with cascading dependencies: one cluster of nodes feeds back only to itself according to matrix $A$ and feeds forward to another cluster of nodes according to matrix $B$ , which is not necessarily a square matrix. The second cluster feeds back only to itself according to matrix $C$ . The latter node cluster might also feed forward to another cluster, but such considerations can be applied iteratively.

The simple situation just described is summarized, with proper index permutation, by a block matrix of the form: $W=\begin{bmatrix}A&B\\ \bm{0}&C\end{bmatrix}$ . In this case, it is easy to see that:

[TABLE]

And so, $\Lambda_{W}=\Lambda_{A}\cup\Lambda_{C}$ . This simplification presents an opportunity to read off eigenvalues from clustered graph structures that often appear in practice, especially for transient graph structures associated with transient causal states in $\epsilon$ -machines.
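This can be checked numerically on a small block upper-triangular matrix. The blocks below are illustrative choices (with a nonsquare $B$); the second matrix `W_other` replaces $B$ to show that the feedforward coupling does not affect the spectrum.

```python
import numpy as np

A = np.array([[0.2, 0.3],
              [0.1, 0.4]])
B = np.array([[0.5, 0.0, 0.0],
              [0.0, 0.25, 0.25]])        # feeds cluster A forward into cluster C
C = np.array([[0.3, 0.7, 0.0],
              [0.6, 0.4, 0.0],
              [0.0, 0.2, 0.8]])
zeros = np.zeros((3, 2))

W = np.block([[A, B], [zeros, C]])
W_other = np.block([[A, 7.0 * np.ones((2, 3))], [zeros, C]])

spectrum_W = np.sort_complex(np.linalg.eigvals(W))
spectrum_union = np.sort_complex(
    np.concatenate([np.linalg.eigvals(A), np.linalg.eigvals(C)]))
print(spectrum_W)
print(spectrum_union)
```

The two printed spectra agree, so the eigenvalues of each cluster can be computed separately and then pooled.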

Cyclic cluster structures (say, of length $N$ and edge-weights $\alpha_{1}$ through $\alpha_{N}$ ) yield especially simple spectra:

[TABLE]

That is, the eigenvalues are simply the $N^{\text{th}}$ roots of the product of all of the edge-weights. See Fig. 2a.
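The cyclic rule is easy to confirm numerically. In this sketch (the edge weights are arbitrary illustrative values) a directed cycle of length $N=4$ is encoded as a weighted adjacency matrix, and each eigenvalue raised to the $N^{\text{th}}$ power is compared with the product of the edge weights.

```python
import numpy as np

weights = np.array([0.5, 0.8, 0.25, 1.0])   # alpha_1 ... alpha_N
N = len(weights)

# Weighted adjacency matrix of the cycle 0 -> 1 -> 2 -> 3 -> 0.
W = np.zeros((N, N))
for i, a in enumerate(weights):
    W[i, (i + 1) % N] = a

eigvals = np.linalg.eigvals(W)
product = weights.prod()

print(eigvals ** N)   # each entry should be close to the product of weights
print(product)
```

Equivalently, the characteristic polynomial is $\lambda^{N}-\prod_{i}\alpha_{i}$, so all eigenvalues share the modulus $\left(\prod_{i}\alpha_{i}\right)^{1/N}$ and are equally spaced in phase.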

Similar rules for reading off spectra from other cluster structures exist. Although we cannot list them exhaustively here, we give another simple but useful rule in Fig. 2b. It also indicates the ubiquity of nondiagonalizability in weighted digraph structures. This second rule is suggestive of further generalizations where spectra can be read off from common digraph motifs.

VIII.2 Eigenprojectors from graph structure

We just outlined how clustered directed graph structures yield simplified joint spectra. Is there a corresponding simplification of the projection operators? In fact, there is, and it leads to an iterative construction of “higher-level” projectors from “lower-level” clustered components. In contrast to the joint spectrum, though, which completely ignores the feedforward matrix $B$, the emergent projectors do require $B$ to pull the associated eigencontributions into the generalized setting. Figure 3 summarizes the results for the simple case of nondegenerate eigenvalues. The general case is constructed similarly.

The preceding results imply a number of algorithms, both for analytic and numerical calculations. Most directly, eigenanalysis can be partitioned into a series of simpler problems whose results are later combined into a final solution. Beyond more efficient serial computation, there are also opportunities to parallelize the numerical algorithms that compute the eigenprojectors, whether they are computed directly, say from Eq. (14), or from right and left eigenvectors and generalized eigenvectors. Such automation is useful for applying our analysis to real systems with immense data produced from very high-dimensional state spaces.

IX Conclusion

Surprisingly, many questions we ask about a structured stochastic nonlinear process imply a linear dynamic over a preferred hidden state space. These questions often concern predictability and prediction. To make predictions about the real world, though, it is not sufficient to have a model of the world. Additionally, the predictor must synchronize their model to the real-world data that has been observed up to the present time. This metadynamic of synchronization—the transition structure among belief states—is intrinsically linear, but is typically nondiagonalizable.

Recall the organizational tables from the Introduction. After all of the intervening detail, let’s consider a more nuanced formulation. We saw that once we frame our questions in terms of the hidden linear transition dynamic, complexity measures are usually either of the cascading or accumulation type. Scalar complexity measures often accumulate only the interesting transient structure that rides on top of the asymptotics. Skimming off the asymptotics led to a Drazin inverse. Modified accumulation turns complexity scalars into complexity functions. This is summarized in Table 3 and Table 4. It is notable that Table 4 gives closed-form formulae for many complexity measures that previously were only expressed as infinite sums over functions of probabilities.

Let us remind ourselves: Why, in this analysis, were nondiagonalizable dynamics noteworthy? They are noteworthy since the metadynamics of diagonalizable dynamics are generically nondiagonalizable—typically due to the zero-eigenvalue subspace that is responsible for the initial, ephemeral epoch of symmetry collapse. We saw this explicitly with the metadynamics of transitioning between belief states. However, other metadynamics beyond that focused on prediction are also generically nondiagonalizable. For example, in the analysis of quantum compression, crypticity, and other aspects of hidden structure, the relevant linear dynamic is not the MSP, but is nevertheless a nondiagonalizable structure that is fruitfully analyzed with the recently generalized spectral theory of nonnormal operators [4].

Using the appropriate dynamic for common complexity questions and the meromorphic functional calculus to overcome nondiagonalizability, the sequel (Part II) goes on to develop closed-form expressions for complexity measures as simple functions of the corresponding transition dynamic of the implied HMM.

Acknowledgments

JPC thanks the Santa Fe Institute for its hospitality. The authors thank Chris Ellison, Ryan James, John Mahoney, Alec Boyd, and Dowman Varn for helpful discussions. This material is based upon work supported by, or in part by, the U. S. Army Research Laboratory and the U. S. Army Research Office under contract numbers W911NF-12-1-0234, W911NF-13-1-0340, and W911NF-13-1-0390.

Bibliography


[1] J. P. Crutchfield and D. P. Feldman. Regularities unseen, randomness observed: Levels of entropy convergence. CHAOS, 13(1):25–54, 2003.
[2] S. E. Marzen and J. P. Crutchfield. Nearly maximally predictive features and their dimensions. Phys. Rev. E, in press, 2017. arxiv.org:1702.08565.
[3] J. P. Crutchfield, C. J. Ellison, and P. M. Riechers. Exact complexity: The spectral decomposition of intrinsic computation. Phys. Lett. A, 380(9):998–1002, 2016.
[4] P. M. Riechers and J. P. Crutchfield. Beyond the spectral theorem: Decomposing arbitrary functions of nondiagonalizable operators. arxiv.org:1607.06526.
[5] While we follow Shannon [12] in this, it differs from the more widely used state-labeled HMMs.
[6] C. Moore and J. P. Crutchfield. Quantum automata and quantum grammars. Theoret. Comp. Sci., 237:1-2:275–306, 2000.
[7] L. A. Clark, W. Huang, T. M. Barlow, and A. Beige. Hidden quantum Markov models and open quantum systems with instantaneous feedback. New. J. Phys., 14:143–151, 2015.
[8] O. Penrose. Foundations of statistical mechanics; a deductive treatment. Pergamon Press, Oxford, 1970.
