Latent Representations of Dynamical Systems: When Two is Better Than One

Max Tegmark (MIT)

arXiv:1902.03364·physics.data-an·February 22, 2019

TL;DR

This paper demonstrates that using separate latent mappings for present and future states in dynamical systems prediction is theoretically optimal and outperforms traditional single-mapping methods, especially for non-time-reversible systems.

Contribution

It introduces a novel approach that employs two different latent representations for present and future, challenging the common single-mapping paradigm in dynamical system prediction.

Findings

01

Two-mapping approach outperforms PCA and single-mapping methods

02

Optimality of separate mappings for non-time-reversible systems

03

Illustrated with coupled harmonic oscillators with noise and dissipation

Abstract

A popular approach for predicting the future of dynamical systems involves mapping them into a lower-dimensional "latent space" where prediction is easier. We show that the information-theoretically optimal approach uses different mappings for present and future, in contrast to state-of-the-art machine-learning approaches where both mappings are the same. We illustrate this dichotomy by predicting the time-evolution of coupled harmonic oscillators with dissipation and thermal noise, showing how the optimal 2-mapping method significantly outperforms principal component analysis and all other approaches that use a single latent representation, and discuss the intuitive reason why two representations are better than one. We conjecture that a single latent representation is optimal only for time-reversible processes, not for e.g. text, speech, music or out-of-equilibrium physical systems.

Tables

Table 1: Data distillation: the relationship between Principal Component Analysis (PCA), Canonical Correlation Analysis (CCA), nonlinear autoencoders and nonlinear latent representations.

Random vectors	What is distilled?	Gaussian distribution	Non-Gaussian distribution
1	Entropy: $H({\bf x})=\sum H(u_{i})$	PCA: ${\bf u}={\bf F}{\bf x}$	Autoencoder: ${\bf u}=f({\bf x})$
2	Mutual information: $I({\bf x},{\bf y})=\sum I(u_{i},v_{i})$	CCA: ${\bf u}={\bf F}{\bf x}$, ${\bf v}={\bf G}{\bf y}$	Latent reps: ${\bf u}=f({\bf x})$, ${\bf v}=g({\bf y})$

Equations

$${\bf x}=\begin{pmatrix}{\bf x}_{1}\\ {\bf x}_{2}\end{pmatrix}.$$

$${\bf T}=\langle{\bf x}{\bf x}^{t}\rangle=\begin{pmatrix}{\bf C}_{1}&{\bf B}\\ {\bf B}^{t}&{\bf C}_{2}\end{pmatrix}.$$

$$I({\bf x}_{1},{\bf x}_{2})=\frac{1}{2}\log\frac{|{\bf C}_{1}|\,|{\bf C}_{2}|}{|{\bf T}|},$$

$${\bf C}_{1}={\bf U}_{1}{\bf\Lambda}_{1}{\bf U}_{1}^{t},\quad{\bf C}_{2}={\bf U}_{2}{\bf\Lambda}_{2}{\bf U}_{2}^{t},$$

$${\bf P}\equiv\begin{pmatrix}s({\bf\Lambda}_{1}){\bf U}_{1}^{t}&{\bf 0}\\ {\bf 0}&s({\bf\Lambda}_{2}){\bf U}_{2}^{t}\end{pmatrix},$$

$$s(\lambda)=\begin{cases}0&\text{if }|\lambda|<\epsilon\\ \lambda^{-1/2}&\text{otherwise}\end{cases}$$

$${\bf T}^{\prime}\equiv{\bf P}{\bf T}{\bf P}^{t}=\begin{pmatrix}{\bf I}&{\bf 0}&{\bf Q}&{\bf 0}\\ {\bf 0}&{\bf 0}&{\bf 0}&{\bf 0}\\ {\bf Q}^{t}&{\bf 0}&{\bf I}&{\bf 0}\\ {\bf 0}&{\bf 0}&{\bf 0}&{\bf 0}\end{pmatrix},$$

$${\bf Q}={\bf U}{\bf R}{\bf V}^{t},$$

$${\bf D}=\begin{pmatrix}{\bf U}^{t}&{\bf 0}&{\bf 0}&{\bf 0}\\ {\bf 0}&{\bf 0}&{\bf 0}&{\bf 0}\\ {\bf 0}&{\bf 0}&{\bf V}^{t}&{\bf 0}\\ {\bf 0}&{\bf 0}&{\bf 0}&{\bf 0}\end{pmatrix}{\bf P},$$

$${\bf D}{\bf T}{\bf D}^{t}=\begin{pmatrix}{\bf I}&{\bf 0}&{\bf R}&{\bf 0}\\ {\bf 0}&{\bf 0}&{\bf 0}&{\bf 0}\\ {\bf R}&{\bf 0}&{\bf I}&{\bf 0}\\ {\bf 0}&{\bf 0}&{\bf 0}&{\bf 0}\end{pmatrix}.$$

$${\bf x}^{\prime}=\begin{pmatrix}{\bf x}_{1}^{\prime}\\ {\bf x}_{2}^{\prime}\end{pmatrix}\equiv{\bf D}{\bf x},$$

$$I({\bf x}_{1},{\bf x}_{2})=-\frac{1}{2}\sum_{i}\log\left(1-r_{i}^{2}\right)$$

$${\bf u}\equiv{\bf F}{\bf x},\quad{\bf v}\equiv{\bf G}{\bf y}$$

$${\bf F}\equiv{\bf\Pi}_{k}{\bf U}^{t}{\bf\Lambda}_{1}^{-1/2}{\bf U}_{1}^{t},\quad{\bf G}\equiv{\bf\Pi}_{k}{\bf V}^{t}{\bf\Lambda}_{2}^{-1/2}{\bf U}_{2}^{t}$$

$$I({\bf u},{\bf v})\geq I[f({\bf x}),g({\bf y})]$$

$$I({\bf x},{\bf y})=I(2x_{1}-x_{2},y_{1})=\frac{\log\frac{3}{2}}{\log 4}\approx 0.29\text{ bits}.$$

$$I({\bf x},{\bf y})\ldots$$

$$\dot{\bf z}=\begin{pmatrix}\dot{\bf q}\\ \dot{\bf p}\end{pmatrix}={\bf B}\begin{pmatrix}{\bf q}\\ {\bf p}\end{pmatrix}={\bf B}{\bf z},\quad{\bf B}\equiv\begin{pmatrix}{\bf 0}&{\bf I}\\ -{\bf K}&-{\bf\Gamma}\end{pmatrix},$$

$${\bf z}(t)=e^{{\bf B}t}{\bf z}(0).$$

$${\bf z}_{i+1}={\bf A}{\bf z}_{i}+{\bf n}_{i},$$

$${\bf T}^{(k)}=\langle{\bf x}_{i}{\bf x}_{i+k}^{t}\rangle=\begin{pmatrix}{\bf C}&({\bf A}^{k}{\bf C})^{t}\\ {\bf A}^{k}{\bf C}&{\bf C}\end{pmatrix},$$

$$\begin{pmatrix}{\bf x}_{1}\\ {\bf x}_{2}\end{pmatrix}=\begin{pmatrix}{\bf A}_{1}\\ {\bf A}_{2}\end{pmatrix}{\bf s}+\begin{pmatrix}{\bf n}_{1}\\ {\bf n}_{2}\end{pmatrix},$$

$${\bf A}\equiv{\bf P}^{-1}{\bf D}^{-1}\begin{pmatrix}{\bf R}^{1/2}\\ {\bf R}^{1/2}\end{pmatrix},$$

$${\bf\Sigma}\equiv{\bf D}^{-1}\begin{pmatrix}{\bf I}-{\bf R}&{\bf 0}\\ {\bf 0}&{\bf I}-{\bf R}\end{pmatrix}{\bf D}^{-t}.$$

$$\langle{\bf x}{\bf x}^{t}\rangle=\ldots$$

Taxonomy

Topics: Time Series Analysis and Forecasting · Topic Modeling · Artificial Intelligence in Games

Full text

Latent Representations of Dynamical Systems: When Two is Better Than One

Max Tegmark

Dept. of Physics, MIT Kavli Institute & Center for Brains, Minds & Machines, Massachusetts Institute of Technology, Cambridge, MA 02139

I Introduction

A core challenge in physics (and in life quite generally) is data distillation: keeping only a manageably small fraction of our available data that nonetheless retains most of the information that is useful to us. Ideally, the information can be partitioned into a set of independent chunks and sorted from most to least useful, enabling us to select the number of chunks to retain so as to optimize our tradeoff between utility and data size.

Consider a random vector ${\bf x}$ , and partition its elements into two parts:

$${\bf x}=\begin{pmatrix}{\bf x}_{1}\\ {\bf x}_{2}\end{pmatrix}. \tag{1}$$

We may, for example, interpret the vectors ${\bf x}_{1}$ and ${\bf x}_{2}$ as observations of two separate systems at the same time, or as two separate observations of the same system some fixed time interval $\Delta t$ apart. Let us now consider various forms of ideal data distillation, as summarized in Table 1.

If we distill ${\bf x}$ as a whole, then we would ideally like to find a function $f$ such that the so-called latent representation ${\bf u}=f({\bf x})$ retains the full entropy $H({\bf x})=H({\bf u})=\sum H(u_{i})$ , decomposed into independent parts (see footnote 1) with vanishing mutual information: $I(u_{i},u_{j})=\delta_{ij}H(u_{i}).$ For the special case where ${\bf x}$ has a multivariate Gaussian distribution, the optimal solution is Principal Component Analysis (PCA) Pearson (1901), which has long been a workhorse of statistical physics and many other disciplines: here $f$ is simply a linear function mapping into the eigenbasis of the covariance matrix of ${\bf x}$ . The general case remains unsolved, and it is easy to see that it is hard: if ${\bf x}=c({\bf u})$ where $c$ implements some state-of-the-art cryptographic code, then finding $f=c^{-1}$ (to recover the independent pieces of information and discard the useless parts) would generically require breaking the code. Great progress has nonetheless been made for many special cases, using techniques such as nonlinear autoencoders Vincent et al. (2008) and Generative Adversarial Networks (GANs) Goodfellow et al. (2014).

Footnote 1: When implementing any distillation algorithm in practice, there is always a one-parameter tradeoff between compression and information retention which defines a Pareto frontier. A key advantage of the latent variables (or variable pairs) being statistically independent is that this allows the Pareto frontier to be trivially computed, by simply sorting them by decreasing information content and varying the number retained.
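The Gaussian case described above can be checked numerically. The following is a minimal NumPy sketch, not code from the paper: the covariance matrix `C` is an arbitrary example, and it uses the standard differential entropy of a Gaussian, $\frac{1}{2}\log_2[(2\pi e)^{n}|{\bf C}|]$ bits. It verifies that mapping into the eigenbasis of the covariance matrix decorrelates the components and that their individual entropies add up to the total.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 4))
C = M @ M.T + np.eye(4)            # arbitrary positive-definite example covariance

eigvals, U = np.linalg.eigh(C)     # C = U diag(eigvals) U^t
F = U.T                            # PCA map u = F x (rows are eigenvectors)
C_u = F @ C @ F.T                  # covariance of u; diagonal up to rounding

# Differential entropy (in bits) of an n-dimensional Gaussian with covariance cov
def gaussian_entropy_bits(cov):
    n = cov.shape[0]
    logdet = np.linalg.slogdet(cov)[1]
    return 0.5 * (n * np.log2(2 * np.pi * np.e) + logdet / np.log(2))

H_total = gaussian_entropy_bits(C)
# Sum of the entropies of the individual (independent) components u_i
H_parts = sum(0.5 * np.log2(2 * np.pi * np.e * lam) for lam in eigvals)
```

Because the components of ${\bf u}$ are independent here, sorting them by entropy and keeping the top few traces out the compression tradeoff mentioned in the footnote.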

Now consider the case where we wish to distill ${\bf x}_{1}$ and ${\bf x}_{2}$ separately, into ${\bf u}\equiv f({\bf x}_{1})$ and ${\bf v}=g({\bf x}_{2})$ , retaining the mutual information between the two parts. Then we ideally have $I({\bf x},{\bf y})=\sum I(u_{i},v_{i})$ , $I(u_{i},u_{j})=\delta_{ij}H(u_{i}),$ $I(v_{i},v_{j})=\delta_{ij}H(v_{i}),$ $I(u_{i},v_{j})=\delta_{ij}I(u_{i},v_{i}).$ This problem has attracted great interest, especially for time series where ${\bf x}_{1}={\bf z}_{i}$ and ${\bf x}_{2}={\bf z}_{j}$ for some sequence of states ${\bf z}_{k}$ ( $k=0,1,2,...$ ) in physics or other fields, where one typically maps the state vectors ${\bf z}_{i}$ into some lower-dimensional vectors $f({\bf z}_{i})$ , after which the prediction is carried out in this latent space. For the special case where ${\bf x}$ has a multivariate Gaussian distribution, the optimal solution is Canonical Correlation Analysis (CCA) Hotelling (1936): here both $f$ and $g$ are linear functions, computed via a singular-value decomposition (SVD) Eckart and Young (1936) of the cross-correlation matrix after prewhitening ${\bf x}_{1}$ and ${\bf x}_{2}$ . The general case remains unsolved, and is obviously even harder than the above-mentioned 1-vector autoencoding problem. The recent DeepMind paper Oord et al. (2018) reviews the state of the art and presents Contrastive Predictive Coding, a powerful new distillation technique for time series, following the long tradition of setting $f=g$ .

The purpose of this paper is to further investigate the case for choosing $f\neq g$ . We will do this by studying the lower-left quadrant of Table 1, where information-theoretically optimal results can be derived, and using these results to discuss implications for the harder problem in the lower-right quadrant. The rest of this paper is organized as follows. Section II discusses analytic results for the lower-left quadrant. In Section III, the optimal $f\neq g$ method is benchmarked on a physics example, showing significant improvement over $f=g$ methods. Our conclusions are discussed in Section IV.

II CCA implications for latent representations: two is better than one

II.1 Notation

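Without loss of generality, we take the random vector ${\bf x}$ from equation (1) to have vanishing mean $\langle{\bf x}\rangle=0$ , and write its covariance matrix as

$${\bf T}=\langle{\bf x}{\bf x}^{t}\rangle=\begin{pmatrix}{\bf C}_{1}&{\bf B}\\ {\bf B}^{t}&{\bf C}_{2}\end{pmatrix}. \tag{2}$$

Modeling the probability distribution of ${\bf x}$ as a multivariate Gaussian, the mutual information between ${\bf x}_{1}$ and ${\bf x}_{2}$ is

$$I({\bf x}_{1},{\bf x}_{2})=\frac{1}{2}\log\frac{|{\bf C}_{1}|\,|{\bf C}_{2}|}{|{\bf T}|}, \tag{3}$$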

where we take $\log$ to denote the logarithm in base 2 so that information is measured in bits.
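As an illustration of the Gaussian mutual information formula above, here is a minimal NumPy sketch (the function name `gaussian_mutual_information` and the two-variable example are assumptions made for illustration, not part of the paper). It evaluates the determinant ratio through log-determinants, which is numerically safer than forming the determinants directly.

```python
import numpy as np

def gaussian_mutual_information(T, n1):
    """Mutual information in bits between the first n1 components and the
    remaining components of a zero-mean Gaussian vector with covariance T."""
    C1 = T[:n1, :n1]                 # covariance of x1
    C2 = T[n1:, n1:]                 # covariance of x2
    # slogdet returns (sign, log|det|); we only need the log-determinant
    logdet = lambda M: np.linalg.slogdet(M)[1]
    return 0.5 * (logdet(C1) + logdet(C2) - logdet(T)) / np.log(2.0)

# Example: two unit-variance scalars with correlation coefficient rho
rho = 0.5
T_example = np.array([[1.0, rho],
                      [rho, 1.0]])
mi_example = gaussian_mutual_information(T_example, 1)
```

For two scalars the formula reduces to $-\frac{1}{2}\log(1-\rho^{2})$, which is the per-pair quantity that CCA sums over.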

As mentioned, PCA elegantly decomposes the information content in a single random vector ${\bf x}$ into a mutually exclusive and collectively exhaustive set of information chunks corresponding to statistically independent numbers (eigenmode coefficients) whose individual entropies add up to the total entropy. CCA generalizes this idea to mutual information, decomposing the total mutual information between ${\bf x}_{1}$ and ${\bf x}_{2}$ as a sum of the mutual information between a series of statistically independent pairs of numbers that are linear combinations of the two vectors, as summarized in Table 1 and in the following subsection.

II.2 CCA implementation

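To do this, CCA first diagonalizes ${\bf C}_{1}$ and ${\bf C}_{2}$ as

$${\bf C}_{1}={\bf U}_{1}{\bf\Lambda}_{1}{\bf U}_{1}^{t},\quad{\bf C}_{2}={\bf U}_{2}{\bf\Lambda}_{2}{\bf U}_{2}^{t}, \tag{4}$$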

where ${\bf U}_{1}$ and ${\bf U}_{2}$ are orthogonal matrices and ${\bf\Lambda}_{1}$ and ${\bf\Lambda}_{2}$ are diagonal, with the eigenvalues (which are non-negative up to numerical rounding errors) sorted in decreasing order. It then constructs a prewhitening matrix

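$${\bf P}\equiv\begin{pmatrix}s({\bf\Lambda}_{1}){\bf U}_{1}^{t}&{\bf 0}\\ {\bf 0}&s({\bf\Lambda}_{2}){\bf U}_{2}^{t}\end{pmatrix}, \tag{5}$$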

where the function $s(\lambda)\equiv\lambda^{-1/2}$ , and a function of a diagonal matrix is defined by applying it to each diagonal element. In many practical applications, some covariance matrix eigenvalues are near zero (and occasionally get evaluated as slightly negative due to numerical rounding errors), so below we implement CCA more robustly by instead defining

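$$s(\lambda)=\begin{cases}0&\text{if }|\lambda|<\epsilon\\ \lambda^{-1/2}&\text{otherwise}\end{cases} \tag{6}$$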

for some desired numerical precision floor $\epsilon$ . The matrix ${\bf P}$ transforms ${\bf T}$ into a block form

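$${\bf T}^{\prime}\equiv{\bf P}{\bf T}{\bf P}^{t}=\begin{pmatrix}{\bf I}&{\bf 0}&{\bf Q}&{\bf 0}\\ {\bf 0}&{\bf 0}&{\bf 0}&{\bf 0}\\ {\bf Q}^{t}&{\bf 0}&{\bf I}&{\bf 0}\\ {\bf 0}&{\bf 0}&{\bf 0}&{\bf 0}\end{pmatrix}, \tag{7}$$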

where the zero-rows correspond to the eigenvalues that are so tiny that we round them to zero, and the matrix ${\bf Q}$ can be interpreted as the Pearson correlation coefficients between the elements of the prewhitened vectors ${\bf x}_{1}$ and ${\bf x}_{2}$ . CCA now performs a singular-value decomposition (SVD) of ${\bf Q}$ Eckart and Young (1936):

$${\bf Q}={\bf U}{\bf R}{\bf V}^{t}, \tag{8}$$

where the matrices ${\bf U}$ and ${\bf V}$ are orthogonal, ${\bf R}=\hbox{diag}\,\{r_{i}\}$ is a diagonal matrix and $-1\leq r_{i}\leq 1$ . Defining

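$${\bf D}=\begin{pmatrix}{\bf U}^{t}&{\bf 0}&{\bf 0}&{\bf 0}\\ {\bf 0}&{\bf 0}&{\bf 0}&{\bf 0}\\ {\bf 0}&{\bf 0}&{\bf V}^{t}&{\bf 0}\\ {\bf 0}&{\bf 0}&{\bf 0}&{\bf 0}\end{pmatrix}{\bf P}, \tag{9}$$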

we can now transform our original covariance matrix ${\bf T}$ into the simple form

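$${\bf D}{\bf T}{\bf D}^{t}=\begin{pmatrix}{\bf I}&{\bf 0}&{\bf R}&{\bf 0}\\ {\bf 0}&{\bf 0}&{\bf 0}&{\bf 0}\\ {\bf R}&{\bf 0}&{\bf I}&{\bf 0}\\ {\bf 0}&{\bf 0}&{\bf 0}&{\bf 0}\end{pmatrix}. \tag{10}$$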

This means that when the random vector ${\bf x}$ is transformed into

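$${\bf x}^{\prime}=\begin{pmatrix}{\bf x}_{1}^{\prime}\\ {\bf x}_{2}^{\prime}\end{pmatrix}\equiv{\bf D}{\bf x},$$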

there are no correlations between any elements of ${\bf x}^{\prime}$ except between what we will term “principal pairs” (to emphasize the analogy with principal components), matching elements in ${\bf x}_{1}^{\prime}$ and ${\bf x}_{2}^{\prime}$ , which have a Pearson correlation coefficient $\langle({\bf x}_{1}^{\prime})_{i}({\bf x}_{2}^{\prime})_{i}\rangle=r_{i}\in[-1,1]$ .

Mutual information is independent of invertible reparametrizations of ${\bf x}_{1}$ and ${\bf x}_{2}$ , so the mutual information from equation (3) now simplifies to

$$I({\bf x}_{1},{\bf x}_{2})=-\frac{1}{2}\sum_{i}\log\left(1-r_{i}^{2}\right)$$

if there are no numerically negligible eigenvalues. If there are numerically negligible eigenvalues, the practically useful mutual information is given by this same formula, since the corresponding eigenmodes are numerically untrustworthy. This mutual information shared between ${\bf x}_{1}$ and ${\bf x}_{2}$ can be intuitively interpreted as stemming from a common source, as explained in Appendix A.
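The CCA recipe of this subsection (diagonalize, prewhiten, SVD) can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions rather than the author's implementation: the function name `cca`, the default `eps`, and the random test covariance are hypothetical, and eigenvalues below `eps` (including slightly negative ones) are all mapped to zero.

```python
import numpy as np

def cca(T, n1, eps=1e-12):
    """Linear CCA of a joint covariance T whose first n1 rows/columns belong to x1.
    Returns F, G (rows are latent directions) and the correlations r (decreasing)."""
    C1, C2, B = T[:n1, :n1], T[n1:, n1:], T[:n1, n1:]

    def prewhitener(C):
        lam, U = np.linalg.eigh(C)        # C = U diag(lam) U^t, ascending order
        lam, U = lam[::-1], U[:, ::-1]    # re-sort eigenvalues in decreasing order
        s = np.zeros_like(lam)
        keep = lam >= eps                 # assumption: tiny or negative eigenvalues -> 0
        s[keep] = lam[keep] ** -0.5
        return s[:, None] * U.T           # s(Lambda) U^t

    W1, W2 = prewhitener(C1), prewhitener(C2)
    Q = W1 @ B @ W2.T                     # correlations between the prewhitened vectors
    U, r, Vt = np.linalg.svd(Q)           # Q = U diag(r) V^t
    return U.T @ W1, Vt @ W2, r

rng = np.random.default_rng(1)
M = rng.normal(size=(6, 6))
T = M @ M.T + np.eye(6)                   # arbitrary positive-definite test covariance
n1 = 3
F, G, r = cca(T, n1)

C1, C2, B = T[:n1, :n1], T[n1:, n1:], T[:n1, n1:]
logdet = lambda A: np.linalg.slogdet(A)[1]
mi_from_determinants = 0.5 * (logdet(C1) + logdet(C2) - logdet(T)) / np.log(2)
mi_from_pairs = -0.5 * np.sum(np.log2(1 - r**2))   # sum over principal pairs
```

Keeping only the first `k` rows of `F` and `G` then gives a `k`-dimensional reduction built from the `k` most correlated principal pairs.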

II.3 One versus two latent representations

As mentioned, dimensionality reduction is a popular approach to predicting the future of a time series ${\bf z}_{i}$ ( $i=0,1,2,...$ ) in physics and other fields: the state vectors ${\bf z}_{i}$ are mapped into some lower-dimensional vectors $f({\bf z}_{i})$ by an invertible mapping $f$ that hopefully captures the most relevant information, after which the prediction is carried out in this latent space. This mapping can be either linear, such as in PCA or Independent Component Analysis Hyvärinen and Oja (2000), or non-linear as in autoencoders Vincent et al. (2008), Generative Adversarial Networks Goodfellow et al. (2014) or Contrastive Predictive Coding Oord et al. (2018).

For the special case of multivariate Gaussian probability distributions, the formulas above imply that CCA provides the optimal dimensionality reduction, with the twist that the mappings into the latent space should generally be different for the predictor vector and the predicted vector: we can define the CCA dimensionality reduction as the mapping

$${\bf u}\equiv{\bf F}{\bf x},\quad{\bf v}\equiv{\bf G}{\bf y}$$

into the latent space $\mathbb{R}^{k}$ , where

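$${\bf F}\equiv{\bf\Pi}_{k}{\bf U}^{t}{\bf\Lambda}_{1}^{-1/2}{\bf U}_{1}^{t},\quad{\bf G}\equiv{\bf\Pi}_{k}{\bf V}^{t}{\bf\Lambda}_{2}^{-1/2}{\bf U}_{2}^{t}$$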

and the projection matrix ${\mathbf{\Pi}}_{k}$ simply picks the first $k$ elements of the vector following it. The CCA construction above is readily seen to imply that

$$I({\bf u},{\bf v})\geq I[f({\bf x}),g({\bf y})]$$

for any functions $f$ and $g$ mapping ${\bf x}$ and ${\bf y}$ into $\mathbb{R}^{k}$ , and $I({\bf u},{\bf v})=I({\bf x},{\bf y})$ when the dimensionality is not reduced below that of both ${\bf x}$ and ${\bf y}$ .

Note that generically, ${\bf F}\neq{\bf G}$ , even for time series where ${\bf x}$ and ${\bf y}$ live in the same space. The following theorem shows that this is a feature, not a bug.

Theorem: A single latent representation sometimes underperforms two separate ones, capturing less mutual information in some given number of variables.

Proof: A simple counterexample (to the hypothesis that a single representation is equally good) is provided by four random variables $x_{1}$ , $x_{2}$ , $y_{1}$ , $y_{2}$ with unit variance and no correlations except that $\langle x_{1}x_{2}\rangle=\langle x_{1}y_{1}\rangle=1/2$ . The CCA described above shows that the mutual information can be entirely captured by a single principal pair $\{u_{1},v_{1}\}\equiv\{2x_{1}-x_{2},y_{1}\}$ ; specifically,

$$I({\bf x},{\bf y})=I(2x_{1}-x_{2},y_{1})=\frac{\log\frac{3}{2}}{\log 4}\approx 0.29\text{ bits}.$$

If we instead transform both ${\bf x}$ and ${\bf y}$ using a single latent representation, then the maximal mutual information we can attain from a single principal pair is smaller:

[TABLE]

as can be seen in Figure 1. Here we have without loss of generality taken ${\bf w}$ to be a unit vector ${\bf w}=\{\cos\theta,\sin\theta\}$ , since the mutual information above is invariant under rescaling ${\bf w}$ .
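The counterexample in the proof can be reproduced numerically. The sketch below is a hedged illustration, not the paper's code: the helper `pair_information_bits` and the grid over $\theta$ are assumptions. It evaluates the mutual information of the pair $\{2x_{1}-x_{2},y_{1}\}$ and compares it with the best value obtainable when the same unit vector ${\bf w}=\{\cos\theta,\sin\theta\}$ is applied to both ${\bf x}$ and ${\bf y}$.

```python
import numpy as np

# Covariance of (x1, x2, y1, y2): unit variances, <x1 x2> = <x1 y1> = 1/2
T = np.eye(4)
T[0, 1] = T[1, 0] = 0.5
T[0, 2] = T[2, 0] = 0.5
Cx, Cy, B = T[:2, :2], T[2:, 2:], T[:2, 2:]

def pair_information_bits(a, b):
    """Mutual information (bits) between the Gaussian scalars a.x and b.y."""
    r2 = (a @ B @ b) ** 2 / ((a @ Cx @ a) * (b @ Cy @ b))   # squared correlation
    return -0.5 * np.log2(1.0 - r2)

# Total mutual information from the determinant formula
logdet = lambda M: np.linalg.slogdet(M)[1]
total = 0.5 * (logdet(Cx) + logdet(Cy) - logdet(T)) / np.log(2)

# Two different mappings: u = 2 x1 - x2, v = y1
two_maps = pair_information_bits(np.array([2.0, -1.0]), np.array([1.0, 0.0]))

# One shared mapping w = (cos theta, sin theta), scanned over all directions
thetas = np.linspace(-np.pi / 2, np.pi / 2, 20001)
directions = np.column_stack([np.cos(thetas), np.sin(thetas)])
one_map_curve = np.array([pair_information_bits(w, w) for w in directions])
one_map = one_map_curve.max()
best_theta = thetas[one_map_curve.argmax()]
```

In this sketch `two_maps` equals the total mutual information, while `one_map` stays strictly below it for every direction, which is the content of the theorem.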

As mentioned, most published work uses merely a single latent representation for time series prediction. This is clearly not optimal for the general case, since we have just proven that it is not even optimal for the simple case of multivariate Gaussian distributions. But does this suboptimality really matter in practice? The suboptimality is seen to be only 0.05 bits in Figure 1, so it is natural to ask whether a single latent representation $f$ is generically close to optimal, or whether adding a second representation $g\neq f$ provides a large enough improvement to be worth the extra complication. We address this question in the next section by a practical application.

III Example: Coupled harmonic oscillators with dissipation and thermal noise

To better compare the predictive abilities of CCA with other approaches, let us now consider the physics problem of predicting the future state of a set of $n$ coupled 1-dimensional harmonic oscillators that are damped by friction and perturbed by random thermal noise. We group the positions and momenta of the oscillators into the $n$ -dimensional vectors ${\bf q}$ and ${\bf p}$ , which we in turn group into a single $2n$ -dimensional state vector ${\bf z}$ . We set all masses equal to unity, so $\dot{{\bf q}}={\bf p}$ , and take the laws of motion to be $\dot{\bf p}=-{\bf K}{\bf q}-{\bf\Gamma}{\bf p}$ for some positive semidefinite spring matrix ${\bf K}$ and friction matrix ${\bf\Gamma}$ . This means that we can write

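$$\dot{\bf z}=\begin{pmatrix}\dot{\bf q}\\ \dot{\bf p}\end{pmatrix}={\bf B}\begin{pmatrix}{\bf q}\\ {\bf p}\end{pmatrix}={\bf B}{\bf z},\quad{\bf B}\equiv\begin{pmatrix}{\bf 0}&{\bf I}\\ -{\bf K}&-{\bf\Gamma}\end{pmatrix},$$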

which has the solution

$${\bf z}(t)=e^{{\bf B}t}{\bf z}(0).$$

All eigenvalues of ${\bf B}$ have negative real parts, so to prevent ${\bf z}(t)$ from simply decaying toward zero, we add random Gaussian noise of standard deviation $\sigma$ to each position at every time step $\tau$ . Defining ${\bf z}_{i}\equiv{\bf z}(\tau i)$ , we can thus rewrite our time-evolution as a Markovian autoregressive process:

$${\bf z}_{i+1}={\bf A}{\bf z}_{i}+{\bf n}_{i}, \tag{18}$$

where ${\bf A}\equiv e^{{\bf B}\tau}$ , $\langle{\bf n}_{i}\rangle={\bf 0}$ , $\langle{\bf n}_{i}{\bf n}_{j}^{t}\rangle=\delta_{ij}{\bf\Sigma}$ , ${\bf\Sigma}_{kl}=\delta_{kl}\sigma^{2}$ if $k\leq n$ and zero otherwise (the standard deviation is $\sigma$ for position noise, zero for momentum noise). This random process will eventually converge to a stationary state whose probability distribution is time-independent, since all eigenvalues of ${\bf A}$ have magnitude below unity, so that memory of the past gets exponentially damped over time. Figure 2 shows an example for $n=10$ oscillators arranged in a circle. Here and below we use time step $\tau=1$ , noise level $\sigma=1$ , friction matrix ${\bf\Gamma}=\gamma{\bf I}$ with $\gamma=0.05$ , and spring matrix ${\bf K}$ corresponding to nearest-neighbor coupling $\alpha^{2}=0.2$ and self-coupling $\omega^{2}=0.01$ , i.e., $K_{ij}=\omega^{2}/2+\alpha^{2}$ if $i=j$ , $K_{ij}=-\alpha^{2}/2$ for nearest neighbors, and $K_{ij}=0$ otherwise.

Once stationarity has been attained, the mean $\langle{\bf z}\rangle$ vanishes and equation (18) implies that the time-independent covariance matrix ${\bf C}\equiv\langle{\bf z}_{i}{\bf z}_{i}^{t}\rangle$ satisfies ${\bf C}={\bf A}{\bf C}{\bf A}^{t}+{\bf\Sigma}$ . This is known as the Lyapunov equation, and is readily solved for ${\bf C}$ by special-purpose techniques or, rapidly enough, by simply iterating it to convergence.
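The setup above can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the author's code: the matrix exponential is computed with a simple scaling-and-squaring Taylor series (`expm_taylor` is a hypothetical helper written here to avoid extra dependencies), the sign convention $\dot{\bf p}=-{\bf K}{\bf q}-{\bf\Gamma}{\bf p}$ from the text is used, and the stationary covariance is found by plain iteration.

```python
import numpy as np

def expm_taylor(M, squarings=8, terms=16):
    """Matrix exponential via a truncated Taylor series with scaling and squaring."""
    X = M / 2.0**squarings
    E = term = np.eye(M.shape[0])
    for j in range(1, terms + 1):
        term = term @ X / j          # X^j / j!
        E = E + term
    for _ in range(squarings):
        E = E @ E                    # undo the scaling: exp(M) = exp(X)^(2^squarings)
    return E

# Parameter values quoted in the text
n, tau, sigma, gamma, alpha2, omega2 = 10, 1.0, 1.0, 0.05, 0.2, 0.01

shift = np.roll(np.eye(n), 1, axis=1)                     # cyclic nearest-neighbor shift
K = (omega2 / 2 + alpha2) * np.eye(n) - (alpha2 / 2) * (shift + shift.T)
B = np.block([[np.zeros((n, n)), np.eye(n)],
              [-K, -gamma * np.eye(n)]])
A = expm_taylor(B * tau)                                  # one-step propagator

Sigma = np.zeros((2 * n, 2 * n))
Sigma[:n, :n] = sigma**2 * np.eye(n)                      # noise on positions only

# Iterate C <- A C A^t + Sigma until it stops changing
C = Sigma.copy()
for _ in range(5000):
    C_next = A @ C @ A.T + Sigma
    converged = np.max(np.abs(C_next - C)) < 1e-12 * np.max(np.abs(C_next))
    C = C_next
    if converged:
        break

spectral_radius = np.max(np.abs(np.linalg.eigvals(A)))
```

The iteration converges because the spectral radius of ${\bf A}$ is below unity; for larger systems a dedicated discrete Lyapunov solver would be the more usual choice.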

We are interested in using the state vector ${\bf z}_{i}$ to predict the subsequent state ${\bf z}_{i+k}$ . Arranging these two $2n$ -dimensional vectors into a single $4n$ -dimensional vector ${\bf x}$ , we can now compute the 2-time covariance matrix of equation (2) by iterating equation (18) $k$ times:

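$${\bf T}^{(k)}=\langle{\bf x}_{i}{\bf x}_{i+k}^{t}\rangle=\begin{pmatrix}{\bf C}&({\bf A}^{k}{\bf C})^{t}\\ {\bf A}^{k}{\bf C}&{\bf C}\end{pmatrix},$$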

Figure 3 shows the mutual information $I({\bf x}_{i},{\bf x}_{i+10})$ between the state of our harmonic oscillators and their state $10$ time-steps later, as a function of the amount of dimensionality reduction performed. The CCA curve plotted is by definition the Pareto frontier for the tradeoff between compression and information retention, i.e., the maximum amount of information that can be collectively retained in a given number of pairs. The CCA curve is seen to be concave because all Pareto frontiers by construction have non-positive second derivative.

Three other dimensionality reduction methods are also plotted for comparison, and are all seen to perform significantly worse than CCA. These all use the same latent representation for both ${\bf x}_{i}$ and ${\bf x}_{i+10}$ . The PCA curve keeps the top principal components, while the “simple truncation” curve simply retains the first elements of the vectors ${\bf x}_{i}$ and ${\bf x}_{i+10}$ . The “shared latent rep” curve uses the CCA-matrix ${\bf F}$ to compress ${\bf x}_{i}$ , and uses the same matrix ${\bf F}$ (rather than ${\bf G}$ ) to compress ${\bf x}_{i+10}$ .
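The four methods compared above can be sketched end to end in NumPy. This is a hedged reconstruction of the comparison, not the code behind Figure 3: the helper names (`expm_taylor`, `retained_bits`), the fixed iteration count, and the assumption that the stationary covariance has no negligible eigenvalues are all choices made here for illustration, and no specific curve values are claimed.

```python
import numpy as np

def expm_taylor(M, squarings=8, terms=16):
    """Matrix exponential via a truncated Taylor series with scaling and squaring."""
    X = M / 2.0**squarings
    E = term = np.eye(M.shape[0])
    for j in range(1, terms + 1):
        term = term @ X / j
        E = E + term
    for _ in range(squarings):
        E = E @ E
    return E

# Oscillator ring with the parameter values quoted in the text (tau = sigma = 1)
n, k, gamma, alpha2, omega2 = 10, 10, 0.05, 0.2, 0.01
shift = np.roll(np.eye(n), 1, axis=1)
K = (omega2 / 2 + alpha2) * np.eye(n) - (alpha2 / 2) * (shift + shift.T)
Bmat = np.block([[np.zeros((n, n)), np.eye(n)], [-K, -gamma * np.eye(n)]])
A = expm_taylor(Bmat)
Sigma = np.diag(np.r_[np.ones(n), np.zeros(n)])      # position noise only

C = Sigma.copy()
for _ in range(3000):                                # iterate C = A C A^t + Sigma
    C = A @ C @ A.T + Sigma
X = C @ np.linalg.matrix_power(A, k).T               # <z_i z_{i+k}^t>

# CCA: prewhiten both halves with the same C, then SVD the whitened cross-covariance
lam, E = np.linalg.eigh(C)                           # assumes all eigenvalues > 0
W = (lam ** -0.5)[:, None] * E.T
U, r, Vt = np.linalg.svd(W @ X @ W.T)
F, G = U.T @ W, Vt @ W

P = E[:, ::-1].T                                     # PCA rows, largest variance first
I2n = np.eye(2 * n)                                  # simple truncation

def retained_bits(Fm, Gm):
    """Gaussian mutual information (bits) between u = Fm z_i and v = Gm z_{i+k}."""
    Cu, Cv, Cuv = Fm @ C @ Fm.T, Gm @ C @ Gm.T, Fm @ X @ Gm.T
    Tuv = np.block([[Cu, Cuv], [Cuv.T, Cv]])
    ld = lambda M: np.linalg.slogdet(M)[1]
    return 0.5 * (ld(Cu) + ld(Cv) - ld(Tuv)) / np.log(2)

methods = {"CCA": (F, G), "shared latent rep": (F, F),
           "PCA": (P, P), "simple truncation": (I2n, I2n)}
curves = {name: np.array([retained_bits(Fm[:m], Gm[:m]) for m in range(1, 2 * n + 1)])
          for name, (Fm, Gm) in methods.items()}
```

Each entry of `curves` holds the retained mutual information as a function of the number of latent pairs kept, so plotting them against `range(1, 2 * n + 1)` gives a comparison in the spirit of Figure 3.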

If each pair of numbers $\{u_{i},v_{i}\}$ contained the same fraction of the total mutual information, then the Pareto frontier would be a straight line. The performance of the non-CCA methods is seen to be even worse in our example, closer to a parabola (dashed line). This is what one expects from any method where each number $u_{i}$ contains a random fraction $1/N$ of the total information, and the same holds for each number $v_{i}$ : then the mutual information $I(u_{i},v_{j})=1/N^{2}$ , and the information fraction shared by $N^{\prime}$ numbers $u_{i}$ and $N^{\prime}$ numbers $v_{i}$ is $(N^{\prime}/N)^{2}$ – a parabola.

IV Conclusion

It is often useful to map data vectors into a lower-dimensional latent space, retaining only the information of interest, as summarized in Table 1. For linear mappings $f$ and $g$ , the natural generalization of PCA is CCA, which distills two random vectors ${\bf x}$ and ${\bf y}$ into linear transformations ${\bf u}={\bf F}{\bf x}$ and ${\bf v}={\bf G}{\bf y}$ such that all components of both vectors are uncorrelated, except that matching “principal pairs” have correlation $r_{i}\in[-1,1]$ . For Gaussian random vectors, CCA conveniently decomposes the total mutual information between ${\bf x}$ and ${\bf y}$ as the sum of the mutual information $-\frac{1}{2}\log(1-r_{i}^{2})$ between these principal pairs. Retention of only the $k$ most informative pairs thus falls on the Pareto frontier of optimal dimensionality reduction.

There is strong current interest in how to best generalize this to nonlinear mappings optimized for non-Gaussian random vectors. Most recent work for non-linear time-series prediction (see Oord et al. (2018) and references therein) focuses on the special case $f=g$ . As we have explored in detail, this can be far from optimal, even in the linear case. In the linear case, the reason why two different latent representations ${\bf F}$ and ${\bf G}$ are better than one ultimately traces back to the fact that the SVD in equation (8) produces different matrices ${\bf U}\neq{\bf V}$ when ${\bf Q}$ is asymmetric. For our harmonic oscillator example (or any physics time series whatsoever, for that matter), this asymmetry corresponds to time reversal asymmetry: the operation to predict the future state from the past is not the same as that predicting the past from the future. In our example, this asymmetry can be eliminated by ignoring all momentum information: Repeating CCA to predict ${\bf q}_{i+k}$ from ${\bf q}_{i}$ (as opposed to working with ${\bf x}_{i}$ which includes both ${\bf q}_{i}$ and ${\bf p}_{i}$ ), we obtain ${\bf F}={\bf G}$ .

A natural conjecture for future work to investigate is that this generalizes to non-Gaussian time series where the optimal dimensionality reduction is nonlinear: that a single latent representation suffices for reversible Markov chains and other reversible processes, while a pair of representations performs better for processes that are truly different in reverse, for example text, speech, music or out-of-equilibrium physical systems.

Figure 4 motivates this conjecture: when the evolution of a dynamical system is not time-reversible, then the information about its state that is useful for predicting what will happen is often different from that which is useful for predicting what happened. Consider, for example, a table with an orange and a pencil balanced on its tip, both isolated from the rest of the world and seemingly at rest. If we are interested in predicting the future, we should pay more attention to the orange, since it will stay put while the pencil will tip over in less than 30 seconds in a direction that we cannot predict, as quantum-mechanical fluctuations get amplified by gravity. If we are interested in inferring the past, we should instead focus on the pencil, which was almost certainly at least as balanced then as it is now. The orange, on the other hand, is in a stable attractor state: it could either have just sat there, or slid/rolled and come to rest due to dissipation. In physics examples such as these, whether information is predictable, predictive or both thus depends on attractor dynamics and Lyapunov exponents: degrees of freedom in stable equilibria (with negative Lyapunov exponents) are predictive but unpredictable, while those in unstable equilibria (with positive Lyapunov exponents) are unpredictive but predictable.

The causally irrelevant information, which helps with neither, should obviously be discarded in latent representations; if most of the information is in this category, then non-causal approaches such as PCA, ICA and nonlinear autoencoders may mistakenly retain much of this information if it is easy to distill into a small number of variables. Renormalization in physics provides an example of a single latent representation where the vast majority of the information (typically information on very small scales/high frequencies) is discarded, while both the longer-term predictive and predictable information is retained. In other cases, the predictable and predictive parts of the information may be closer to disjoint, in which case switching to two separate representations offers the opportunity of cutting the latent space dimensionality almost in half. An interesting challenge for future work is therefore to explore whether approaches such as Oord et al. (2018) can be further improved by using more than one latent representation, or by developing a single one that better optimizes for both predictiveness and predictability.

Acknowledgements: The author wishes to thank Tailin Wu for helpful comments, and for suggesting Figure 4 and the idea behind it. This work was supported by The Casey and Family Foundation, the Foundational Questions Institute, and by Theiss Research through TWCF grant #0322.

Appendix A Distillation interpretation

The following theorem allows an intuitive interpretation of correlation as stemming from a common source.

Theorem: Any random vector pair can be decomposed as the sum of a perfectly correlated part (“signal”) and a perfectly uncorrelated part (“noise”):

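$$\begin{pmatrix}{\bf x}_{1}\\ {\bf x}_{2}\end{pmatrix}=\begin{pmatrix}{\bf A}_{1}\\ {\bf A}_{2}\end{pmatrix}{\bf s}+\begin{pmatrix}{\bf n}_{1}\\ {\bf n}_{2}\end{pmatrix},$$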

where $\langle{\bf s}{\bf s}^{t}\rangle={\bf I}$ , $\langle{\bf n}_{1}{\bf n}_{2}^{t}\rangle={\bf 0}$ , $\langle{\bf s}{\bf n}_{1}^{t}\rangle={\bf 0}$ and $\langle{\bf s}{\bf n}_{2}^{t}\rangle={\bf 0}$ .

Proof: Writing the last equation as ${\bf x}={\bf A}{\bf s}+{\bf n}$ , we have $\langle{\bf n}\rangle={\bf 0}$ , and $\langle{\bf s}{\bf n}^{t}\rangle={\bf 0}$ . Define ${\bf\Sigma}\equiv\langle{\bf n}{\bf n}^{t}\rangle$ . If ${\bf x}_{1}$ and ${\bf x}_{2}$ have the same length and no eigenvalues of ${\bf C}_{1}$ or ${\bf C}_{2}$ vanish, then define

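$${\bf A}\equiv{\bf P}^{-1}{\bf D}^{-1}\begin{pmatrix}{\bf R}^{1/2}\\ {\bf R}^{1/2}\end{pmatrix},\quad{\bf\Sigma}\equiv{\bf D}^{-1}\begin{pmatrix}{\bf I}-{\bf R}&{\bf 0}\\ {\bf 0}&{\bf I}-{\bf R}\end{pmatrix}{\bf D}^{-t}.$$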

It follows that

[TABLE]

as required, where we used equation (10) in the penultimate step. The case with vanishing eigenvalues and/or unequal length of ${\bf x}_{1}$ and ${\bf x}_{2}$ follows straightforwardly from zero-padding appropriately.

We can thus interpret ${\bf s}$ as the distillation of all the correlated information in ${\bf x}_{1}$ and ${\bf x}_{2}$ , which is shared but diluted by noise.

Bibliography

1. Pearson (1901): K. Pearson, The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2, 559 (1901).
2. Vincent et al. (2008): P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, in Proceedings of the 25th International Conference on Machine Learning (ACM, 2008), pp. 1096–1103.
3. Goodfellow et al. (2014): I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, in Advances in Neural Information Processing Systems (2014), pp. 2672–2680.
4. Hotelling (1936): H. Hotelling, Biometrika 28, 321 (1936).
5. Oord et al. (2018): A. v. d. Oord, Y. Li, and O. Vinyals, arXiv preprint arXiv:1807.03748 (2018).
6. Hyvärinen and Oja (2000): A. Hyvärinen and E. Oja, Neural Networks 13, 411 (2000).
7. Eckart and Young (1936): C. Eckart and G. Young, Psychometrika 1, 211 (1936).