Time-varying Autoregression with Low Rank Tensors

Kameron Decker Harris; Aleksandr Aravkin; Rajesh Rao; Bingni Wen; Brunton

arXiv:1905.08389·stat.ML·May 21, 2020

Time-varying Autoregression with Low Rank Tensors

Kameron Decker Harris, Aleksandr Aravkin, Rajesh Rao, Bingni Wen, Brunton

PDF

2 Repos

TL;DR

This paper introduces a scalable, low-rank tensor approach for learning interpretable, time-varying autoregressive models from multivariate time series, capable of handling high-dimensional data and incorporating priors.

Contribution

The paper proposes a novel windowed, tensor-based method for modeling time-varying autoregressive systems, with proven convergence and demonstrated effectiveness on synthetic and real data.

Findings

01

Successfully identifies true rank in noisy synthetic data.

02

Outperforms existing methods in scalability and accuracy.

03

Effectively models diverse real-world datasets.

Abstract

We present a windowed technique to learn parsimonious time-varying autoregressive models from multivariate timeseries. This unsupervised method uncovers interpretable spatiotemporal structure in data via non-smooth and non-convex optimization. In each time window, we assume the data follow a linear model parameterized by a system matrix, and we model this stack of potentially different system matrices as a low rank tensor. Because of its structure, the model is scalable to high-dimensional data and can easily incorporate priors such as smoothness over time. We find the components of the tensor using alternating minimization and prove that any stationary point of this algorithm is a local minimum. We demonstrate on a synthetic example that our method identifies the true rank of a switching linear system in the presence of noise. We illustrate our model's utility and superior scalability…

Tables1

Table 1. Table 1: TVART parameters

Problem name	$N$	$M$	$T$	$R$	$η$	$β$	$ℛ$	affine?
Switching linear	6–4000	20	10	4, 6	$1 / N$	5	TV	no
Smoothly varying linear	6–4000	1	160	4	$6 / N$	600 $\log_{10}^{2} (N)$	Spline	no
Worm behavior	4	6	33	6	0.05	6	TV	yes
Sea surface temperature	1259	9	171	6	$10^{- 3}$	$10^{4}$	Spline	no
Neural activity	64	200	1999	8	1	100	TV	no

Equations56

\mathbf{X}_{1}=\left[\begin{array}[]{cccc}|&|&&|\\ \bm{x}(1)&\bm{x}(2)&\ldots&\bm{x}(M)\\ |&|&&|\end{array}\right]\quad\mbox{and}\quad\mathbf{Y}_{1}=\left[\begin{array}[]{cccc}|&|&&|\\ \bm{x}(2)&\bm{x}(3)&\ldots&\bm{x}(M+1)\\ |&|&&|\end{array}\right].

\mathbf{X}_{1}=\left[\begin{array}[]{cccc}|&|&&|\\ \bm{x}(1)&\bm{x}(2)&\ldots&\bm{x}(M)\\ |&|&&|\end{array}\right]\quad\mbox{and}\quad\mathbf{Y}_{1}=\left[\begin{array}[]{cccc}|&|&&|\\ \bm{x}(2)&\bm{x}(3)&\ldots&\bm{x}(M+1)\\ |&|&&|\end{array}\right].

A : rank (A) = R min \frac{1}{2} k = 1 \sum T ∥ Y_{k} - A_{k} X_{k} ∥_{F}^{2} .

A : rank (A) = R min \frac{1}{2} k = 1 \sum T ∥ Y_{k} - A_{k} X_{k} ∥_{F}^{2} .

A = r = 1 \sum R u_{r}^{(1)} \circ u_{r}^{(2)} \circ u_{r}^{(3)}

A = r = 1 \sum R u_{r}^{(1)} \circ u_{r}^{(2)} \circ u_{r}^{(3)}

{\mathbf{U}^{(i)}}=\left[\begin{array}[]{ccc}|&&|\\ \mathbf{u}^{(i)}_{1}&\ldots&\mathbf{u}^{(i)}_{R}\\ |&&|\end{array}\right].

{\mathbf{U}^{(i)}}=\left[\begin{array}[]{ccc}|&&|\\ \mathbf{u}^{(i)}_{1}&\ldots&\mathbf{u}^{(i)}_{R}\\ |&&|\end{array}\right].

A_{k} = U^{(1)} D^{(k)} U^{(2)}^{⊺}, \mbox w h er e D^{(k)} = diag (u_{k :}^{(3)}) .

A_{k} = U^{(1)} D^{(k)} U^{(2)}^{⊺}, \mbox w h er e D^{(k)} = diag (u_{k :}^{(3)}) .

L (U^{(1)}, U^{(2)}, U^{(3)}) = \frac{1}{2} k = 1 \sum T ∥ Y_{k} - U^{(1)} D^{(k)} U^{(2)}^{⊺} X_{k} ∥_{F}^{2} .

L (U^{(1)}, U^{(2)}, U^{(3)}) = \frac{1}{2} k = 1 \sum T ∥ Y_{k} - U^{(1)} D^{(k)} U^{(2)}^{⊺} X_{k} ∥_{F}^{2} .

\frac{\partial}{\partial B} \frac{1}{2} ∥ Y - A B^{⊺} C ∥_{F}^{2}

\frac{\partial}{\partial B} \frac{1}{2} ∥ Y - A B^{⊺} C ∥_{F}^{2}

= - C Y^{⊺} A + C C^{⊺} B A^{⊺} A .

\frac{\partial L}{\partial U ^{(1)}} = U^{(1)} (k = 1 \sum T D^{(k)} U^{(2)}^{⊺} X_{k} X_{k}^{⊺} U^{(2)} D^{(k)}) - (k = 1 \sum T Y_{k} X_{k}^{⊺} U^{(2)} D^{(k)}) .

\frac{\partial L}{\partial U ^{(1)}} = U^{(1)} (k = 1 \sum T D^{(k)} U^{(2)}^{⊺} X_{k} X_{k}^{⊺} U^{(2)} D^{(k)}) - (k = 1 \sum T Y_{k} X_{k}^{⊺} U^{(2)} D^{(k)}) .

\frac{\partial L}{\partial U ^{(2)}} = k = 1 \sum T X_{k} X_{k}^{⊺} U^{(2)} D^{(k)} U^{(1)}^{⊺} U^{(1)} D^{(k)} - k = 1 \sum T X_{k} Y_{k}^{⊺} U^{(1)} D^{(k)},

\frac{\partial L}{\partial U ^{(2)}} = k = 1 \sum T X_{k} X_{k}^{⊺} U^{(2)} D^{(k)} U^{(1)}^{⊺} U^{(1)} D^{(k)} - k = 1 \sum T X_{k} Y_{k}^{⊺} U^{(1)} D^{(k)},

\frac{\partial L}{\partial U ^{(2)}} = k = 1 \sum T L_{k} U^{(2)} R_{k} - B,

\frac{\partial L}{\partial U ^{(2)}} = k = 1 \sum T L_{k} U^{(2)} R_{k} - B,

\frac{\partial L}{\partial U ^{(2)}} = 0 ⟺ k = 1 \sum T L_{k} U^{(2)} R_{k} = B .

\frac{\partial L}{\partial U ^{(2)}} = 0 ⟺ k = 1 \sum T L_{k} U^{(2)} R_{k} = B .

vec (\frac{\partial L}{\partial U ^{(2)}}) = 0 ⟺ (k = 1 \sum T R_{k}^{⊺} \otimes L_{k}) vec (U^{(2)}) = vec (B) .

vec (\frac{\partial L}{\partial U ^{(2)}}) = 0 ⟺ (k = 1 \sum T R_{k}^{⊺} \otimes L_{k}) vec (U^{(2)}) = vec (B) .

\frac{\partial L}{\partial D ^{(k)}}

\frac{\partial L}{\partial D ^{(k)}}

= (L_{k}^{'} D^{(k)} R_{k}^{'} - U^{(2)}^{⊺} X_{k} Y_{k}^{⊺} U^{(1)}) * I,

(L_{k}^{'} D^{(k)} R_{k}^{'})_{ii} = ((L_{k}^{'} * R_{k}^{'}) (u_{k :}^{(3)})^{⊺})_{i},

(L_{k}^{'} D^{(k)} R_{k}^{'})_{ii} = ((L_{k}^{'} * R_{k}^{'}) (u_{k :}^{(3)})^{⊺})_{i},

vecdiag (\frac{\partial L}{\partial D ^{(k)}}) = (L_{k}^{'} * R_{k}^{'}) (u_{k :}^{(3)})^{⊺} - vecdiag (U^{(2)}^{⊺} X_{k} Y_{k}^{⊺} U^{(1)}),

vecdiag (\frac{\partial L}{\partial D ^{(k)}}) = (L_{k}^{'} * R_{k}^{'}) (u_{k :}^{(3)})^{⊺} - vecdiag (U^{(2)}^{⊺} X_{k} Y_{k}^{⊺} U^{(1)}),

\frac{1}{2 η} (∥ U^{(1)} ∥_{F}^{2} + ∥ U^{(2)} ∥_{F}^{2} + ∥ U^{(3)} ∥_{F}^{2})

\frac{1}{2 η} (∥ U^{(1)} ∥_{F}^{2} + ∥ U^{(2)} ∥_{F}^{2} + ∥ U^{(3)} ∥_{F}^{2})

\mathrm{TV}({\mathbf{U}^{(3)}})=\sum_{r=1}^{R}\sum_{k=2}^{T}\big{|}u^{(3)}_{kr}-u^{(3)}_{k-1,r}\big{|}=\sum_{r=1}^{R}\|\mathbf{D}\,\mathbf{u}^{(3)}_{:r}\|_{1}.

\mathrm{TV}({\mathbf{U}^{(3)}})=\sum_{r=1}^{R}\sum_{k=2}^{T}\big{|}u^{(3)}_{kr}-u^{(3)}_{k-1,r}\big{|}=\sum_{r=1}^{R}\|\mathbf{D}\,\mathbf{u}^{(3)}_{:r}\|_{1}.

\mathbf{D}=\left[\begin{array}[]{rrrrrr}1&-1&&&\ldots&0\\ &1&-1&&&\vdots\\ &&\ddots&\ddots&&\\ \vdots&&&1&-1&\\ 0&\ldots&&&1&-1\end{array}\right].

\mathbf{D}=\left[\begin{array}[]{rrrrrr}1&-1&&&\ldots&0\\ &1&-1&&&\vdots\\ &&\ddots&\ddots&&\\ \vdots&&&1&-1&\\ 0&\ldots&&&1&-1\end{array}\right].

\mathrm{Spline}({\mathbf{U}^{(3)}})=\frac{1}{2}\sum_{r=1}^{R}\sum_{k=2}^{T}\big{(}u^{(3)}_{kr}-u^{(3)}_{k-1,r}\big{)}^{2}=\frac{1}{2}\|\mathbf{D}{\mathbf{U}^{(3)}}\|_{F}^{2}.

\mathrm{Spline}({\mathbf{U}^{(3)}})=\frac{1}{2}\sum_{r=1}^{R}\sum_{k=2}^{T}\big{(}u^{(3)}_{kr}-u^{(3)}_{k-1,r}\big{)}^{2}=\frac{1}{2}\|\mathbf{D}{\mathbf{U}^{(3)}}\|_{F}^{2}.

C = \frac{1}{2} k = 1 \sum T ∥ Y_{k} - A_{k} X_{k} ∥_{F}^{2} + \frac{1}{2 η} (∥ U^{(1)} ∥_{F}^{2} + ∥ U^{(2)} ∥_{F}^{2} + ∥ U^{(3)} ∥_{F}^{2}) + β R (U^{(3)}),

C = \frac{1}{2} k = 1 \sum T ∥ Y_{k} - A_{k} X_{k} ∥_{F}^{2} + \frac{1}{2 η} (∥ U^{(1)} ∥_{F}^{2} + ∥ U^{(2)} ∥_{F}^{2} + ∥ U^{(3)} ∥_{F}^{2}) + β R (U^{(3)}),

C (U^{(1)}, U^{(2)}, U^{(3)}) = f_{0} (U^{(1)}, U^{(2)}, U^{(3)}) + f_{1} (U^{(3)}),

C (U^{(1)}, U^{(2)}, U^{(3)}) = f_{0} (U^{(1)}, U^{(2)}, U^{(3)}) + f_{1} (U^{(3)}),

S_{a} = {(U^{(1)}, U^{(2)}, U^{(3)}) : C (U^{(1)}, U^{(2)}, U^{(3)}) \leq a} .

S_{a} = {(U^{(1)}, U^{(2)}, U^{(3)}) : C (U^{(1)}, U^{(2)}, U^{(3)}) \leq a} .

\mathbf{X}=\left[\begin{array}[]{cccc}|&|&&|\\ \bm{x}(1)&\bm{x}(2)&\ldots&\bm{x}(TM)\\ |&|&&|\end{array}\right]\quad\mbox{and}\quad\mathbf{Y}=\left[\begin{array}[]{cccc}|&|&&|\\ \bm{x}(2)&\bm{x}(3)&\ldots&\bm{x}(TM+1)\\ |&|&&|\end{array}\right].

\mathbf{X}=\left[\begin{array}[]{cccc}|&|&&|\\ \bm{x}(1)&\bm{x}(2)&\ldots&\bm{x}(TM)\\ |&|&&|\end{array}\right]\quad\mbox{and}\quad\mathbf{Y}=\left[\begin{array}[]{cccc}|&|&&|\\ \bm{x}(2)&\bm{x}(3)&\ldots&\bm{x}(TM+1)\\ |&|&&|\end{array}\right].

\frac{∣ C _{t} - C _{t - 1} ∣}{C _{t - 1}} < rtol or ∣ C_{t} - C_{t - 1} ∣ < atol .

\frac{∣ C _{t} - C _{t - 1} ∣}{C _{t - 1}} < rtol or ∣ C_{t} - C_{t - 1} ∣ < atol .

RMSE = \frac{1}{N M T} k = 1 \sum T ∥ Y_{k} - A_{k} X_{k} ∥_{F}^{2} .

RMSE = \frac{1}{N M T} k = 1 \sum T ∥ Y_{k} - A_{k} X_{k} ∥_{F}^{2} .

\mathbf{A}(t)=\mathbf{W}\left(\begin{array}[]{lr}\cos(\theta(t))&-\sin(\theta(t))\\ \sin(\theta(t))&\cos(\theta(t))\end{array}\right)\mathbf{W}^{\intercal},

\mathbf{A}(t)=\mathbf{W}\left(\begin{array}[]{lr}\cos(\theta(t))&-\sin(\theta(t))\\ \sin(\theta(t))&\cos(\theta(t))\end{array}\right)\mathbf{W}^{\intercal},

K (t, t^{'}) = exp [- (\frac{t - t ^{'}}{30})^{2}] + 0.001 δ_{t, t^{'}} .

K (t, t^{'}) = exp [- (\frac{t - t ^{'}}{30})^{2}] + 0.001 δ_{t, t^{'}} .

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

Time-varying Autoregression with Low Rank Tensors

Kameron Decker Harris Corresponding author: Computer Science & Engineering, Biology, University of Washington, [email protected]

Aleksandr Aravkin Applied Mathematics, University of Washington, [email protected]

Rajesh Rao Computer Science & Engineering, University of Washington, [email protected]

Bingni Wen Brunton Biology, University of Washington, [email protected]

Abstract

We present a windowed technique to learn parsimonious time-varying autoregressive models from multivariate timeseries. This unsupervised method uncovers interpretable spatiotemporal structure in data via non-smooth and non-convex optimization. In each time window, we assume the data follow a linear model parameterized by a system matrix, and we model this stack of potentially different system matrices as a low rank tensor. Because of its structure, the model is scalable to high-dimensional data and can easily incorporate priors such as smoothness over time. We find the components of the tensor using alternating minimization and prove that any stationary point of this algorithm is a local minimum. We demonstrate on a synthetic example that our method identifies the true rank of a switching linear system in the presence of noise. We illustrate our model’s utility and superior scalability over extant methods when applied to several synthetic and real-world example: two types of time-varying linear systems, worm behavior, sea surface temperature, and monkey brain datasets.

1 Introduction

Data-driven linear models are a common approach to modeling multivariate timeseries and have been studied extensively in the mathematical sciences and applied fields. These domains include Earth and atmospheric sciences [23, 15, 1], fluid dynamics [33, 2, 9, 37], and neuroscience [8, 26]. Even when the underlying dynamics are nonlinear, such linear approximations may be justified by their connection to the eigenfunctions of the Koopman operator [9, 25], known as the dynamic mode decomposition [33, 39, 34]. In order to capture non-stationary behavior, it is important to leverage dynamic linear models (DLMs) that vary over time [41]. However, since every linear model is parametrized by a system matrix, and this matrix changes over time, a naive DLM fit to a length $T$ timeseries of $N$ variables has $\mathcal{O}(TN^{2})$ parameters, which could be very large. We propose to manage such complexity by representing the stack of system matrices as a low rank tensor with only $\mathcal{O}(T+2N)$ parameters (Figure 1).

There is a rich and extensive literature on DLMs, including time-varying autoregressive (TVAR) and switching linear dynamical systems (SLDS) models that we cannot review in full here [41]. Ours is a regularized optimization method [in the spirit of 18, 10, 11, 42, 36], complementary to the Bayesian approaches taken in much of the literature [31, 16, 14, 24]. We also would like to highlight the recent work of [13], which uses likelihood tests to adaptively segment a timeseries and fit different models to each segment. However, our approach is inherently more scalable due to the low rank assumptions we make.

The key innovation of our model is to parametrize the dynamics by a low rank tensor for computational tractability, ease of identification, and interpretation. Tensor decompositions [22] are a powerful technique for summarizing multivariate data and an area of ongoing research in theory [17, 3, among others] and applications [e.g., to improve neural networks 28, 20]. In our formulation, the system tensor representing the DLM is regressed against the data. In this aspect, our method is most similar to the work of Yu and colleagues who considered spatiotemporal forecasting with spatial smoothness regularization [4, 45, 44], in contrast to our temporal smoothness, although the possibility is mentioned in their dissertation [43]. Our work also differs in the emphasis on non-smooth regularization to find switching or other temporally structured behavior.

Notation

We follow the conventions of [22] for notation. In brief, tensors are denoted with calligraphic bold ( $\bm{\mathscr{A}}$ ), matrices with capital bold ( $\mathbf{A}$ ), and vectors with lower-case bold symbols ( $\bf a$ ). MATLAB-style colon notation represents slices, e.g. ${\bf A}_{i:}$ is the row vector formed from the $i$ th row of a matrix ${\bf A}$ . All of the tensors we consider are third-order. For a tensor $\bm{\mathscr{A}}$ , let ${\bf A}_{k}={\bm{\mathscr{A}}}_{::k}$ be its $k$ th frontal slice.

2 The TVART model

We now introduce our time-varying autoregressive model with low rank tensors (TVART, Figure 1). Assume we have sampled the trajectory of a dynamical system $\bm{x}(t)\in\mathbb{R}^{N}$ for $t=1,\ldots,\tau+1$ . We split this trajectory into $T$ non-overlapping windows of length $M$ , so that $TM=\tau$ . Let $\bm{\mathscr{X}}$ be the $N\times M\times T$ tensor with entries $(\bm{\mathscr{X}})_{ijk}=x_{i}((k-1)M+j),$ and similarly let $\bm{\mathscr{Y}}$ be a tensor of the same size with entries shifted by one time point, $(\bm{\mathscr{Y}})_{ijk}=x_{i}((k-1)M+j+1)$ . We call the frontal slices $\mathbf{X}_{k},\mathbf{Y}_{k}\in\mathbb{R}^{N\times R}$ the snapshot matrices for window $k$ . The first of these are

[TABLE]

The subsequent snapshots $\mathbf{X}_{k},\mathbf{Y}_{k}$ for $k>1$ are each shifted by $(k-1)M$ .

The goal of TVART is to fit an $N\times N\times T$ tensor $\bm{\mathscr{A}}$ of system matrices, so that $\mathbf{Y}_{k}\approx\mathbf{A}_{k}\mathbf{X}_{k}$ for $k=1,\ldots,T$ , where $\mathbf{A}_{k}$ is the $k$ th frontal slice of $\bm{\mathscr{A}}$ . The assumption underlying this goal is that $\bm{x}(t+1)\approx\mathbf{A}(t)\,\bm{x}(t)$ where $\mathbf{A}(t)$ is constant within a window. This assumption motivates the least squares optimization problem below, which is equivalent to assuming uncorrelated Gaussian errors:

[TABLE]

Without the rank constraint, TVART factors into decoupled problems for each window $\mathbf{A}_{k}$ . In order to limit the degrees of freedom in the tensor $\bm{\mathscr{A}}$ , we use a low rank formulation. Specifically, we represent $\bm{\mathscr{A}}$ using the canonical polyadic (CP) decomposition [22] of rank $R$ as

[TABLE]

in terms of the factor matrices ${\mathbf{U}^{(1)}}\in\mathbb{R}^{N\times R}$ , ${\mathbf{U}^{(2)}}\in\mathbb{R}^{N\times R}$ , and ${\mathbf{U}^{(3)}}\in\mathbb{R}^{T\times R}$ , where

[TABLE]

Thus, the number of parameters is reduced to $(2N+T)R$ , which is now linear in $N$ and $T$ . We can optionally normalize the factors to have unit-length columns and capture their scalings in a vector $\bm{\lambda}\in\mathbb{R}^{R}$ ; we consider this a postprocessing step and explicitly state when we do this.

When solving TVART for $\bm{\mathscr{A}}$ , we work directly with its frontal slices

[TABLE]

The matrix $\mathbf{D}^{(k)}$ is the $R\times R$ diagonal matrix formed from the $k$ th row of ${\mathbf{U}^{(3)}}$ [22].

Definition 2.1.

*We call the matrices ${\mathbf{U}^{(1)}}$ , ${\mathbf{U}^{(2)}}$ , and ${\mathbf{U}^{(3)}}$ the TVART dynamical modes. Specifically, we refer to ${\mathbf{U}^{(1)}}$ and ${\mathbf{U}^{(2)}}$ as the left and right spatial modes, since they determine the loadings of $\mathbf{A}_{k}$ onto the spatial dimensions/channels in the data. The matrix ${\mathbf{U}^{(3)}}$ contains the temporal modes, since it determines the time-variation of the system matrix $\mathbf{A}_{k}$ across windows. *

In some ways, (1) is similar to the singular value decomposition (SVD): each slice $\mathbf{A}_{k}$ is the product of a low rank matrix ${\mathbf{U}^{(1)}}$ , a diagonal matrix $\mathbf{D}^{(k)}$ , and another low rank matrix ${\mathbf{U}^{(2)}}^{\intercal}$ . However, unlike the SVD, the left and right spatial modes are not orthogonal. Let ${\mathbf{U}^{(1)}}=\mathbf{Q}^{(1)}\mathbf{R}^{(1)}$ and ${\mathbf{U}^{(2)}}=\mathbf{Q}^{(2)}\mathbf{R}^{(2)}$ be the QR decompositions of the left and right spatial modes, so that $\mathbf{A}_{k}=\mathbf{Q}^{(1)}\mathbf{R}^{(1)}\mathbf{D}^{(k)}\mathbf{R}^{(2)\intercal}\mathbf{Q}^{(2)\intercal}$ . Thus, in order to calculate the SVD of $\mathbf{A}_{k}$ , we would have to take the SVD of the $R\times R$ matrix $\mathbf{R}^{(1)}\mathbf{D}^{(k)}\mathbf{R}^{(2)\intercal}$ . In the CP decomposition, slices $\mathbf{A}_{k}$ and $\mathbf{A}_{k^{\prime}}$ may have different left and right singular vectors, but these singular vectors are always in the span of $\mathbf{Q}^{(1)}$ and $\mathbf{Q}^{(2)}$ . This flexibility allows TVART to fit multiple linear models with different singular subspaces.

2.1 Extensions: affine dynamics and higher-order autoregressions

In many applications, the mean of the data may drift over time, and thus affine models of the dynamics $\mathbf{x}(t+1)=\mathbf{A}_{k}\mathbf{x}(t)+\mathbf{b}_{k}$ are more appropriate than linear models [21]. We can fit an affine model of this type within the TVART framework by appending a row of ones to each $\mathbf{X}_{k}$ and extending ${\mathbf{U}^{(2)}}$ by one row to build in a $\mathbf{b}_{k}$ term. In this case, we have that $\mathbf{b}_{k}={\mathbf{U}^{(1)}}\mathbf{D}^{(k)}\mathbf{c}$ , where $\mathbf{c}$ is the extra row $\mathbf{u}^{(2)}_{N+1,:}$ .

Furthermore, autoregressive models of higher order are often considered, where $\mathbf{x}(t+1)$ is predicted from data with $P$ lags $\mathbf{x}(t),\ldots,\mathbf{x}(t-P+1)$ [41, 31]. In this case, the dimensions of $\bm{\mathscr{X}}$ , $\bm{\mathscr{A}}$ , and ${\mathbf{U}^{(2)}}$ change to $NP\times M\times T$ , $N\times NP\times T$ , and $NP\times R$ , respectively, but otherwise the mathematics remain equivalent. For simplicity, we focus on just the $P=1$ case, but higher-order autoregressive models are likely better-suited to certain applications.

3 Alternating least squares algorithm

We now describe in detail the optimization routine used to solve (TVART). Define the loss function to be

[TABLE]

Minimizing this loss is an equivalent formulation of the TVART problem. The loss (2) is quadratic and convex in each of the variables ${\mathbf{U}^{(1)}},{\mathbf{U}^{(2)}}$ , and ${\mathbf{U}^{(3)}}$ , but it is not jointly convex in all variables. This motivates our use of an alternating minimization algorithm to minimize this cost; this is also called coordinate descent or, since the loss is quadratic, alternating least squares (ALS). Minimizing over any one of the ${\mathbf{U}^{(i)}}$ is a least squares problem that can be solved by a linear matrix equation. In the next three subsections, we compute the gradients of $L$ with respect to each set of variables and discuss methods for solving the normal equations $\frac{\partial L}{\partial{\mathbf{U}^{(i)}}}=0$ . Alternating minimization is used in our final approach, Algorithm 1, which incorporates regularization.

We will use the following useful identity mutiple times:

[TABLE]

3.1 Left Spatial Modes ${\mathbf{U}^{(1)}}$

Taking partial derivatives of $L$ with respect to ${\mathbf{U}^{(1)}}$ , we can employ (3) with $\mathbf{Y}\to\mathbf{Y}_{k},\mathbf{A}\to\mathbf{I},\mathbf{C}\to\mathbf{D}^{(k)}{\mathbf{U}^{(2)}}^{\intercal}\mathbf{X}_{k}$ and $\mathbf{B}\to{\mathbf{U}^{(1)}}^{\intercal}$ and take the transpose to get

[TABLE]

Setting $\frac{\partial L}{\partial{{\mathbf{U}^{(1)}}}}=0$ means we must solve a matrix equation for ${\mathbf{U}^{(1)}}$ . This requires forming an $N\times R$ matrix, and right multiplying by the pseudoinverse of an $R\times R$ matrix.

3.2 Right Spatial Modes ${\mathbf{U}^{(2)}}$

We now differentiate $L$ with respect to ${\mathbf{U}^{(2)}}$ . Replacing $\mathbf{Y}\to\mathbf{Y}_{k},\mathbf{A}\to{\mathbf{U}^{(1)}}\mathbf{D}^{(k)},\mathbf{C}\to\mathbf{X}_{k}$ , and $\mathbf{B}\to{\mathbf{U}^{(2)}}$ in (3) leads to the following:

[TABLE]

which we can rewrite in the form

[TABLE]

where $\mathbf{B}\in\mathbb{R}^{N\times R}$ , $\mathbf{L}_{k}\in\mathbb{R}^{N\times N}$ , and $\mathbf{R}_{k}\in\mathbb{R}^{R\times R}$ .

We set $\frac{\partial L}{\partial{{\mathbf{U}^{(2)}}}}=0$ , so that (6) implies a Sylvester equation for the optimal ${\mathbf{U}^{(2)}}$ :

[TABLE]

For small $N$ and $R$ , we can use the Kronecker product $\otimes$ and vectorization operations to rewrite (7) as a linear vector equation

[TABLE]

This is a square $NR\times NR$ linear system in $NR$ unknowns. However, for large $N$ , it is impractical to even form this matrix. Therefore, we solve the Sylvester equation (7) via the matrix-valued method of conjugate gradients (CG). Note that we do not need to form the $N\times N$ matrices $\mathbf{L}_{k}=\mathbf{X}_{k}\mathbf{X}_{k}^{\intercal}$ to use matrix-free CG.

3.3 Temporal Modes ${\mathbf{U}^{(3)}}$

Finally, we derive the gradients of $L$ with respect to ${\mathbf{U}^{(3)}}$ . The equations for ${\mathbf{U}^{(3)}}$ , through which we define the $\mathbf{D}^{(k)}$ , are similar except in this case the unknown is a diagonal matrix. The loss (2) with respect to $\mathbf{D}^{(k)}$ is decoupled across windows, which leads to decoupled gradients as well. Replacing $\mathbf{Y}\to\mathbf{Y}_{k},\mathbf{A}\to{\mathbf{U}^{(1)}},\mathbf{C}\to{\mathbf{U}^{(2)}}^{\intercal}\mathbf{X}_{k}$ , and $\mathbf{B}\to\mathbf{D}^{(k)}$ in (3) leads to

[TABLE]

where $*$ is the elementwise or Hadamard product, which enforces the diagonal constraint on $\mathbf{D}^{(k)}$ and its gradient, and $\mathbf{L}_{k}^{\prime}={\mathbf{U}^{(2)}}^{\intercal}\mathbf{X}_{k}\mathbf{X}_{k}^{\intercal}{\mathbf{U}^{(2)}}$ and $\mathbf{R}_{k}^{\prime}={\mathbf{U}^{(1)}}^{\intercal}{\mathbf{U}^{(1)}}$ are $R\times R$ matrices.

By Theorem 2.5 in [27], we have that

[TABLE]

where we have substituted $\mathbf{D}^{(k)}=\mathrm{diag}(\mathbf{u}^{(3)}_{k:})$ . Thus the gradient can be written as

[TABLE]

where $\mathrm{vecdiag}(\mathbf{M})$ is the column vector formed by the diagonal elements of $\mathbf{M}$ . Thus, the least-squares problem for ${\mathbf{U}^{(3)}}$ involves solving an $R\times R$ system for each row $\mathbf{u}^{(3)}_{k:}$ of ${\mathbf{U}^{(3)}}$ independently, $k=1,\ldots,T$ . However, adding regularization to the temporal modes destroys this decoupling across time windows, which is why we resort to CG or proximal gradient approaches in Algorithm 1.

3.4 Computational complexity of alternating least squares

Each minimization step requires the solution of a linear matrix equation $\frac{\partial L}{\partial{\mathbf{U}^{(d)}}}=0$ for ${\mathbf{U}^{(d)}}$ . We now comment on the difficulty of solving the subproblems in ${\mathbf{U}^{(1)}}$ , ${\mathbf{U}^{(2)}}$ , and ${\mathbf{U}^{(3)}}$ . The subproblem in ${\mathbf{U}^{(1)}}$ is by far the easiest, since it involves just one $R\times R$ system solve or pseudoinverse. The subproblem in ${\mathbf{U}^{(3)}}$ requires $T$ decoupled $R\times R$ system solves. The cost of this step is thus linear in $T$ . We assume that $R$ is small enough that solving for each time window is fast. By far the most difficult step is the subproblem for the right hand factors ${\mathbf{U}^{(2)}}$ . This challenge is in part because we obtain a Sylvester equation which we might naively solve using the Kronecker reformulation as an $NR\times NR$ square system in $NR$ unknowns. When we avoid using Kronecker products and instead use matrix-free CG, we can be assured that this will take less memory but could take more time, depending on the condition number of the Kronecker product matrix (8). However, in either case the subproblem involves coefficient matrices of size $N$ , whereas for the other problems these are of size $R$ .

4 Regularization

Alternating least squares is a natural approach to the TVART problem as formulated. However, we have found that ALS is numerically unstable for the switching linear test problem (Sec. 5.1). This instability is worst with low or zero observation noise, but it is alleviated when larger noise added to the data. Additive, independent noise adds a diagonal component to the data covariance, which suggests we apply Tikhonov regularization to the problem. An additional motivation for Tikhonov regularization is that we do not want the entries in the matrices ${\mathbf{U}^{(1)}}$ , ${\mathbf{U}^{(2)}}$ , and ${\mathbf{U}^{(3)}}$ to become too large, but some might become large due to the scaling indeterminancy of the CP decomposition [22]. We thus add a term

[TABLE]

to the least-squares loss. The parameter $\eta$ controls the magnitude of this Tikhonov regularization: larger $\eta$ is a smaller penalty, and smaller $\eta$ is a stronger penalty. In matrix completion problems, a similar two-term regularization is often added and can be seen as a convex relaxation of the matrix rank.

4.1 Temporal smoothing

We also consider further regularization of the temporal modes. Recall that the rows of ${\mathbf{U}^{(3)}}$ correspond to the loadings at different time windows. By forcing these rows to be correlated, we keep the system matrices $\mathbf{A}_{k}$ from varying too much from window to window, a form of temporal smoothing. The first regularizer we consider is a total variation (TV) penalty:

[TABLE]

Matrix $\mathbf{D}$ is the $(T-1)\times T$ first difference matrix with free boundary conditions:

[TABLE]

TV prefers piecewise constant time components $\mathbf{u}^{(3)}_{:r}$ since it penalizes nonzero first-differences with column-wise $\ell_{1}$ penalty $\|\cdot\|_{1}$ to enforce sparsity, appropriate for an SLDS.

Alternatively, we consider a spline penalty:

[TABLE]

This linear smoother penalizes the $\ell_{2}$ -norm of the first derivative, leading to smoothly varying solutions.

4.2 Regularized cost function

We modify the problem (TVART), adding the Tikhonov and smoothing penalties to the loss function. These additions result in the regularized cost function

[TABLE]

where $\mathbf{A}_{k}$ follows equation (1) as before. Here, $\mathcal{R}(\cdot)$ is either TV $(\cdot)$ or Spline $(\cdot)$ . Increasing the temporal smoothing strength $\beta$ leads to stronger regularization, as does decreasing $\eta$ .

4.3 Alternating minimization algorithm

We solve the regularized problem TVART* using alternating minimization, also known as block coordinate descent, detailed in Algorithm 1. The subroutines that minimize for ${\mathbf{U}^{(1)}},{\mathbf{U}^{(2)}}$ and ${\mathbf{U}^{(3)}}$ require different approaches. Since the objective TVART* is quadratic in ${\mathbf{U}^{(1)}}$ and ${\mathbf{U}^{(2)}}$ , we find these by solving a linear matrix equation either directly or using CG; CG works best for solving the Sylvester equation in ${\mathbf{U}^{(2)}}$ . For ${\mathbf{U}^{(3)}}$ with the Spline penalty, the cost is again quadratic so we also use CG. However, the TV penalty is convex but not smooth, so in this case we use the proximal gradient method with Nesterov acceleration [5].

Algorithm 1 has similar complexity to ALS for the unregularized problem. However, in the regularized case the cost is dominated by the minimization over ${\mathbf{U}^{(3)}}$ , which requires computing the proximal operator at each step of the proximal gradient method.

The following theorem proves that when Algorithm 1 converges, it converges to a local minimum of the cost:

Theorem 4.1 (Convergence of alternating minimization).

*For a convex regularizer $\mathcal{R}$ , the sequence of iterates generated by Algorithm 1 is defined and bounded, and every cluster point is a coordinatewise minimum, i.e. a Nash point, of the regularized cost TVART*. *

Proof 4.2.

We use the framework of [38, Theorem 5.1] for cyclic block coordinate descent, of which Algorithm 1 is an example. This theorem requires objective functions with convex and lower semicontinuous blocks as well as bounded level sets. We split the cost into smooth and non-smooth parts

[TABLE]

where $f_{1}({\mathbf{U}^{(3)}})=\beta\mathcal{R}({\mathbf{U}^{(3)}})$ and $f_{0}$ contains the remaining loss and Tikhonov terms. The function $f_{0}$ is continuous and differentiable, and it is $\frac{1}{\eta}$ -strongly convex in each of its blocks ${\mathbf{U}^{(1)}},{\mathbf{U}^{(2)}}$ , and ${\mathbf{U}^{(3)}}$ . However, $f_{0}$ is not a convex function. Also, $f_{1}$ is convex and continuous. Let $(\mathbf{U}_{0}^{(1)},\mathbf{U}_{0}^{(2)},\mathbf{U}_{0}^{(3)})$ be the initialization and $a=C(\mathbf{U}_{0}^{(1)},\mathbf{U}_{0}^{(2)},\mathbf{U}_{0}^{(3)})$ . Denote the level set

[TABLE]

*Then, since $C({\mathbf{U}^{(1)}},{\mathbf{U}^{(2)}},{\mathbf{U}^{(3)}})\geq\frac{1}{2\eta}\left(\|{\mathbf{U}^{(1)}}\|_{F}^{2}+\|{\mathbf{U}^{(2)}}\|_{F}^{2}+\|{\mathbf{U}^{(3)}}\|_{F}^{2}\right)$ and the ball $B_{0}(r)=\{x:\|{\mathbf{U}^{(1)}}\|_{F}^{2}+\|{\mathbf{U}^{(2)}}\|_{F}^{2}+\|{\mathbf{U}^{(3)}}\|_{F}^{2}\leq r\}$ is bounded, we can conclude that the level set $S_{a}\subseteq B_{0}(2\eta a)$ is also bounded. Then by [38, Theorem 5.1], we obtain the result. *

Remark 4.3.

*We did not use any structure of $\mathcal{R}$ besides convexity and lower semicontinuity, thus the same convergence results hold for other regularizations with those properties. However, we did need to use the Tikhonov penalty to ensure that the level sets are bounded. *

4.4 Implementation details

The code and instructions for running it are available from https://github.com/kharris/tvart. We implemented Algorithm 1 in MATLAB. It was run on an Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz, 1200 MHz with 32 cores and 128 GB RAM using Ubuntu 16.04.1 with Linux kernel 4.15.0-48-generic and 64-bit MATLAB R2017b. The CG and prox-gradient subroutines are limited to 24 and 40 iterates, respectively. Proximal gradient uses a backtracking line search to find the step size and the proximal operator of TV is evaluating using prox_tv1d from UNLocBox [30], and this is typically the bottleneck step for large problems. All hyperparameter tuning (for $R$ , $M$ , $\eta$ , and $\beta$ ) was performed manually.

For initialization we use the following method. We consider the entire timeseries in two snapshot matrices,

[TABLE]

We then fit a single linear model to the timeseries by $\mathbf{A}=\mathbf{Y}\mathbf{X}^{\dagger}$ and form the matrices $[\mathbf{U},\mathbf{S},\mathbf{V}]=\mathrm{svd}(\mathbf{A})$ . When $R=N$ , the TVART modes are then initialized to $\mathbf{U}_{0}^{(1)}=\mathbf{U}$ , $\mathbf{U}_{0}^{(2)}=\mathbf{V}$ , and $\mathbf{U}_{0}^{(3)}=\mathbf{1}_{T}\mathbf{1}_{R}^{\intercal}/\sqrt{T}$ is the matrix of all-ones with columns normalized. If $R<N$ , we truncate the smaller singular vectors, and when $R>N$ , we add columns equal to $\mathbf{1}_{N}/\sqrt{N}$ to $\mathbf{U}_{0}^{(1)}$ and $\mathbf{U}_{0}^{(2)}$ and $\mathbf{1}_{T}/\sqrt{T}$ to $\mathbf{U}_{0}^{(3)}$ . Initializations with unequal columns appear to help speed up the initial phase of the optimization, as opposed to all-ones. For this reason, we add Gaussian noise to the initializations with standard deviation $0.5/\sqrt{N}$ to $\mathbf{U}_{0}^{(1)}$ and $\mathbf{U}_{0}^{(2)}$ and with standard deviation $0.5/\sqrt{T}$ to $\mathbf{U}_{0}^{(3)}$ . Initializing to zeros is not appropriate since the gradients in that case are always zero. All of the results we present do not depend strongly on the realization of the noise.

For stopping criteria, we use both a relative and absolute tolerance on the decrease in the cost function. Let $C_{t}$ be the cost of TVART* at iterate $t$ . Then, we stop if either

[TABLE]

Unless specified otherwise, rtol = $10^{-4}$ and atol = $10^{-6}$ . In all cases we have tested, the relative tolerance is acheived first. We report the cost TVART* as it runs as well as the root mean square error (RMSE), which we define as:

[TABLE]

The RMSE is normalized so that it gives a measure of average one-step prediction error per channel. This “goodness of fit” metric can then be compared to the standard deviation of the data.

5 Example applications

In this section we test our method on both synthetic data and real-world datasets. With synthetic data generated by a switching or smoothly varying linear dynamical system, we show that TVART can recover the true dynamics and is competitive with other state-of-the-art techniques. In real-world data examples, we highlight how the recovered modes are interpretable and can correspond to important dynamical regimes. For a complete list of the parameters used, please refer to Appendix B.

5.1 Test problem 1: switching low rank linear system

5.1.1 Model and data generation

We first apply TVART to a simple test case where the true model is a low rank, switching linear system. We generate two $N\times N$ system matrices $\mathbf{A}_{1}$ and $\mathbf{A}_{2}$ , which are random, rank-2 rotation matrices. For the first half of our timeseries, the dynamics follow $\mathbf{A}_{1}$ , and then they switch to $\mathbf{A}_{2}$ . Specifically, we form the matrices $\mathbf{A}_{i}$ for $i=1,2$ by the following process:

Generate a $2\times 2$ rotation matrix $\mathbf{A}_{i}^{\prime}=\left(\begin{array}[]{lr}\cos(\theta_{i})&-\sin(\theta_{i})\\ \sin(\theta_{i})&\cos(\theta_{i})\end{array}\right)$ . 2. 2.

Draw a random matrix $\mathbf{Z}_{i}\in\mathbb{R}^{N\times 2}$ , with standard Gaussian entries. 3. 3.

Take the SVD: $[\mathbf{W}_{i},\mathbf{S}_{i},\mathbf{V}_{i}]=\mathrm{svd}(\mathbf{Z}_{i})$ . 4. 4.

Form a larger rotation matrix by projecting with onto the left singular vectors: $\mathbf{A}_{i}=\mathbf{W}_{i}\mathbf{A}_{i}^{\prime}\mathbf{W}_{i}^{\intercal}$ .

Then, $\mathbf{A}_{i}$ is a random rotation matrix of rank 2. We use $\theta_{1}=0.1\pi$ and $\theta_{2}=0.37\pi$ .

The trajectory is generated by starting with the initial condition $\mathbf{x}(1)=\mathbf{1}$ and iterating 200 steps with $\mathbf{A}_{1}$ to remove a transient. After the transient, the vector is renormalized to have $\|\mathbf{x}(t)\|=\sqrt{N}$ . For the first half of the timeseries $t<\tau/2$ , we iterate $\mathbf{x}(t+1)=\mathbf{A}_{1}\,\mathbf{x}(t)$ , while for the second half $t\geq\tau/2$ we iterate $\mathbf{x}(t+1)=\mathbf{A}_{2}\,\mathbf{x}(t)$ . After the first application of $\mathbf{A}_{2}$ , we again rescale $\mathbf{x}(t)$ to have norm $\sqrt{N}$ .

After generating the noiseless trajectory, we add independent, Gaussian observation noise with standard deviation $\sigma$ to each entry of $\mathbf{x}(t)$ . Since the noise vector norm scales as $\sqrt{N}$ , rescaling the signal vector by the same factor allows us to maintain the same signal-to-noise ratio across different system sizes $N$ . The RMSE of a model that is not overfit should be approximately $\sigma$ for any $N$ .

5.1.2 Performance of TVART

We present results in Figure 2 for $N=10$ and $\sigma=0.5$ . The TVART algorithm is run with $M=20$ , $R=8$ , $\eta=1/N$ , $\beta=5$ , and TV regularization. Algorithm 1 converges in 30 iterations with an RMSE of 0.554, which is close to the noise floor $\sigma=0.5$ . The temporal modes are stable for the first 5 windows, when system $\mathbf{A}_{1}$ is active, then switch to a different state for the remaining windows where $\mathbf{A}_{2}$ is active. Thus just from the temporal modes, we can determine the change point to the resolution of the window size. Examining the TVART output in more detail, only the first four spatial and temporal modes are significantly different from 0. Remarkably, TVART is able to discover the true rank of the system; this is 4 because $\mathbf{A}_{1}$ and $\mathbf{A}_{2}$ are each rank 2. Furthermore, we compare the TVART reconstruction of $\mathbf{A}_{1}$ to the truth; the reconstruction matches the truth very closely (Figure 2).

We now describe some other behaviors that have been observed in this test problem for parameters that are not shown. When the noise $\sigma$ is very small or zero, the Tikhonov regularization term is important. If we eliminate this term by taking $\eta\to\infty$ , the ALS subproblems become ill-conditioned and lead to poor results. Without the Tikhonov penalty, the true rank of the system is also less obvious—the entries which are nearly 0 in Figure 2 are noisier—whereas the penalty shrinks these noisy entries. The TV regularization, in contrast, is important for selecting a switching solution under moderate to large noise.

5.1.3 Comparison to other methods

We compared TVART method to two alternative approaches: independent windowed fits using multivariate regression, for a simple baseline, as well as a Bayesian SLDS fit using the ssm package available from https://github.com/slinderman/ssm.

The independent model is tested with full rank and rank truncation to 4 or 6, using the SVD. To fit a rank $R<N$ model on the $k$ th window of data, we first perform the SVD on the entire $k$ th timeseries, fit a linear model for the $R$ leading temporal modes, and project back into the $N$ -dimensional space using the spatial modes.

The SLDS fit was allowed 6000 iterations and performed with either a 4 or 6 dimensional latent space with the maximum number of switches parameter set to Kmax = 4 and otherwise default parameter values.

For TVART we use a rank of either 4 or 6 and the parameters $M=20,\eta=1/N$ , $\beta=1$ , and TV regularization. Recall that the true rank of the system is 4.

In Figure 3, we show the results as we sweep across system sizes $N$ from 10 to 4000. We depict the error in the reconstruction $\|\mathbf{A}-\hat{\mathbf{A}}\|_{2}$ averaged across all time windows. Note that if we measure the entrywise error in the reconstruction, this decreases with $N$ , whereas the operator norm error is constant; the relative trends remain the same in any norm. We see that TVART of either rank is able to recover the true dynamics $\mathbf{A}$ better than any of the other methods. SLDS performs worse but slightly better than independent for rank 4, rank 6 is similar, although both degrades for higher system sizes $N$ . TVART is also much faster than SLDS, by roughly an order of magnitude.

These results are with the current version of ssm with the default settings. A previous version of this package with similar settings did perform better (results are shown in an earlier version of this paper, available at https://arxiv.org/abs/1905.08389v1). However, in that case, TVART was still as good or better than the alternative methods while running much faster. We conclude that TVART with TV regularization is a scalable, alternative way of finding low rank switching linear dynamics.

5.2 Test problem 2: smoothly varying low rank linear system

5.2.1 Model and data generation

Another comparison was made by assuming rank 2 orthogonal dynamics that are smooth. In the previous switching linear example, the dynamics were generated from a rotation matrix for two different angles $\theta_{1}$ and $\theta_{2}$ . In this case, the angle $\theta(t)$ comes from a smooth Gaussian process in order to have slowly-varying rank 2 dynamics. In this case, the orthogonal projection into the higher space $\mathbf{W}$ is fixed for all time.

In detail, we form a different system matrix at each time step $t$ by

[TABLE]

where the random basis $\mathbf{W}$ is drawn as before. The angle $\theta(t)$ is from a centered Gaussian process with covariance function

[TABLE]

This squared exponential covariance produces smooth trajectories $\theta(t)$ that are approximately constant over the timescale of $\sim 30$ time steps. The small diagonal covariance ensures numerical stability of the Cholesky decomposition. We add Gaussian observation noise with standard deviation $\sigma=0.2$ .

5.2.2 Results

We depict the data for a realization with $N=10$ in Figure 4. This includes the angle $\theta(t)$ drawn from the Gaussian process and observations of the dynamical system output. It is clear that the frequency of oscillations in the state variables are stable over short timescales, but that the frequency is modulated as $\theta(t)$ changes.

TVART is applied to these data with $R=4$ , $M=1$ , $\eta=6/N$ , and $\beta=600\,\log_{10}^{2}(N)$ with the Spline regularization. The temporal modes are shown as well as a comparison of $\mathbf{A}(t)$ versus the inferred $\hat{\mathbf{A}}(t)$ at $t=1$ . Interestingly, with these values of regularization, we can fit an accurate model at every time point without overfitting. The recovered system is effectively rank 3 selected by the Tikhonov penalty.

We also tested the method for other values of $N$ varying between 6 and 4000. The results are not shown, but the performance of TVART is stable across system sizes, with $\|\mathbf{A}-\hat{\mathbf{A}}\|_{2}\approx 0.1$ . Independent fits perform significantly worse, and SLDS is even poorer. This is not surprising, since an independent model is sure to overfit while the SLDS model is inherently not smooth and thus highly biased.

5.3 Dataset 1: worm behavior

We analyze the escape response behavior of the nematode worm Caenorhabditis elegans in response to a heat stimulus [6, 7]. These were used as test data for clustering via adaptive linear models in the recent work of [13]. Worm postural data were analyzed as smooth timeseries of $N=4$ “eigenworm” principal components. We ran TVART with an affine model and $R=6,M=6,\eta=0.05$ , and $\beta=6$ on these data. We also compared the performance of the code provided by [13] at https://github.com/AntonioCCosta/local-linear-segmentation.

The results for worm 1 are shown in Figure 5. We performed clustering on the system matrices as before with three clusters and found that these clusters matched the three behaviors in the data: forward crawling, a turn, and backward crawling. These clusters are not trivial: the data means during forward and backward motion are approximately equal, but the phase velocity switches.

Finally, we also compared our results to the code provided by [13], which fits an adaptive linear model and clusters the same timeseries; the clustering results are essentially the same. In terms of runtime, fitting a TVART model, performing clustering, and displaying the results takes 23 s (90 iterates), versus 123 s for the code of [13]. Thus, we see that our method is much faster than theirs, while producing essentially the same clustering results.

5.4 Dataset 2: sea surface temperature

We applied our method to weekly sea surface temperature data from 1990 until present [32]. Sea surface temperature data contain oscillations with varying timescales, from seasonal to multi-decadal. This is also a very high-dimensional test for the TVART algorithm.

Weekly sea surface temperature grids were downsampled by a factor of 6 in the latitudinal and longitudinal directions, resulting in final vectors of length $N=1259$ . We chose a window size of $M=9$ weeks, resulting in $T=171$ windows, and use parameters $\eta=10^{-3}$ , $\beta=10^{4}$ , and the Spline regularizer. Using our standard initialization, the ALS routine stagnates as the matrix ${\mathbf{U}^{(2)}}$ converges very slowly. However, we have found that restarting the algorithm after 10 iterations and setting ${\mathbf{U}^{(2)}}={\mathbf{U}^{(1)}}$ in the new initialization speeds things up significantly. This heuristic was inspired by the fact that the solution has ${\mathbf{U}^{(2)}}\approx{\mathbf{U}^{(1)}}$ . In the end, the algorithm requires 1176 iterations to converge.

The leading modes that are output by the algorithm oscillate seasonally. However, we also find that mode 5, as ordered by $\ell_{2}$ energy, tracks the El Niño-Southern Oscillation (ENSO), Figure 6. The corresponding spatial modes show a plume of warm water in the central and eastern equatorial Pacific Ocean. Warmer than average water in this location is the signature of ENSO. Thus, TVART is able to discover an important dynamical feature in a large spatiotemporal dataset.

5.5 Dataset 3: neural activity during a reaching task

A second high-dimensional dataset comes from electrocorticography recordings (ECoG) of a Japanese macaque monkey Macaca fuscata during a reaching task, provided by the NeuroTycho project [12]. This is an invasive technology that measures brain activity using chronic electrodes placed below the skull and the dura mater. During the task, the monkey makes repeated reaches towards food while its limbs are tracked using motion capture and brain activity is recorded.

We analyzed the data of monkey K1, date 2009-05-25. We ran TVART on the $N=64$ channel ECoG voltage data after filtering out 50 Hz line noise, downsampling to 500 Hz, and standardization. The TVART parameters were $M=200$ , $R=8$ , $\eta=1$ , $\beta=100$ , and TV regularization, resulting in $T=1999$ windows of length 0.4 s.

Figure 7 depicts the results: We show motion capture data with dashed lines at the onset and offsets of movements, using a changepoint detection procedure. We also highlight the result of clustering the TVART temporal modes, obtained from the brain activity alone, into two clusters. The brain activity reveals two dominant modes, one of which is aligned to movement. These movements are accompanied by an increase in high frequency (32-200 Hz) and a decrease in low frequency (2-32 Hz) power. The dominant TVART mode (show in blue) follows this spectral change in the timeseries. TVART tracks this spectral change directly from the timeseries of electrode voltage. Furthermore, the spatial modes associated with it are centered in the premotor region, as depicted in Figure 8. Again, we observe that an important spatiotemporal feature has been extracted via the TVART algorithm.

6 Conclusions

We have presented a low rank tensor formulation of dynamical linear modeling that we call time-varying autoregression with low rank tensors, or TVART. Our method offers the advantage of being able to scale to problems with many state variables or many time points, as highlighted by our examples. It also allows for incorporating prior knowledge of the temporal structure of the dynamics via regularization, which can lead to more stable and accurate reconstructions from less data. We have shown that enforcing different kinds of temporal smoothness aids in the identification of switching or slowly-varying dynamics. We have also shown that TVART can be used with a variety of real datasets, and the modes output by our method are often interpretable as dominant dynamical features in the data.

There are limitations to TVART. First, it has a number of hyperparameters, such as the latent rank, stopping tolerances, and regularization strengths, that must be tuned to the problem at hand. While our method is generally more scalable than Bayesian approaches, it suffers from fragility to hyperparameter choice. In our experiments, it seems that with true underlying linear dynamics (i.e. in the synthetic test cases), good results with rapid convergence may be obtained using a wide range of parameters. On real datasets, tuning these took more effort, although here we limited our tuning to the reuglarization parameters $\eta$ and $\beta$ .

A second limitation is the potential sensitivity of TVART to initialization. Due to non-convexity, there is no guarantee that Algorithm 1 converges to a global optimum. In our experiments, we only found problems with this in the SST and ECoG datasets, and adding noise to the initialization helped find better solutions. We have verified that our results hold qualitatively across multiple random initializations, as is standard with other tensor decompositions [22]. This is our recommended approach.

Finally, we have noted that the convergence sometimes stagnates for large problems, in particular the SST and ECoG datasets. This seems to affect the right spatial modes ${\mathbf{U}^{(2)}}$ more severely than the others. This may be due to the use of matrix-valued CG for that subproblem, but it also seems likely that the subproblem for the right spatial modes is inherently more ill-conditioned with correlated data. Multiresolution methods [29] or explicit regularization of spatial modes could improve convergence and produce more interpretable results with large and spatially-correlated datasets.

Future work with TVART should investigate these and other possible extensions. Higher-order autoregressive models [40], which incorporate multiple time delays stacked into Hankel matrices [35, 19], are a natural next step which would allow for wider applications, e.g. to economic data. Theoretically, it would be good to have more understanding of what linear models of nonlinear dynamics precisely find. It has recently been proven that some other non-convex factorization problems have no spurious local minima [e.g. 17]. Thus, it would be interesting to know whether there may exist similar results for properly constrained CP decomposition models. This might lead to modified versions of TVART with guaranteed optimality.

Acknowledgements

Thank you to Scott Linderman and Nathan Kutz for discussions and to Steven Peterson for help preprocessing the NeuroTycho data. KDH was supported by a Washington Research Foundation Postdoctoral Fellowship. BWB, RR, and KDH were supported by NSF CNS award 1630178 and DARPA award FA8750-18-2-025.

Appendix A Code availability

MATLAB code implementing TVART and instructions for running it are available from https://github.com/kharris/tvart.

Appendix B Example application details

A complete table of parameters used for the test problems and datasets is given in Table 1.

Bibliography45

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] M. A. Alexander, L. Matrosova, C. Penland, J. D. Scott, and P. Chang , Forecasting Pacific SS Ts: Linear Inverse Model Predictions of the PDO , Journal of Climate, 21 (2008), pp. 385–402, https://doi.org/10.1175/2007 JCLI 1849.1 . · doi ↗
2[2] N. A. Allgaier, K. D. Harris, and C. M. Danforth , Empirical correction of a toy climate model , Physical Review E, 85 (2012), https://doi.org/10.1103/Phys Rev E.85.026201 , https://arxiv.org/abs/1107.2690 . · doi ↗
3[3] N. Anari, C. Daskalakis, W. Maass, C. Papadimitriou, A. Saberi, and S. Vempala , Smoothed Analysis of Discrete Tensor Decomposition and Assemblies of Neurons , in Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, eds., Curran Associates, Inc., 2018, pp. 10857–10867.
4[4] M. T. Bahadori, Q. R. Yu, and Y. Liu , Fast Multivariate Spatio-temporal Analysis via Low Rank Tensor Learning , in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, eds., Curran Associates, Inc., 2014, pp. 3491–3499.
5[5] A. Beck and M. Teboulle , A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems , SIAM Journal on Imaging Sciences, 2 (2009), pp. 183–202, https://doi.org/10.1137/080716542 . · doi ↗
6[6] O. D. Broekmans, J. B. Rodgers, W. S. Ryu, and G. J. Stephens , Data from: Resolving coiled shapes reveals new reorientation behaviors in C. elegans , Dryad Digital Repository, (2016), https://doi.org/10.5061/dryad.t 0m 6p . · doi ↗
7[7] O. D. Broekmans, J. B. Rodgers, W. S. Ryu, and G. J. Stephens , Resolving coiled shapes reveals new reorientation behaviors in C. elegans , e Life, 5 (2016), p. e 17227, https://doi.org/10.7554/e Life.17227 . · doi ↗
8[8] B. W. Brunton, L. A. Johnson, J. G. Ojemann, and J. N. Kutz , Extracting spatial–temporal coherent patterns in large-scale neural recordings using dynamic mode decomposition , Journal of Neuroscience Methods, 258 (2016), pp. 1–15, https://doi.org/10.1016/j.jneumeth.2015.10.010 . · doi ↗

TL;DR

Contribution

Findings

Abstract

Peer Reviews

Code & Models

Videos

Time-varying Autoregression with Low Rank Tensors

Abstract

1 Introduction

Notation

2 The TVART model

Definition 2.1**.**

2.1 Extensions: affine dynamics and higher-order autoregressions

3 Alternating least squares algorithm

3.1 Left Spatial Modes U(1){\mathbf{U}^{(1)}}U(1)

3.2 Right Spatial Modes U(2){\mathbf{U}^{(2)}}U(2)

3.3 Temporal Modes U(3){\mathbf{U}^{(3)}}U(3)

3.4 Computational complexity of alternating least squares

4 Regularization

4.1 Temporal smoothing

4.2 Regularized cost function

4.3 Alternating minimization algorithm

Theorem 4.1** (Convergence of alternating minimization).**

Proof 4.2**.**

Remark 4.3**.**

4.4 Implementation details

5 Example applications

5.1 Test problem 1: switching low rank linear system

5.1.1 Model and data generation

5.1.2 Performance of TVART

5.1.3 Comparison to other methods

5.2 Test problem 2: smoothly varying low rank linear system

5.2.1 Model and data generation

5.2.2 Results

5.3 Dataset 1: worm behavior

5.4 Dataset 2: sea surface temperature

5.5 Dataset 3: neural activity during a reaching task

6 Conclusions

Acknowledgements

Appendix A Code availability

Appendix B Example application details

Definition 2.1.

3.1 Left Spatial Modes ${\mathbf{U}^{(1)}}$

3.2 Right Spatial Modes ${\mathbf{U}^{(2)}}$

3.3 Temporal Modes ${\mathbf{U}^{(3)}}$

Theorem 4.1 (Convergence of alternating minimization).

Proof 4.2.

Remark 4.3.