Tensor Graphical Lasso (TeraLasso)

Kristjan Greenewald; Shuheng Zhou; Alfred Hero III

arXiv:1705.03983·stat.ME·September 24, 2019

TL;DR

TeraLasso is a scalable tensor generalization of the Bigraphical Lasso that accurately estimates high-dimensional precision matrices from limited data, revealing conditional dependencies along the coordinates of multiway data, such as space and time.

Contribution

This paper introduces TeraLasso, a novel tensor graphical model with a scalable estimation algorithm, extending the Bigraphical Lasso to multiway data with theoretical guarantees.

Findings

1. Accurately estimates precision matrices from limited high-dimensional data.

2. Recovers meaningful conditional dependency graphs in complex datasets.

3. Proves statistical consistency and convergence rates for the estimators.

Full text

Tensor Graphical Lasso (TeraLasso)

Kristjan Greenewald

IBM Research, Cambridge, USA.

Shuheng Zhou

University of California, Riverside, USA.

Alfred Hero III

University of Michigan, Ann Arbor, USA.

Abstract

This paper introduces a multi-way tensor generalization of the Bigraphical Lasso (BiGLasso), which uses a two-way sparse Kronecker-sum multivariate-normal model for the precision matrix to parsimoniously model conditional dependence relationships of matrix-variate data based on the Cartesian product of graphs. We call this generalization the Tensor graphical Lasso (TeraLasso). We demonstrate using theory and examples that the TeraLasso model can be accurately and scalably estimated from very limited data samples of high dimensional variables with multiway coordinates such as space, time and replicates. Statistical consistency and statistical rates of convergence are established for both the BiGLasso and TeraLasso estimators of the precision matrix and estimators of its support (non-sparsity) set, respectively. We propose a scalable composite gradient descent algorithm and analyze the computational convergence rate, showing that the composite gradient descent algorithm is guaranteed to converge at a geometric rate to the global minimizer of the TeraLasso objective function. Finally, we illustrate the TeraLasso using both simulation and experimental data from a meteorological dataset, showing that we can accurately estimate precision matrices and recover meaningful conditional dependency graphs from high dimensional complex datasets.

1 Introduction

The increasing availability of matrix and tensor-valued data with complex dependencies has fed the fields of statistics and machine learning. Examples of tensor-valued data include medical and radar imaging modalities, spatial and meteorological data collected from sensor networks and weather stations over time, and biological, neuroscience and spatial gene expression data aggregated over trials and time points. Learning useful structures from these large scale, complex and high-dimensional data in the low sample regime is an important task in statistical machine learning, biology and signal processing.

As the precision matrix (inverse covariance matrix) encodes interactions and, for tensor-valued Gaussian distributions, conditional independence relationships between and among variables, multivariate statistical models, such as the matrix normal model (Dawid, 1981), have been proposed for estimation of these matrices. However, the number of parameters of the precision matrix of a $K$-way data tensor $X\in\mathbb{R}^{d_{1}\times\dots\times d_{K}}$ grows as $\prod_{i=1}^{K}d_{i}^{2}$. Therefore, in high dimensions unstructured precision matrix estimation is impractical, requiring very large sample sizes. Undirected graphs are often used to describe high dimensional distributions. Under sparsity conditions, the graph can be estimated using $\ell_{1}$-penalization methods, such as the graphical Lasso (GLasso) (Friedman et al., 2008) and multiple (nodewise) regressions (Meinshausen et al., 2006). Under suitable conditions, such approaches yield consistent (and sparse) estimation in terms of graphical structure and fast convergence rates with respect to the operator and Frobenius norm for the covariance matrix and its inverse. However, many of the statistical models that have been considered still tend to be overly simplistic and not fully reflective of reality. For example, in neuroscience one must take into account temporal correlations as well as spatial correlations, which reflect the connectivity formed by the neural pathways. Yet the line of high dimensional statistical literature mentioned above has primarily focused on estimating linear or graphical models with i.i.d. samples. In the case of graphical models, the data matrix is usually assumed to have independent rows or columns that follow the same distribution. The independence assumptions substantially simplify mathematical derivations but they tend to be very restrictive. For instance, the cortical circuits can change over time due to activities such as motor learning, attention or visual stimulation. Such data typically has a complex structure that is organized by the experiment’s design, with one or more experimental factors varying according to a predefined pattern.

On the theoretical and methodological front, recent work demonstrated another regime where further reductions in the sample size are possible under additional structural assumptions on the conditional dependency graphs which arise naturally in the above mentioned contexts when handling data with complex dependencies. For example, the matrix-normal model as studied in Tsiligkaridis et al. (2013) and Zhou (2014) restricts the topology of the graph to tensor product graphs where the precision matrix corresponds to a Kronecker product representation. Moreover, Zhou (2014) showed that one can estimate the covariance and inverse covariance matrices well using only one instance from the matrix-variate normal distribution. Along the same lines, the Bigraphical Lasso framework was proposed to parsimoniously model conditional dependence relationships of matrix-variate data based on the Cartesian product of graphs (Kalaitzis et al., 2013) as opposed to the direct product graphs of the matrix-normal models above. These models naturally generalize to multilinear settings with more than two axes of structure as demonstrated in the present work. The present work addresses the problem of sparse modeling of a structured precision matrix for tensor-valued data; more precisely, we aim to estimate the structure and parameters for a class of Gaussian graphical models by restricting the topology to the class of Cartesian product graphs, with precision matrices represented by a Kronecker sum for data with complex dependencies.

Toward these goals, we will introduce the tensor graphical Lasso (TeraLasso) procedure for estimating sparse $K$ -way decomposable precision matrices. We will show that our concentration of measure analysis enables a significant reduction in the sample size requirement in order to estimate parameters and the associated conditional dependence graphs along different coordinates such as space, time and experimental conditions. We establish consistency for both the Bigraphical Lasso and Tensor graphical Lasso estimators and obtain optimal rates of convergence in the operator and Frobenius norm for estimating the associated precision matrix, and for structure recovery. Finally, we demonstrate using simulations and real data that the Kronecker sum precision model has excellent potential for improving computational scalability, structural interpretation, and its applications to classification, prediction, and visualization for complex data analysis.

A philosophical motivation of TeraLasso’s Kronecker sum (Cartesian graph) model is that it achieves the maximum entropy among all models for which the tensor component projections of the covariance matrix are fixed, see Section 3. A compelling justification for the proposed Kronecker sum model for the precision matrix is that similar models have been successfully used in other fields, including regularization of multivariate splines, design of physical networks, and decomposition of solutions of partial differential equations governing many physical processes. Additional discussion of these practical motivations for the model is in Section 1.3 below.

1.1 The multi-way Kronecker sum precision matrix model

We follow the notation and terminology of Kolda and Bader (2009) for modeling tensor-valued data arrays. Define the vector of component dimensions $\mathbf{p}=[d_{1},\dots,d_{K}]$ and let $p$ denote the product of these dimensions

$$p=\prod_{k=1}^{K}d_{k}\quad\text{and}\quad m_{k}=\prod_{i\neq k}d_{i}=\frac{p}{d_{k}}.$$

To simplify the multiway Kronecker notation, we define

$$I_{[d_{k:\ell}]}=\underbrace{I_{d_{k}}\otimes\dots\otimes I_{d_{\ell}}}_{\ell-k+1\ \text{factors}},$$

where $\otimes$ denotes the Kronecker (direct) product and $\ell\geq k$ . Using this notation, the $K$ -way Kronecker sum of matrix components $\{\Psi_{k}\}_{k=1}^{K}$ can be written as

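$$\Psi_{1}\oplus\dots\oplus\Psi_{K}=\sum_{k=1}^{K}I_{[d_{1:k-1}]}\otimes\Psi_{k}\otimes I_{[d_{k+1:K}]}.\tag{1}$$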

In the special case of $K=2$ this Kronecker sum representation reduces to the more familiar $\Psi_{1}\oplus\Psi_{2}=\Psi_{1}\otimes I_{d_{2}}+I_{d_{1}}\otimes\Psi_{2}$. The vectorization of a $K$-way tensor $X$ is denoted as $\mathrm{vec}(X)$ and is defined as in Kolda and Bader (2009). Likewise, we define the transpose of a $K$-way tensor $X^{T}\in\mathbb{R}^{d_{K}\times\dots\times d_{1}}$ analogously to the matrix transpose, i.e. $[X^{T}]_{i_{1},\dots,i_{K}}=X_{i_{K},\dots,i_{1}}$.
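The Kronecker sum construction can be checked numerically. The following is a minimal NumPy sketch, not code from the paper: the helper names `kron_sum` and `random_symmetric` and the factor sizes $d_{1}=2$, $d_{2}=3$ are illustrative assumptions. For $K=2$ it pairs $\Psi_{1}$ with $I_{d_{2}}$ and $\Psi_{2}$ with $I_{d_{1}}$, so that both terms are $d_{1}d_{2}\times d_{1}d_{2}$, and it also checks the standard fact that the eigenvalues of a Kronecker sum are the pairwise sums of the factor eigenvalues.

```python
import numpy as np

def kron_sum(factors):
    """K-way Kronecker sum: sum over k of I (x) ... (x) Psi_k (x) ... (x) I."""
    dims = [f.shape[0] for f in factors]
    p = int(np.prod(dims))
    omega = np.zeros((p, p))
    for k, psi in enumerate(factors):
        left = np.eye(int(np.prod(dims[:k])))        # identity of size d_1 * ... * d_{k-1}
        right = np.eye(int(np.prod(dims[k + 1:])))   # identity of size d_{k+1} * ... * d_K
        omega += np.kron(np.kron(left, psi), right)
    return omega

rng = np.random.default_rng(0)

def random_symmetric(d):
    # Hypothetical test factor: a random symmetric d x d matrix.
    a = rng.standard_normal((d, d))
    return (a + a.T) / 2

psi1, psi2 = random_symmetric(2), random_symmetric(3)
omega = kron_sum([psi1, psi2])

# K = 2 special case written out explicitly.
two_way = np.kron(psi1, np.eye(3)) + np.kron(np.eye(2), psi2)
print(np.allclose(omega, two_way))

# Eigenvalues of a Kronecker sum are all pairwise sums of factor eigenvalues.
pair_sums = np.add.outer(np.linalg.eigvalsh(psi1), np.linalg.eigvalsh(psi2)).ravel()
print(np.allclose(np.sort(pair_sums), np.linalg.eigvalsh(omega)))
```

Forming the dense $p\times p$ matrix is only practical for small $p$; the sketch is meant to make the index bookkeeping concrete rather than to be efficient.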

When the precision matrix $\Omega$ has a decomposition of the form (1), the Kronecker sum components $\{\Psi_{k}\}_{k=1}^{K}$ are sparse, and the $K$-way data $X$ has a multivariate Gaussian distribution, the sparsity pattern of $\Psi_{k}$ corresponds to a conditional independence graph across the $k$-th dimension of the data.

Figure 1 illustrates the Kronecker sum model proposed in (1) for $K=3$ and $d_{k}=4$ . Specifically, $\Psi_{k},k=1,2,3$ are identical $4\times 4$ tridiagonal precision matrices corresponding to a one dimensional autoregressive-1 (AR-1) process. In the Figure the precision matrix $\Omega=\Psi_{1}\oplus\Psi_{2}\oplus\Psi_{3}$ is shown on the left and covariance $\Sigma=\Omega^{-1}$ on the right.

The entries of each $\Psi_{k}$ are replicated $m_{k}=16$ times across $\Omega$ for each $k$ . This regular structure permits the aggregation of corresponding entries in the sample covariance matrix, resulting in variance reduction in estimating $\Omega$ . This Kronecker sum gives $\Omega$ a nonseparable and interlocking repeating block structure in the covariance matrix.
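The replication count can be reproduced with a small numerical experiment. This is a sketch under stated assumptions: the tridiagonal values (unit diagonal, off-diagonal $-0.4$) are hypothetical stand-ins for the AR-1 factors of Figure 1, and `kron_sum` is an illustrative helper. It traces a single entry of $\Psi_{1}$ through $\Omega$, counts the off-diagonal nonzeros of $\Omega$, and checks that $\Sigma=\Omega^{-1}$ is dense even though $\Omega$ is sparse.

```python
import numpy as np

def kron_sum(factors):
    """K-way Kronecker sum of square matrices (dense, for small examples only)."""
    dims = [f.shape[0] for f in factors]
    p = int(np.prod(dims))
    omega = np.zeros((p, p))
    for k, psi in enumerate(factors):
        left = np.eye(int(np.prod(dims[:k])))
        right = np.eye(int(np.prod(dims[k + 1:])))
        omega += np.kron(np.kron(left, psi), right)
    return omega

d, K = 4, 3
# Hypothetical tridiagonal factor; the numerical values are assumptions.
psi = np.eye(d) - 0.4 * (np.eye(d, k=1) + np.eye(d, k=-1))
omega = kron_sum([psi] * K)          # 64 x 64 precision matrix
sigma = np.linalg.inv(omega)         # covariance matrix

# Trace one entry of Psi_1: put a single 1 at position (0, 1) and zeros elsewhere.
probe = np.zeros((d, d))
probe[0, 1] = 1.0
zeros = np.zeros((d, d))
copies = int(np.count_nonzero(kron_sum([probe, zeros, zeros])))

offdiag_nonzeros = int(np.count_nonzero(omega - np.diag(np.diag(omega))))

print("copies of one factor entry in Omega:", copies)
print("off-diagonal nonzeros in Omega:", offdiag_nonzeros)
print("nonzeros in Sigma:", int(np.count_nonzero(sigma)), "of", sigma.size)
```

Each factor entry lands in $m_{k}=p/d_{k}$ positions of $\Omega$, which is what allows the corresponding sample covariance entries to be averaged.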

We propose the following sparse Kronecker sum estimator of the precision matrix $\Omega$ in (1), which we call the Tensor Graphical Lasso (TeraLasso). The TeraLasso minimizes the negative $\ell_{1}$ -penalized Gaussian log-likelihood function over the domain $\mathcal{K}_{\mathbf{p}}^{\sharp}$ of precision matrices $\Omega$ having Kronecker sum form:

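$$\widehat{\Omega}=\arg\min_{\Omega\in\mathcal{K}_{\mathbf{p}}^{\sharp},\ \|\Omega\|_{2}\leq\kappa}\left\{-\log|\Omega|+\langle S,\Omega\rangle+\sum_{k=1}^{K}m_{k}\sum_{i\neq j}g_{\rho_{k}}([\Psi_{k}]_{ij})\right\},$$

where

$$S=\frac{1}{n}\sum_{i=1}^{n}\mathrm{vec}(X_{i}^{T})\,\mathrm{vec}(X_{i}^{T})^{T},$$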

$g_{\rho}(t)$ is a sparsity-inducing regularization function parameterized by a regularization parameter $\rho$ , and

$$\mathcal{K}_{\mathbf{p}}^{\sharp}=\{A\succeq 0:\ A\in\mathcal{K}_{\mathbf{p}}\}\tag{4}$$

is the set of positive semidefinite matrices that are decomposable into a Kronecker sum of fixed factor dimensions $d_{1},\ldots,d_{K}$. In this paper we consider $(\mu,\gamma)$-amenable regularizers $g_{\rho}$ (Loh et al., 2017). These penalties include nonconvex regularizers such as SCAD and MCP, as well as the traditional $\ell_{1}$ regularizer $g_{\rho}(t)=\rho|t|$. The norm constraint $\|\Omega\|_{2}\leq\kappa$ is required for the solution to be well defined when $g_{\rho}$ is not a convex penalty.

Observe that sparsity in the off diagonal elements of $\Psi_{k}$ directly creates sparsity in $\Omega$ . As in the graphical Lasso, incorporating an $\ell_{1}$ -penalty over entries of $\Omega$ with the tensor-valued Gaussian or matrix-normal (pseudo)-loglikelihood promotes a sparse graphical structure in $\Omega$ ; see for example (Banerjee et al., 2008; Yuan and Lin, 2007; Zhou, 2014; Zhou et al., 2011). In this work, we allow for the more general case of nonconvex regularization functions $g_{\rho}$ as considered in Loh et al. (2017). While sometimes difficult to tune in practice, nonconvex regularization provides strong nonasymptotic guarantees on the elementwise estimation error of $\Omega$ , implying strong, single sample support recovery guarantees when the smallest nonzero element of $\Omega$ is bounded from below.

The contributions of this paper are as follows. The sparse multivariate-normal Bigraphical Lasso (BiGLasso) model is extended to the sparse tensor-variate ($K>2$) TeraLasso model, allowing the modeling of data with arbitrary tensor degree $K$. A new subgaussian concentration inequality (Corollary F.29 in the supplement) is presented that gives rates of statistical convergence (Theorems 1-3) of the TeraLasso estimator as well as the BiGLasso estimator, when the sample size is low (even equal to one). TeraLasso’s generalization of BiGLasso from 2-way to $K$-way decompositions is important as it expands the domain of application, allowing a data scientist to group variables into their natural domains along multiple tensor axes. For example, with a health data set that is collected over space, time, people and replicates, TeraLasso’s 3-way tensor decomposition (time $\times$ space $\times$ people) can account for possible dependency structure between people, while a 2-way BiGLasso or KLasso approach decomposing over (time $\times$ space) would unnecessarily enforce an assumption of independence between people. Alternatively, BiGLasso or KLasso could group two axes together (e.g. (time $\times$ space) $\times$ people); however, this would create a large, unstructured factor whose estimation would require many more replicates than the 3-way decomposition that TeraLasso uses to give each axis its own factor.
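The size of the "large, unstructured factor" can be illustrated with a simple parameter count. The axis sizes below (10 time points, 20 locations, 5 people) are hypothetical, and the count is a rough one: each symmetric $d\times d$ factor is charged $d(d+1)/2$ free entries, ignoring the small redundancy in the diagonal of a Kronecker sum.

```python
def symmetric_params(d):
    """Number of free entries in a symmetric d x d matrix."""
    return d * (d + 1) // 2

# Hypothetical axis sizes, chosen only for illustration.
dims = {"time": 10, "space": 20, "people": 5}
p = dims["time"] * dims["space"] * dims["people"]

unstructured = symmetric_params(p)
grouped_two_way = symmetric_params(dims["time"] * dims["space"]) + symmetric_params(dims["people"])
three_way = sum(symmetric_params(d) for d in dims.values())

print("unstructured precision matrix:", unstructured)
print("2-way with (time x space) grouped:", grouped_two_way)
print("3-way, one factor per axis:", three_way)
```

Under these assumed sizes the grouped 2-way model has roughly seventy times as many parameters as the 3-way model, which is the sense in which grouping axes demands more replicates.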

A highly scalable, first-order ISTA-based algorithm is proposed to minimize the TeraLasso objective function. We prove (Theorem H.42 in the supplement) that it converges to the global optimum with a geometric convergence rate, and demonstrate its practical advantages on high dimensional problems. As compared to the alternating block coordinate descent algorithm proposed by Kalaitzis et al. (2013) for the BiGLasso, the proposed ISTA algorithm enjoys a per-iteration computational speedup over BiGLasso of order $\Theta(p)$ . Our numerical results show that the BiGLasso algorithm often requires many more iterations to converge than our ISTA method. Numerical comparisons are presented demonstrating that TeraLasso significantly improves performance in small sample regimes. To demonstrate the application of TeraLasso to real world data we use it to estimate the precision matrix of spatio-temporal meteorological data collected by the National Center for Environmental Prediction (NCEP). Our results show that the TeraLasso precision matrix estimator degrades much more slowly than other estimators as one reduces the number of samples available to fit the model. The intuitive graphical structure, the robust eigenstructure and a maximum-entropy interpretation make the TeraLasso model a compelling choice for modeling tensor data, much as the Bigraphical Lasso provides a meaningful alternative to the matrix-normal model.

1.2 Relevant prior work

The use of tensor product models for multiway data has a long history. In the statistical context, directly fitting a Kronecker product to multiway data yields a first order approximation corresponding to fitting the mean (Kolda and Bader, 2009) when the fitting criterion is the Frobenius norm of the residuals. Many such methods involve low-rank factor decompositions including: PARAFAC and CANDECOMP as in Harshman and Lundy (1994); Faber et al. (2003); Tucker decomposition-based methods such as Tucker (1966) and Hoff (2016); and hybrid methods such as Johndrow et al. (2017). In contrast, second order methods have been used to approximate multiway structure of the covariance (Werner et al., 2008; Pouryazdian et al., 2016). Series decomposition methods have been proposed for fitting the covariance matrix in Frobenius norm using sums of Kronecker products (Tsiligkaridis and Hero, 2013; Greenewald and Hero, 2015; Rudelson and Zhou, 2017; Greenewald et al., 2017).

Kronecker product approximations to the inverse covariance have fitted matrix normal models (Allen and Tibshirani, 2010) and sparse matrix normal models (Leng and Tang, 2012; Zhou, 2014; Tsiligkaridis et al., 2013). In contrast to the Kronecker sum model (1) for the precision matrix $\Omega$ , the $K$ -way Kronecker product model is $\Omega=\Psi_{1}\otimes\ldots\otimes\Psi_{K}$ . The Kronecker product decomposition implies a separable property of the precision matrix across the $K$ data dimensions, which one might expect to become an increasingly restrictive condition as $K$ increases. In this paper we show that the proposed Kronecker sum model (1) can be a worthwhile alternative representation.

A two factor ($K=2$) sparse Kronecker sum model for the precision matrix $\Omega$ was introduced and studied in Kalaitzis et al. (2013). The model was fitted to the sample covariance matrix using an iterative procedure called BiGLasso, which required the diagonal entries of $\Omega$ to be known. Conditions guaranteeing convergence were not provided. Here we extend the BiGLasso model to arbitrary $K\geq 2$ and unknown diagonal entries of $\Omega$, provide a faster converging optimization algorithm, and obtain strong convergence guarantees and bounds on the convergence rate for all $K$, including $K=2$. For completeness, we also obtain (Appendix J of the supplement) bounds on the convergence rate for the known-diagonal setting of Kalaitzis et al. (2013).

The qualitative differences between the Kronecker product and Kronecker sum models for the precision matrix can be better appreciated by considering the product graphs that are induced by them. For given sparse Kronecker factors $\Psi_{1},\ldots,\Psi_{K}$, the Kronecker product model corresponds to the direct (tensor) product of the component graphs while the Kronecker sum model corresponds to the Cartesian product$^{1}$ of these components (Hammack et al., 2011). The direct product graph and Cartesian product graph differ greatly; the former has a number of edges equal to $\frac{1}{2}\prod_{k=1}^{K}(2|E_{k}|+|V_{k}|)-\prod_{k=1}^{K}|V_{k}|$, while the latter has a number of edges equal to $\sum_{k=1}^{K}|E_{k}|\prod_{i\neq k}|V_{i}|$, where $V_{i},E_{i}$ denote the node and edge sets of the $i$-th component graph.$^{2}$ To illustrate, if the number of non-zero entries of $\Psi_{k}$ is $cd_{k}$ for some $c$, the number of edges induced in the direct product graph by inserting a single new edge into the first component graph is equal to $\frac{1}{2}(2c+1)^{K}(p/d_{1})-p$, where we recall that $p=\prod_{k=1}^{K}d_{k}$ is the number of covariates (rows of $\Omega$). On the other hand, for the Cartesian product graph it is only $p/d_{1}$ regardless of $c$. Hence, as $c$ and $K$ increase, under the Kronecker product model a single edge in $\Psi_{1}$ can create a proliferation of edges, while the number of new edges in the Kronecker sum model is fixed, independent of $K$. A concrete example of these differences is illustrated in Figure 2. The qualitative differences between the Kronecker product and Kronecker sum models for the precision matrix are summarized in Table 1.

$^{1}$ The Cartesian product of two graphs $G_{1}=(V_{1},E_{1})$ and $G_{2}=(V_{2},E_{2})$ is a graph with vertices being the Cartesian product of $V_{1}$ and $V_{2}$, and with edges such that node $(u,u^{\prime})$ is adjacent to $(v,v^{\prime})$ if and only if either $u=v$ and $u^{\prime}$ is adjacent to $v^{\prime}$ in $G_{2}$, or $u^{\prime}=v^{\prime}$ and $u$ is adjacent to $v$ in $G_{1}$.

$^{2}$ The notation $|V_{i}|=d_{i}$ denotes the row dimension of $\Psi_{i}$ and $|E_{i}|$ denotes the number of non-zero upper triangular entries of $\Psi_{i}$.
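The contrast in edge growth can be explored numerically by counting nonzeros directly. In this sketch the component graphs are assumed to be paths on 4 nodes with $K=3$, the Kronecker product factors are assumed to have nonzero diagonals (as precision matrices do), and all function names are hypothetical. It verifies the Cartesian product edge count $\sum_{k}|E_{k}|\prod_{i\neq k}|V_{i}|$ and compares how many edges a single new edge in the first component graph induces in each model.

```python
import numpy as np
from functools import reduce

def path_adjacency(d):
    """Adjacency matrix of a path graph on d nodes."""
    return np.eye(d, k=1) + np.eye(d, k=-1)

def count_edges(pattern):
    """Number of undirected edges implied by a symmetric sparsity pattern."""
    off_diagonal = np.count_nonzero(pattern) - np.count_nonzero(np.diag(pattern))
    return off_diagonal // 2

def cartesian_edges(adjs):
    """Edges of the Cartesian product graph (Kronecker sum of adjacency matrices)."""
    dims = [a.shape[0] for a in adjs]
    p = int(np.prod(dims))
    total = np.zeros((p, p))
    for k, a in enumerate(adjs):
        left = np.eye(int(np.prod(dims[:k])))
        right = np.eye(int(np.prod(dims[k + 1:])))
        total += np.kron(np.kron(left, a), right)
    return count_edges(total)

def product_edges(adjs):
    """Edges induced by a Kronecker product of factors with nonzero diagonals."""
    patterns = [a + np.eye(a.shape[0]) for a in adjs]
    return count_edges(reduce(np.kron, patterns))

d, K = 4, 3
adjs = [path_adjacency(d) for _ in range(K)]
p = d ** K

# Closed-form count for the Cartesian product: sum_k |E_k| * prod_{i != k} |V_i|.
formula = sum(count_edges(a) * (p // a.shape[0]) for a in adjs)

# Insert one new edge (0, 2) into the first component graph.
extra = path_adjacency(d)
extra[0, 2] = extra[2, 0] = 1.0
adjs_plus = [extra] + adjs[1:]

cart_added = cartesian_edges(adjs_plus) - cartesian_edges(adjs)
prod_added = product_edges(adjs_plus) - product_edges(adjs)
print("Cartesian product edges:", cartesian_edges(adjs), "formula:", formula)
print("edges added, Cartesian product:", cart_added)
print("edges added, Kronecker product:", prod_added)
```

In the Cartesian product the new edge is simply copied once per configuration of the other axes, giving $p/d_{1}$ new edges, whereas in the Kronecker product it combines with every nonzero of the remaining factors.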

1.3 Rationale for the proposed multiway Kronecker sum model

This paper develops a scalable, fast and accurate estimation procedure, the TeraLasso, for multiway precision matrices $\Omega$ using higher order Kronecker sum models. To justify the practical utility of the TeraLasso we illustrate it on a spatio-temporal meteorological dataset. We have also applied it to other applications not presented here. While comprehensive validation of the model on a larger corpus of real data is beyond the scope of this paper, there is ample evidence that the model will have many statistical applications. We base this assessment on the wide use of Kronecker sum models, equivalently Cartesian product graph models, in biology, physics, social sciences, and network engineering, among other fields (Imrich et al., 2008; Van Loan, 2000). In particular, the Kronecker sum arises in solving the celebrated Sylvester equation for a matrix $X$ which, for $K=2$, takes the form $XA+BX=N$. The Sylvester equation can be solved by expressing the equation in vectorized form as $(A\oplus B)\,\mathrm{vec}(X)=\mathrm{vec}(N)$ (for arbitrary $K$ this becomes the tensor Sylvester equation $(A_{1}\oplus\dots\oplus A_{K})\mathrm{vec}(X)=\mathrm{vec}(N)$), but this is often impractical in high dimension. Such equations result from the discretization of separable $K$-dimensional PDEs with tensorized finite elements (Grasedyck, 2004; Kressner and Tobler, 2010; Beckermann et al., 2013; Shi et al., 2013; Ellner et al., 1986). As a result, Kronecker sums come up in many areas of applied math, including beam propagation physics (Andrianov, 1997); control theory (Luenberger, 1966; Chapman et al., 2014); fluid dynamics (Dorr, 1970); and spatio-temporal neural processes (Schmitt et al., 2001).
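The vectorized form of the Sylvester equation can be verified on a small example. The sketch below makes several assumptions that are not in the text: $A$ and $B$ are taken symmetric positive definite (so that $A^{T}=A$ and the Kronecker sum is invertible), $\mathrm{vec}$ is column-major, and the sizes $m=3$, $n=4$ are arbitrary. It solves the $mn\times mn$ linear system and checks the residual of $XA+BX=N$.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_spd(d):
    # Hypothetical symmetric positive definite test matrix.
    a = rng.standard_normal((d, d))
    return a @ a.T + d * np.eye(d)

m, n = 3, 4
B = random_spd(m)                    # acts on the rows of X
A = random_spd(n)                    # acts on the columns of X
N = rng.standard_normal((m, n))

# Kronecker sum A (+) B = A (x) I_m + I_n (x) B, acting on column-major vec(X).
kron_sum_AB = np.kron(A, np.eye(m)) + np.kron(np.eye(n), B)

x = np.linalg.solve(kron_sum_AB, N.flatten(order="F"))
X = x.reshape((m, n), order="F")

residual = np.linalg.norm(X @ A + B @ X - N)
print("residual of XA + BX = N:", residual)
```

The dense system has $(mn)^{2}$ entries, which illustrates why the vectorized route becomes impractical as the dimensions grow.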

Closer to home, the Kronecker sum model arises in multivariate spline data analysis, e.g. as applied to harmonic analysis on graphs (Kotzagiannidis and Dragotti, 2017). More recently, Fey et al. (2018) have proposed tensor B-splines defined over a Cartesian product basis for geometric Convolutional Neural Networks (CNNs). Kronecker sums have been proposed as precision matrices for weighting the quadratic regularizer in smoothed multivariate spline regression. In particular, Wood (2006) observed that, as compared to the Kronecker product, the Kronecker sum reduces the coupling between the axes when used as a spline smoothing penalty for generalized additive mixed model regression. This observation motivated Wood (2006) and Eilers and Marx (2003) to use the inverse of a Kronecker sum matrix as a penalty, or prior, for smoothing $K$-dimensional regressions (see also work by Lee and Durbán (2011) and Wood et al. (2016)). This approach has been applied to spatio-temporal forest health modeling (for which $K=3$) (Augustin et al., 2009), brain development modeling (Holland et al., 2014), and analysis of the impact of climate and weather on spatio-temporal patterns of beetle populations (Preisler et al., 2012), among other applications. In these spline regression problems the Kronecker sum appears as a precision matrix parameterizing a Gaussian prior on the spline coefficient vector $\beta$, where the prior is of the form $p(\beta)\propto\exp(-\beta^{T}(\lambda_{1}S_{1}\oplus\dots\oplus\lambda_{K}S_{K})\beta/2)$. Here, $\lambda_{i}$ are regularization coefficients and $S_{i}$ are coordinate-wise smoothing matrices, $i=1,\ldots,K$.

Instead of using the Kronecker sum to model the a priori precision matrix of a set of spline parameters, this paper proposes the Kronecker sum as a model for the precision matrix of the multiway data in the likelihood function, where the data matrix $X$ takes the place of the spline coefficient vector $\beta$. The stated advantages of the Kronecker sum model for the spline regression setting (Wood, 2006) can be expected to carry over to the precision matrix estimation setting of TeraLasso. In particular, like the spline regression prior, the TeraLasso smooths each axis separately, while summing over the others, thereby reducing coupling between the tensor axes as compared to the Kronecker product. For data that has structure similar to that imposed by Wood (2006) on the spline regression coefficients this should result in a more accurate fit. Indeed, if a population of regression spline problems were available, in principle one could apply the TeraLasso to estimating the best precision matrix of the spline coefficients that would minimize the population-averaged fitting error.

Outline. The remainder of the paper is organized as follows. We introduce notation and some preliminary results in Section 2, and our proposed TeraLasso model in Section 3. High dimensional consistency results are presented in Section 4, first with convex $\ell_{1}$ regularizers and then with non-convex sparsity regularizers. The first order ISTA optimization algorithm is described in Section 5, and conditions are specified for which the algorithm converges geometrically to the global optimum. Finally, Sections 6 and 7 illustrate the proposed TeraLasso estimator on simulated and real data, with Section 8 concluding the paper. We place all technical proofs in the supplementary material, along with additional experiments and further exploration of the properties and implications of the Kronecker sum subspace $\mathcal{K}_{\mathbf{p}}$ and the associated identifiable parameterization.

2 Notation and Preliminaries

We use upper case letters, e.g. $A$, for matrices and tensors, bold lower case, e.g. $\mathbf{a}$, for vectors, and denote the $(i,j)$ element of a matrix $A$ as $A_{ij}$ and the $(i_{1},i_{2},\dots,i_{K})$ element of a tensor $A$ as $A_{i_{1},i_{2},\dots,i_{K}}$. Fibers are the higher-order analogue of matrix rows and columns. A fiber of a tensor is obtained by fixing every index but one; the mode-$k$ fiber of tensor $X$ is denoted as the column vector $X_{i_{1},\dots,i_{k-1},:,i_{k+1},\dots,i_{K}}$. Following the definition of Kolda and Bader (2009), the tensor unfolding or matricization of $X$ along the $k$th mode is denoted as $X_{(k)}$, formed by arranging the mode-$k$ fibers as columns of the resulting matrix of dimension $d_{k}\times m_{k}$. The column ordering is not important so long as it is consistent.
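A mode-$k$ unfolding is easy to write down with array axis operations. The following NumPy sketch uses a hypothetical helper `unfold` with a zero-based mode index; its column ordering differs from that of Kolda and Bader (2009), which is acceptable here since only consistency of the ordering matters.

```python
import numpy as np

def unfold(X, k):
    """Mode-k unfolding (k is zero-based): columns are the mode-k fibers of X."""
    return np.moveaxis(X, k, 0).reshape(X.shape[k], -1)

X = np.arange(24).reshape(2, 3, 4)   # a small 3-way tensor with d = (2, 3, 4)
X1 = unfold(X, 1)                    # shape (d_k, m_k) = (3, 8)

print(X1.shape)
# The first column is the fiber obtained by fixing every index except the second.
print(np.array_equal(X1[:, 0], X[0, :, 0]))
```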

For a vector $y=(y_{1},\ldots,y_{p})$ in $\mathbb{R}^{p}$ , denote by $\left\lVert y\right\rVert_{2}=\sqrt{\sum_{j}y_{j}^{2}}$ the Euclidean norm of $y$ . The operator and Frobenius norms of a matrix $A$ are denoted as $\left\lVert A\right\rVert_{2}$ and $\left\lVert A\right\rVert_{F}$ respectively; the notation $\mathrm{vec}(A)$ denotes the vectorization of the matrix $A$ ; $\|A\|_{\infty}$ denotes the matrix infinity norm and $\|A\|_{\max}=\max_{ij}|A_{ij}|$ denotes the max norm. The determinant is denoted as $\left\lvert A\right\rvert$ . We use the inner product $\;\langle{\,A,B\,}\rangle\;={\rm tr}(A^{T}B)$ throughout. Define the set of $p\times p$ matrices with Kronecker sum structure of fixed dimensions $d_{1},\ldots,d_{K}$ :

[TABLE]

where the set of matrices defined in (4) is obtained by restricting $\mathcal{K}_{\mathbf{p}}$ to the positive cone, i.e.,

[TABLE]

Note that the set $\mathcal{K}_{\mathbf{p}}$ (5) is linearly spanned by the $K$ components, since there are no nonlinear interactions between any of the parameters. Thus $\mathcal{K}_{\mathbf{p}}$ is a linear subspace of $\mathbb{R}^{p\times p}$ , and we can define a unique projection operator onto $\mathcal{K}_{\mathbf{p}}$ :

[TABLE]

A closed-form expression for $\mathrm{Proj}_{\mathcal{K}_{\mathbf{p}}}(A)$ is given in Section I.3 of the supplementary material. Note that the dimensionality of the $\mathcal{K}_{\mathbf{p}}$ subspace is $1-K+\sum_{k=1}^{K}d_{k}^{2}$ , which is often significantly smaller than the ambient dimension $p^{2}=\prod_{k=1}^{K}d_{k}^{2}$ .
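The projection can be sketched numerically as follows. This is a minimal sketch, not the supplement's derivation: it assumes the standard Kronecker sum $\Psi_{1}\oplus\dots\oplus\Psi_{K}=\sum_{k}I\otimes\Psi_{k}\otimes I$, averages $A$ over the $m_{k}$ repeated positions of each factor, and splits the shared trace equally among the factors (one of many equivalent conventions). The function names are hypothetical.

```python
import numpy as np

def kron_sum(factors):
    """Kronecker sum: sum_k I (x) Psi_k (x) I, with Psi_k in position k."""
    dims = [F.shape[0] for F in factors]
    p = int(np.prod(dims))
    out = np.zeros((p, p))
    for k, F in enumerate(factors):
        left = np.eye(int(np.prod(dims[:k])))
        right = np.eye(int(np.prod(dims[k + 1:])))
        out += np.kron(np.kron(left, F), right)
    return out

def factor_average(A, dims, k):
    """Average A over the m_k positions at which factor k is repeated."""
    K, m = len(dims), A.shape[0] // dims[k]
    T = np.moveaxis(A.reshape(*dims, *dims), (k, K + k), (0, 1))
    return np.trace(T.reshape(dims[k], dims[k], m, m), axis1=2, axis2=3) / m

def project_factors(A, dims):
    """Factors B_k such that B_1 (+) ... (+) B_K is the projection of A."""
    K, p = len(dims), A.shape[0]
    shift = (K - 1) / K * np.trace(A) / p   # equal split of the shared trace
    return [factor_average(A, dims, k) - shift * np.eye(d)
            for k, d in enumerate(dims)]

def project(A, dims):
    return kron_sum(project_factors(A, dims))

dims = (2, 3, 2)
subspace_dim = 1 - len(dims) + sum(d * d for d in dims)   # dimension of K_p
ambient_dim = int(np.prod(dims)) ** 2                      # p^2
```

The trace split is arbitrary because of the identifiability issue discussed next; the projected matrix itself does not depend on it.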

Parameterization of $\mathcal{K}_{\mathbf{p}}$ by $\Psi_{k}$. Note that ${\Omega}=\Psi_{1}\oplus\dots\oplus\Psi_{K}$ does not uniquely determine $\{{\Psi}_{k}\}_{k=1}^{K}$, i.e., without further constraints the Kronecker sum parameterization is not fully identifiable. It is easy to verify, however, that both $\mathrm{offd}(\Psi_{k})$ and $\mathrm{diag}(\Omega)$ are identifiable, where we define the notation $\mathrm{offd}(M)={M}-\mathrm{diag}({M})$. We can then write the identifiable decomposition

[TABLE]

and correspondingly $\Omega_{0}=\mathrm{diag}(\Omega_{0})+\mathrm{offd}({\Psi}_{0,1})\oplus\dots\oplus\mathrm{offd}({\Psi}_{0,K})$ . Note that while the offdiagonal factors can take on any values, $\mathrm{diag}(\Omega_{0})$ is not completely free (for a fully orthogonal parameterization see Section D of the supplement).
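The lack of identifiability is easy to see numerically for $K=2$: shifting a multiple of the identity from one factor to the other leaves $\Omega$ unchanged, while the off-diagonal parts of the factors are untouched. The sketch below assumes the standard two-factor Kronecker sum $\Psi_{1}\otimes I+I\otimes\Psi_{2}$; the matrices and the constant `c` are arbitrary illustrative values.

```python
import numpy as np

def kron_sum2(Psi1, Psi2):
    """Two-factor Kronecker sum Psi1 (x) I + I (x) Psi2."""
    d1, d2 = Psi1.shape[0], Psi2.shape[0]
    return np.kron(Psi1, np.eye(d2)) + np.kron(np.eye(d1), Psi2)

def offd(M):
    """Off-diagonal part, offd(M) = M - diag(M)."""
    return M - np.diag(np.diag(M))

rng = np.random.default_rng(0)
Psi1 = rng.standard_normal((3, 3)); Psi1 = Psi1 + Psi1.T
Psi2 = rng.standard_normal((4, 4)); Psi2 = Psi2 + Psi2.T

c = 0.7   # arbitrary shift moved between the two diagonals
Omega_a = kron_sum2(Psi1, Psi2)
Omega_b = kron_sum2(Psi1 + c * np.eye(3), Psi2 - c * np.eye(4))

same_omega = np.allclose(Omega_a, Omega_b)
same_offd = np.allclose(offd(Psi1), offd(Psi1 + c * np.eye(3)))
```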

Interpretation of correlation coefficients. The quantities $\frac{[\Psi_{k}]_{ij}}{\sqrt{[\Psi_{k}]_{ii}[\Psi_{k}]_{jj}}}$ do not by themselves correspond to correlation coefficients. Due to the repeating structure of the Kronecker sum each element $[\Psi_{k}]_{ij}$ will appear in $m_{k}$ distinct $d_{k}\times d_{k}$ symmetric subblocks of $\Omega$ , and in each ( $\ell$ th) subblock it will have a correlation coefficient uniquely defined for that subblock:

[TABLE]

where $c_{\ell}={\rm tr}(\ell\text{th subblock of }\Omega)-{\rm tr}(\Psi_{k})$. The overall correlation structure is preserved across the $m_{k}$ blocks; only the strength of the correlations is modulated by the contributions of the other $K-1$ additive factors in the block. (Recall that the $\Psi_{k}$ need not be positive definite and $c_{\ell}$ need not be $>0$.)

3 Models and Methods

Let $X_{1},\ldots,X_{n}$ be $n$ independent realizations of the $K$ -way tensor $X$ . Define $\mathbf{x}_{i}=\rm{vec}\left\{\,{X}_{i}^{T}\,\right\}$ for all $i=1,\ldots,n$ . Define ${\widehat{S}}=\frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_{i}\mathbf{x}_{i}^{T}$ as the sample covariance. The mode- $k$ Gram matrix $S_{k}$ and factor-wise covariance $\Sigma^{(k)}=\mathbb{E}[S_{k}]$ are given by

[TABLE]

noting that the elements of these matrices are effectively inner products between $(K-1)$ order tensors. $S_{k}$ is the sample covariance of the data unfolded across the $k$ th tensor axis, while $\Sigma^{(k)}$ denotes the population covariance matrix along the same axis. These Gram matrices $S_{k}$ can be represented as elementwise aggregations over entries in the full sample covariance (3), with locations indexed by $\Psi_{k,i,j}$ as:

[TABLE]

In tensor covariance modeling when the dimension $p$ is much larger than the number of samples $n$ , the Gram matrices $S_{k}$ are often used to model the rows and columns separately, notably in the matrix-variate estimation methods of Zhou (2014) and Kalaitzis et al. (2013). Observe that the TeraLasso estimator (2) of the precision matrix can be expressed as

[TABLE]

where $\mathcal{K}_{\mathbf{p}}^{\sharp}$ is the set of positive semidefinite Kronecker sum matrices (4).
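The reduction from the full sample covariance to the Gram matrices can be checked numerically. The sketch below assumes the normalization $S_{k}=\frac{1}{nm_{k}}\sum_{i}X_{i,(k)}X_{i,(k)}^{T}$ and a row-major vectorization (first mode varying slowest), under which the identity $\langle\widehat{S},\Omega\rangle=\sum_{k}m_{k}\langle S_{k},\Psi_{k}\rangle$ holds; both are assumptions of this illustration, as are the helper names.

```python
import numpy as np

def kron_sum(factors):
    """Kronecker sum: sum_k I (x) Psi_k (x) I, with Psi_k in position k."""
    dims = [F.shape[0] for F in factors]
    out = np.zeros((int(np.prod(dims)),) * 2)
    for k, F in enumerate(factors):
        left = np.eye(int(np.prod(dims[:k])))
        right = np.eye(int(np.prod(dims[k + 1:])))
        out += np.kron(np.kron(left, F), right)
    return out

def gram(X, k):
    """Mode-k Gram matrix of n stacked tensors X with shape (n, d_1, ..., d_K)."""
    n = X.shape[0]
    Xk = np.moveaxis(X, k + 1, 1).reshape(n, X.shape[k + 1], -1)  # (n, d_k, m_k)
    return np.einsum('nim,njm->ij', Xk, Xk) / (n * Xk.shape[2])

rng = np.random.default_rng(0)
dims, n = (2, 3, 4), 5
p = int(np.prod(dims))
X = rng.standard_normal((n, *dims))

x = X.reshape(n, p)          # row-major vectorization of each sample
S_hat = x.T @ x / n          # full p x p sample covariance

Psi = []
for d in dims:
    B = rng.standard_normal((d, d))
    Psi.append(B + B.T)
Omega = kron_sum(Psi)

lhs = np.sum(S_hat * Omega)                                   # <S_hat, Omega>
rhs = sum((p // d) * np.sum(gram(X, k) * Psi[k]) for k, d in enumerate(dims))
```

Only the $K$ small matrices $S_{k}$ are therefore needed to evaluate the data-dependent term of the objective, never the $p\times p$ matrix $\widehat{S}$.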

Ignoring regularization, the objective function in curly brackets can be written as $-\log p(\widehat{S}|\Omega)$ where $p(\widehat{S}|\Omega)=\alpha_{\Omega}\prod_{k=1}^{K}p(S_{k}|\Psi_{k})$ and $p(S_{k}|\Psi_{k})=\exp\left(-\langle m_{k}S_{k},\Psi_{k}\rangle\right)$, with $\alpha_{\Omega}$ a normalizing constant. The non-negativity of the Kullback-Leibler divergence $\int p(S|\Omega)\log\left(\frac{p(S|\Omega)}{\alpha_{\Omega}\prod_{k=1}^{K}p(S_{k}|\Psi_{k})}\right)dS$ implies that the Kronecker sum model is a maximum entropy model, as previously pointed out for the case of $K=2$ by Kalaitzis et al. (2013). Alternatively, Kronecker sum models can be characterized as regularizing the precision matrix estimation problem with a minimally informative prior over the set $\mathcal{K}_{\mathbf{p}}^{\sharp}$.

The class of Kronecker sum matrices is a highly structured, lower-dimensional subspace of $\mathbb{R}^{p\times p}$ . By definition of the Kronecker sum (1), each entry of $\Psi_{k}$ appears in $m_{k}=p/d_{k}$ entries of $\Omega$ . By imposing that the precision matrix have both Kronecker sum structure and sparse structure through the penalty $g_{\rho}$ , TeraLasso is able to effectively regularize the precision estimation problem.

We assume the penalty $g_{\rho}$ is $(\mu,\gamma)$ -amenable in the sense of Loh et al. (2017).

Definition 1 ( $(\mu,\gamma)$ amenable regularizer)

A regularizer $g_{\rho}(t)$ is $(\mu,\gamma)$-amenable for $\mu\geq 0$ and $\gamma\in(0,\infty)$ if

1. $g_{\rho}$ is symmetric around zero and $g_{\rho}(0)=0$.

2. $g_{\rho}(t)$ is nondecreasing and $g_{\rho}(t)/t$ is nonincreasing on $\mathbb{R}^{+}$.

3. $g_{\rho}(t)$ is differentiable for all $t\neq 0$.

4. The function $g_{\rho}(t)+\frac{\mu}{2}t^{2}$ is convex.

5. $\lim_{t\rightarrow 0^{+}}g^{\prime}_{\rho}(t)=\rho$.

6. $g^{\prime}_{\rho}(t)=0$ for all $t\geq\gamma\rho$.

Note that the $\ell_{1}$ regularizer is $(0,\infty)$-amenable. Example nonconvex penalties in this class include the SCAD penalty (Fan and Li, 2001) and the MCP penalty (Zhang et al., 2010), both defined in Appendix K of the supplement.
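To make the amenability conditions concrete, the sketch below evaluates the MCP penalty in its commonly used form, $g_{\rho}(t)=\rho|t|-t^{2}/(2b)$ for $|t|\leq b\rho$ and $b\rho^{2}/2$ otherwise. The parameter name `b` and this exact parameterization are assumptions here (the paper's own definition is in the supplement); under it the penalty would be amenable with $\mu=1/b$ and $\gamma=b$.

```python
import numpy as np

def mcp(t, rho, b):
    """MCP penalty in a common parameterization (assumed, not from the paper)."""
    a = np.abs(t)
    return np.where(a <= b * rho, rho * a - a ** 2 / (2 * b), b * rho ** 2 / 2)

def mcp_grad(t, rho, b):
    """Derivative of the MCP penalty for t != 0."""
    return np.sign(t) * np.maximum(rho - np.abs(t) / b, 0.0)

rho, b = 0.5, 3.0
mu, gamma = 1.0 / b, b

t = np.linspace(-4.0, 4.0, 801)
convexified = mcp(t, rho, b) + 0.5 * mu * t ** 2       # condition 4
second_diff = np.diff(convexified, 2)                  # >= 0 for a convex function

slope_at_zero = mcp_grad(1e-9, rho, b)                 # condition 5: close to rho
slope_past_knot = mcp_grad(gamma * rho + 0.1, rho, b)  # condition 6: exactly 0
```

The flat tail of the penalty is what removes the shrinkage bias on large entries, at the cost of the curvature $\mu$ that must be compensated by the smooth part of the objective.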

Observe that for nonzero $\mu$ (i.e. nonconvex $g_{\rho}$) the constraint on the spectral norm of $\Omega$ ($\|\Omega\|_{2}\leq\kappa$) in the TeraLasso objective function (8) is necessary since without it a global minimum may not exist (Loh et al., 2017). For spectral norm constraint parameter set to $\kappa=\sqrt{2/\mu}$, we show (Lemma G.34 in the supplement) that (8) with $(\mu,\gamma)$-amenable $g_{\rho}$ is convex and has a unique global minimizer. For the $\ell_{1}$ penalty, the objective is always convex and $\kappa$ can be set to infinity.

4 High Dimensional Consistency of the TeraLasso

Let $\mathbf{v}=[v_{1},\dots,v_{p}]^{T}$ be an isotropic $\psi_{2}$ -subgaussian random vector with independent entries $v_{j}$ satisfying ${\mathbb{E}}v_{j}=0$ , $1={\mathbb{E}}v_{j}^{2}\leq\left\lVert v_{j}\right\rVert_{\psi_{2}}\leq K$ . The $\psi_{2}$ condition on a scalar random variable $V$ is equivalent to subgaussian decay of the tails of $V$ , implying $\mathbb{P}\left(|V|>t\right)\leq 2\exp(-t^{2}/c^{2})\;\text{for all}\;t>0$ . The extension to random vectors is straightforward. Specifically, $\mathbf{x}$ is a subgaussian random vector with positive definite covariance $\Sigma\in\mathbb{R}^{p\times p}$ when

[TABLE]

where $\Sigma^{1/2}$ denotes a positive definite square root factor of $\Sigma$. We then call ${X}\in\mathbb{R}^{d_{1}\times d_{2}\times\dots\times d_{K}}$ an order-$K$ subgaussian random tensor with covariance $\Sigma$ when $\mathbf{x}=\mathrm{vec}({X}^{T})$ is a subgaussian random vector in $\mathbb{R}^{p}$ defined as in (9).

We assume the data $X_{1},X_{2},\ldots,X_{n}$ are independent and identically distributed subgaussian random tensors whose inverse covariance follows the Kronecker sum model (1), namely, that $\mathrm{vec}({X_{i}}^{T})\sim\mathbf{x}$ , where $\mathbf{x}$ is a subgaussian random vector in $\mathbb{R}^{p}$ as defined in (9). A special case of the subgaussian model is the Gaussian model, for which the zeros in the precision matrix define the conditional independencies among the variables $X_{i}$ . This conditional independence relation does not hold for the general subgaussian case, but nonetheless strong convergence of the TeraLasso precision matrix estimator is preserved.

In addition to the subgaussian generative model given above, we make the following technical assumptions on the true model, guaranteeing sparsity in $\Omega$ and its eigenvalues being bounded away from zero and infinity.

(A1)

Define the support set of the $k$ th Kronecker sum component $\Psi_{k}$ of the precision matrix by $\mathcal{S}_{k}=\{(i,j):i\neq j,[\Psi_{k}]_{ij}\neq 0\}$ for $k=1,\dots,K$ . We assume $\mathcal{S}_{k}$ is sparse, i.e. $\mathrm{card}(\mathcal{S}_{k})\leq s_{k}$ .

(A2)

The minimal eigenvalue satisfies $\phi_{\min}({\Omega})=\sum_{k=1}^{K}\phi_{\min}({\Psi}_{k})\geq\underline{k}_{\Omega}>0$ , and the maximum eigenvalue satisfies $\phi_{\max}({\Omega})=\sum_{k=1}^{K}\phi_{\max}({\Psi}_{k})\leq\overline{k}_{\Omega}<\infty$ .

Defining the support set of $\Omega$ as $\mathcal{S}=\{(i,j):i\neq j,\Omega_{ij}\neq 0\}$, (A1) implies $\mathrm{card}(\mathcal{S})\leq s=\sum_{k=1}^{K}m_{k}s_{k}$.
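Both structural facts used in (A1) and (A2) can be verified on a small $K=2$ example: the extreme eigenvalues of a Kronecker sum are sums of the factors' extreme eigenvalues, and each off-diagonal nonzero of $\Psi_{k}$ is repeated $m_{k}$ times in $\Omega$. The factor matrices below are arbitrary illustrative choices.

```python
import numpy as np

Psi1 = np.array([[2.0, -0.5, 0.0],
                 [-0.5, 2.0, 0.0],
                 [0.0, 0.0, 2.0]])
Psi2 = 1.5 * np.eye(4) - 0.3 * (np.eye(4, k=1) + np.eye(4, k=-1))  # tridiagonal

d1, d2 = Psi1.shape[0], Psi2.shape[0]
Omega = np.kron(Psi1, np.eye(d2)) + np.kron(np.eye(d1), Psi2)

def offdiag_nnz(M):
    """Number of nonzero off-diagonal entries (ordered pairs i != j)."""
    return int(np.count_nonzero(M - np.diag(np.diag(M))))

# (A2): extreme eigenvalues add across factors.
min_eig_omega = np.linalg.eigvalsh(Omega).min()
min_eig_sum = np.linalg.eigvalsh(Psi1).min() + np.linalg.eigvalsh(Psi2).min()
max_eig_omega = np.linalg.eigvalsh(Omega).max()
max_eig_sum = np.linalg.eigvalsh(Psi1).max() + np.linalg.eigvalsh(Psi2).max()

# (A1): card(S) = sum_k m_k * s_k, with m_1 = d2 and m_2 = d1 here.
support_omega = offdiag_nnz(Omega)
support_bound = d2 * offdiag_nnz(Psi1) + d1 * offdiag_nnz(Psi2)
```

In this example the bound is attained with equality, since the off-diagonal supports contributed by different factors do not overlap.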

4.1 Regularization with $\ell_{1}$ penalty

With $g_{\rho}(t)=\rho|t|$ , the constraint on $\|\Omega\|_{2}$ is unnecessary, and (8) becomes

[TABLE]

where $\left|\Psi_{k}\right|_{1,{\rm off}}=\sum_{i\not=j}\left\lvert[\Psi_{k}]_{ij}\right\rvert$ is the off diagonal $\ell_{1}$ norm. The objective (10) is jointly convex, and its minimization over $\Omega\in\mathcal{K}_{\mathbf{p}}^{\sharp}$ has a unique solution (see Section B.6 of the supplement). We require an additional assumption

(A3)

The sample size $n$ and the component dimensions $d_{k}$ satisfy the following condition:

[TABLE]

where $m_{k}=p/d_{k}$ and $\kappa(\Sigma_{0})=\phi_{\max}(\Sigma_{0})/\phi_{\min}(\Sigma_{0})$ is the condition number of $\Sigma_{0}$ .

Note this assumption holds for $n=1$ and sufficiently large $(\min_{k}m_{k})^{2}>O({p})$ , which can hold for any $K>2$ . We obtain the following bounds on the Frobenius and operator norm error of the TeraLasso estimator (10). The constants ( $c,C_{1},C_{2},C_{3}$ ) are given in the proof (see the supplement), and do not depend on $K$ , $n$ , $s$ , or $\mathbf{p}$ .

Theorem 1 (Frobenius error bound)

Suppose the assumptions (A1)-(A3) hold, and that $\widehat{\Omega}$ is the minimizer of (10) with $\rho_{k}\asymp\frac{1}{\underline{k}_{\Omega}}\sqrt{\frac{\log p}{nm_{k}}}$ . Then with probability at least $1-2(K+1)\exp(-c\log p)$

[TABLE]

Theorem 2 (Factorwise and L2 error bounds)

Suppose the conditions of Theorem 1 hold. Then with probability at least $1-2(K+1)\exp(-c\log p)$ ,

[TABLE]

and as a result

[TABLE]

Theorems 1 and 2 are proved in Section E of the supplement. Observe that Theorem 2 predicts that, for fixed $n$ and $K>2$, the estimation error of the parameters of $\Omega$ converges to zero as the dimensions $\{d_{k}\}$ go to infinity (recall that $p=\prod_{k=1}^{K}d_{k}$). This implies that for increasing dimensions the TeraLasso will converge even for a single sample $n=1$. Due to the repeating structure and increasing dimension of $\Omega$, the parameter estimates can converge without the overall Frobenius error $\|\widehat{\Omega}-\Omega_{0}\|_{F}$ converging.

Comparison to GLasso. The Frobenius norm bound in Theorem 1 improves on the subgaussian GLasso rate of Rothman et al. (2008); Zhou et al. (2011) by a factor of $\min_{k}m_{k}$ . If the dimensions are equal ( $d_{k}=p^{1/K}$ and $s_{k}$ are constant over $k$ ) and $K$ is fixed, Theorem 2 implies $\|\Delta_{k}\|_{F}=O_{p}\left(\sqrt{\frac{(d_{k}+s_{k})\log p}{m_{k}n}}\right)$ , indicating that TeraLasso with $n$ replicates estimates the identifiable representation of $\Psi_{k}$ with an error rate equivalent to that of GLasso with $\Omega=\Psi_{k}$ and $nm_{k}$ available replicates.

Independence along an axis. Suppose that the data tensor $X$ is i.i.d. along the first axis, i.e. $\Psi_{1}=I_{d_{1}}$. Then instead of a $K$-way TeraLasso, a $(K-1)$-way model with $nd_{1}$ replicates would suffice, yielding a factorwise error bound (Theorem 2) of $O\left(\sqrt{\left(1+\sum_{k=2}^{K}\frac{s_{k}}{d_{k}}\right)\frac{\log(p/d_{1})}{nd_{1}\min_{k>1}(m_{k}/d_{1})}}\right)$, as compared to the factorwise error bound of $O\left(\sqrt{\left(1+\sum_{k=2}^{K}\frac{s_{k}}{d_{k}}\right)\frac{\log(p)}{n\min_{k}m_{k}}}\right)$ associated with the full $K$-way model (since $s_{1}=0$). Hence having a priori knowledge of independence (allowing the use of the $(K-1)$-way model) does not meaningfully improve the rate over the original $K$-way model so long as $\min_{k>1}m_{k}\approx\min_{k}m_{k}$. A similar result holds for the Frobenius error bound in Theorem 1.

4.2 Nonconvex Regularizers and Single Sample Support Recovery

Nonconvex regularization will provide nonasymptotic guarantees on the elementwise estimation error, implying strong, single sample support recovery guarantees when the smallest nonzero element of $\Omega_{0}$ is bounded from below. On the other hand, these stronger results require more restrictive assumptions on sparsity of the precision matrix and its smallest nonzero element. Specifically, we will require the following:

(A4)

The degree (maximum number of nonzero edges connected to a node) of the sparsity graph of each factor $\Psi_{k}$ is bounded by a constant $d$ .

(A5)

The sample size satisfies: $n\min_{k}m_{k}\geq c_{0}d^{2}\log p$ for some $c_{0}$ large enough.

(A6)

There exist constants $c_{\infty},c_{3}$ such that $\|(\Omega_{0}\otimes\Omega_{0})_{\mathcal{S}\mathcal{S}}\|_{\infty}\leq c_{\infty}$ and

[TABLE]

In (A6) the notation $A_{\mathcal{S}\mathcal{S}}$ denotes the submatrix of $A$ formed by extracting the rows and columns corresponding to the index set $\mathcal{S}$ . Under these assumptions we have the following result.

Theorem 3 (Nonconvex Regularizers)

Suppose the regularizer $g_{\rho}$ in (8) is $(\mu,\gamma)$ -amenable, and $\kappa=\sqrt{2/\mu}$ . Then with probability at least $1-2(K+1)\exp(-c\log p)$ as in Theorem 1, (8) has a unique stationary point $\widehat{\Omega}$ (given by the oracle estimator defined in the supplement), with (for all $k$ )

[TABLE]

The proof is given in Section G in the supplement, and uses arguments analogous to those of Loh et al. (2017) along with concentration inequalities arising from the structure of the TeraLasso model.

Theorem 3 implies that the elements (of both $\Omega$ and the offdiagonals of $\Psi_{k}$ ), and thus the support (of both $\Omega$ and the $\Psi_{k}$ ) can be estimated using a single sample ( $n=1$ ) provided $\min_{k}m_{k}$ is large enough. The Frobenius norm convergence rates (both factorwise and overall) for the convex and nonconvex regularizers remain effectively the same (comparing Theorem 3 to Theorems 1 and 2), hence the primary benefit of the nonconvex bound is the ability to guarantee support recovery in exchange for additional assumptions.

5 TG-ISTA Algorithm

In this section, we introduce an iterative soft thresholding (ISTA) method, restricted to the convex set $\mathcal{K}_{\mathbf{p}}^{\sharp}$ of possible positive semidefinite Kronecker sum precision matrices, to implement the TeraLasso optimization (8). We call this implementation Tensor Graphical Iterative Soft Thresholding (TG-ISTA).

5.1 Composite gradient descent and proximal first order methods

Our goal is to solve the objective (8). This objective function can be decomposed into the sum of a differentiable function $f$ and a lower semi-continuous but nonsmooth function $g$. For $\Omega\in\mathcal{K}_{\mathbf{p}}$:

[TABLE]

For objectives of this form, Nesterov (2007) proposed a first order method called composite gradient descent. Composite gradient descent has been specialized to the case of $g=|\cdot|_{1}$ and is widely known as Iterative Soft Thresholding (ISTA) (see for example Tseng (2010); Combettes and Wajs (2005); Beck and Teboulle (2009); Nesterov (1983, 2004)). An extension to nonconvex regularizers $g$ is given in Loh and Wainwright (2013).

The linearity of the constraint set $\mathcal{K}_{\mathbf{p}}$ suggests the use of gradient descent where the gradients are projected onto the associated $1-K+\sum_{k=1}^{K}d_{k}^{2}$ dimensional linear subspace. The positive definite restriction can then be handled in a similar way as Guillot et al. (2012) did for the vanilla GLasso. We therefore derive composite gradient descent in the linear subspace $\mathcal{K}_{\mathbf{p}}$ of $\mathbb{R}^{p^{2}}$ , creating a positive definite sequence of iterates $\{\Omega_{t}\}$ given by the recursion

[TABLE]

where the initial matrix $\Omega_{0}\in\mathcal{K}^{\sharp}_{\mathbf{p}}$ can be chosen as the identity. We enforce the positive semidefinite constraint at each step by performing backtracking line search to find a suitable stepsize $\zeta_{t}$ (see Algorithm 1) (Guillot et al., 2012). We decompose and solve the problem (14) for the case of the TeraLasso objective in Section 5.2 below.

5.2 TG-ISTA implementation of TeraLasso

To apply this form of composite gradient descent to the TeraLasso objective, the projected gradient of $f(\Omega)$ is required for (5.1). For simplicity, consider the $\ell_{1}$ regularized case. The general nonconvex case is described in the next section and the supplement. Since the gradient of $\langle\widehat{S},\Omega\rangle$ with respect to $\Omega$ is $\widehat{S}$ (Lemma I.55 in the supplementary material)

[TABLE]

While many different conventions for parameterizing the projection using the $\widetilde{S}_{k}$ are possible, the projection remains unique. Alternate parameterizations will not affect the convergence or output of the algorithm. Since the gradient of $-\log|\Omega|$ with respect to $\Omega$ is $-\Omega^{-1}$ (Boyd and Vandenberghe, 2009), the corresponding projected gradient is determined by the projection of $\Omega_{t}^{-1}$, which takes the form

[TABLE]

The matrices $G_{k}^{t}\in\mathbb{R}^{d_{k}\times d_{k}}$ are computed via the expressions given in Lemma I.55 in the supplement. Combining (5.2) and (16), the projected gradient of the objective $f(\Omega_{t})$ is

[TABLE]

Lemma 4 (Decomposition of objective)

For $\Omega_{t},\Omega\in\mathcal{K}_{\mathbf{p}}$ of the form

[TABLE]

the unique solution to (14) with $g_{\rho}=|\cdot|_{1}$ is given by $\Omega_{t+1}=\Psi_{1}^{t+1}\oplus\dots\oplus\Psi_{K}^{t+1}$ where

[TABLE]

The proof is in supplement Section B.5. The right hand side of (18) is the proximal operator of the $\ell_{1}$ penalty on the off diagonal entries. The solution has closed form, as given in Beck and Teboulle (2009),

[TABLE]

where we define the off diagonal shrinkage operator $\mathrm{shrink}^{-}_{\rho}(\cdot)$ as

[TABLE]

The composite gradient descent algorithm is given in Algorithm 1. In Section H of the supplement, a scalable geometric rate of convergence of TG-ISTA to the global minimum is derived (Theorem H.42). In Section C.2 of the supplement we show that each iteration can be computed in $O\left(pK+\sum_{k=1}^{K}d_{k}^{3}\right)$ floating point operations.
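The overall iteration can be sketched as below for the $\ell_{1}$ case. This is a simplified illustration rather than Algorithm 1 itself: it assumes the objective $-\log|\Omega|+\langle\widehat{S},\Omega\rangle+\sum_{k}\rho_{k}m_{k}|\Psi_{k}|_{1,\mathrm{off}}$, forms $\Omega^{-1}$ densely (so it does not attain the stated per-iteration cost), and replaces the line search condition by a plain "positive definite and objective not increased" test. All function names are hypothetical.

```python
import numpy as np

def kron_sum(factors):
    dims = [F.shape[0] for F in factors]
    out = np.zeros((int(np.prod(dims)),) * 2)
    for k, F in enumerate(factors):
        left = np.eye(int(np.prod(dims[:k])))
        right = np.eye(int(np.prod(dims[k + 1:])))
        out += np.kron(np.kron(left, F), right)
    return out

def project_factors(A, dims):
    """Factors of the projection of A onto the Kronecker sum subspace."""
    K, p = len(dims), A.shape[0]
    shift = (K - 1) / K * np.trace(A) / p
    out = []
    for k, d in enumerate(dims):
        m = p // d
        T = np.moveaxis(A.reshape(*dims, *dims), (k, K + k), (0, 1))
        avg = np.trace(T.reshape(d, d, m, m), axis1=2, axis2=3) / m
        out.append(avg - shift * np.eye(d))
    return out

def shrink_offdiag(M, thresh):
    """Soft-threshold the off-diagonal entries; leave the diagonal unchanged."""
    out = np.sign(M) * np.maximum(np.abs(M) - thresh, 0.0)
    np.fill_diagonal(out, np.diag(M))
    return out

def objective(factors, S_hat, rho, dims):
    Omega = kron_sum(factors)
    p = Omega.shape[0]
    _, logdet = np.linalg.slogdet(Omega)
    pen = sum(r * (p // d) * (np.abs(F).sum() - np.abs(np.diag(F)).sum())
              for r, d, F in zip(rho, dims, factors))
    return -logdet + np.sum(S_hat * Omega) + pen

def tg_ista(S_hat, dims, rho, n_iter=50, step0=1.0, max_backtrack=40):
    S_tilde = project_factors(S_hat, dims)
    factors = [np.eye(d) / len(dims) for d in dims]      # Omega_0 = identity
    history = [objective(factors, S_hat, rho, dims)]
    for _ in range(n_iter):
        G = project_factors(np.linalg.inv(kron_sum(factors)), dims)
        step = step0
        for _ in range(max_backtrack):
            cand = [shrink_offdiag(F - step * (St - Gk), step * r)
                    for F, St, Gk, r in zip(factors, S_tilde, G, rho)]
            if np.linalg.eigvalsh(kron_sum(cand)).min() > 0:
                if objective(cand, S_hat, rho, dims) <= history[-1]:
                    factors = cand                       # accept the step
                    break
            step *= 0.5                                  # backtrack
        history.append(objective(factors, S_hat, rho, dims))
    return factors, history
```

A call such as `tg_ista(S_hat, (3, 4), [0.1, 0.1])` returns the estimated factors and the objective value at each iteration. Only the $d_{k}\times d_{k}$ factors are thresholded, which is what keeps the proximal step cheap.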

5.3 TG-ISTA for a nonconvex regularizer

The estimation algorithm is largely the same as Algorithm 1, except with an additional term added to the gradient. Specifically, the updates are of the form

[TABLE]

where $\zeta$ is the step size and

[TABLE]

The update (23) can be decomposed into the factorwise updates

[TABLE]

where $q^{\prime}_{\rho}(t)=\frac{d}{dt}(g_{\rho}(t)-\rho|t|)$ for $t\neq 0$ and $q^{\prime}_{\rho}(0)=0$ . These updates can be inserted into the framework of Algorithm 1, with an added step of enforcing the $\|\Omega\|_{2}\leq\kappa$ constraint, e.g. via step size line search. The algorithm is summarized in Algorithm 2 in Supplement B.1.

Theorem 5 (Convergence of Algorithm 2)

Algorithm 2 will converge to the global optimum when the norm constraint parameter $\kappa$ is chosen to be less than or equal to $\sqrt{2/\mu}$ .

Proof 5.6.

Follows since for $\kappa\leq\sqrt{2/\mu}$ the objective (8) is convex on the convex constraint set $\{\Omega\in\mathcal{K}_{\mathbf{p}}|\Omega\succ 0,\|\Omega\|_{2}\leq\kappa\}$ (Lemma G.34, supplement).

6 Validation on synthetic data

Random graphs were created for each factor $\Psi_{k}$ using both an Erdos-Renyi (ER) topology and a random grid graph topology. (Code for experiments is included in the supplementary material and can be found at https://github.com/kgreenewald/teralasso.) These ER type graphs were generated according to the method of Zhou et al. (2010). Initially we set $\Psi_{k}=0.25I_{n\times n}$, where $n=100$, and randomly select $q$ edges and update $\Psi_{k}$ as follows: for each new edge $(i,j)$, a weight $a>0$ is chosen uniformly at random from $[0.2,0.4]$; we subtract $a$ from $[\Psi_{k}]_{ij}$ and $[\Psi_{k}]_{ji}$, and increase $[\Psi_{k}]_{ii},[\Psi_{k}]_{jj}$ by $a$. This keeps $\Psi_{k}$ positive definite. We repeat this process until all edges are added. Finally, we form $\Omega=\Psi_{1}\oplus\dots\oplus\Psi_{K}$. An example 25-node, $q=25$ ER graph and precision matrix are shown in Figure 3.
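The edge-insertion procedure can be sketched as follows. The function name is hypothetical, and the sketch assumes the $q$ edges are distinct node pairs drawn uniformly without replacement, a detail the description above leaves open.

```python
import numpy as np

def random_er_factor(d, q, rng):
    """One factor Psi_k: 0.25 * I plus q randomly placed weighted edges."""
    Psi = 0.25 * np.eye(d)
    pairs = [(i, j) for i in range(d) for j in range(i + 1, d)]
    for idx in rng.choice(len(pairs), size=q, replace=False):
        i, j = pairs[idx]
        a = rng.uniform(0.2, 0.4)      # edge weight
        Psi[i, j] -= a                 # off-diagonal entries decrease by a
        Psi[j, i] -= a
        Psi[i, i] += a                 # both incident diagonals increase by a
        Psi[j, j] += a
    return Psi

rng = np.random.default_rng(0)
Psi = random_er_factor(25, 25, rng)    # the 25-node, q = 25 configuration
min_eig = np.linalg.eigvalsh(Psi).min()
```

Each edge adds a positive semidefinite rank-one term $a(e_{i}-e_{j})(e_{i}-e_{j})^{T}$, so the result is $0.25I$ plus a weighted graph Laplacian and its smallest eigenvalue never falls below $0.25$.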

The random grid graph is produced in a similar way, with the exception that edges are only allowed between adjacent nodes, where the nodes are arranged on a square grid (Figure 3(b)). Algorithm 1 in Section B.3 of the supplement describes how the random vector $\mathbf{x}=\mathrm{vec}(X^{T})$ is generated under the Kronecker sum model.

6.1 Validation of theoretical algorithmic convergence rates

To verify the geometric convergence of the TG-ISTA implementation (Theorem H.42 in the supplement), we generated Kronecker sum inverse covariance graphs and plotted the Frobenius norm between the inverse covariance iterates $\Omega_{t}$ and the optimal point $\Omega^{*}$. We set the $\Psi_{k}$ to be random ER graphs with $d_{k}$ edges where $d_{1}=\dots=d_{K}$, and determined the value for $\rho_{k}=\rho$ using cross validation. Figure 4 shows the results as a function of iteration, for a variety of $d_{k}$ and $K$ configurations and the $\ell_{1}$ convex regularization. Figure 12 in Supplement B.1 repeats these experiments with the nonconvex SCAD and MCP penalties, using the same random seed. For comparison, the statistical error of the optimal point is also shown, as optimizing beyond this level provides reduced benefit. As predicted, linear or better convergence to the global optimum is observed. The small number of iterations combined with the low computational cost per iteration confirm the algorithmic efficiency of the TG-ISTA implementation of TeraLasso. Additional numerical experiments demonstrating fast convergence on larger scale problems are given in Section C.2 of the supplement.

6.2 Regularization with $\ell_{1}$ penalty

In the TeraLasso objective (10), the sparsity of the estimate is controlled by $K$ distinct tuning parameters $\rho_{k}$ for $k=1,\dots,K$ . The convergence condition on $\rho_{k}$ in Theorem 1 suggests that the $\rho_{k}$ can be set as $\rho_{k}=\bar{\rho}\sqrt{\frac{\log p}{nm_{k}}}$ with $\bar{\rho}$ being a single scalar tuning parameter, depending on absolute constants and $\|\Sigma\|_{2}$ . Below, we experimentally validate the reliability of this tuning strategy.
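The single-parameter tuning rule is simple to compute. In the sketch below the function name is hypothetical and `rho_bar` stands for the scalar $\bar{\rho}$, which in practice would be chosen by cross validation.

```python
import math

def tuning_parameters(dims, n, rho_bar):
    """rho_k = rho_bar * sqrt(log(p) / (n * m_k)) with m_k = p / d_k."""
    p = math.prod(dims)
    return [rho_bar * math.sqrt(math.log(p) / (n * (p // d))) for d in dims]

# Single-sample example with unequal factor dimensions.
rho = tuning_parameters((10, 20, 30), n=1, rho_bar=1.0)
```

Factors with larger $d_{k}$ have smaller $m_{k}$, i.e. fewer effective replicates per entry, and therefore receive a larger penalty.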

The performance is empirically evaluated using several metrics including: the Frobenius norm ( $\|\widehat{\Omega}-\Omega_{0}\|_{F}$ ) and spectral norm ( $\|\widehat{\Omega}-\Omega_{0}\|_{2}$ ) error of the precision matrix estimate $\widehat{\Omega}$ and the Matthews correlation coefficient to quantify the edge misclassification error. Let the number of true positive edge detections be TP, true negatives TN, false positives FP, and false negatives FN. The Matthews correlation coefficient is defined as (Matthews, 1975)

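$\mathrm{MCC}=\frac{\mathrm{TP}\cdot\mathrm{TN}-\mathrm{FP}\cdot\mathrm{FN}}{\sqrt{(\mathrm{TP}+\mathrm{FP})(\mathrm{TP}+\mathrm{FN})(\mathrm{TN}+\mathrm{FP})(\mathrm{TN}+\mathrm{FN})}},$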

where each nonzero off diagonal element of $\Psi_{k}$ is considered as a single edge. Larger values of MCC imply better edge estimation performance, with $\mathrm{MCC}=0$ implying complete failure and $\mathrm{MCC}=1$ perfect edge set estimation.
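A minimal sketch of the edge-level MCC computation is given below. The function name, the tolerance `tol` used to decide whether an entry is nonzero, and the convention of returning 0 when the denominator vanishes are choices made for this illustration. Edges are counted once per unordered pair; counting both $(i,j)$ and $(j,i)$ would double every count and leave the MCC unchanged.

```python
import numpy as np

def mcc(true_factors, est_factors, tol=1e-8):
    """Matthews correlation coefficient of edge detection over all factors."""
    tp = tn = fp = fn = 0
    for T, E in zip(true_factors, est_factors):
        iu = np.triu_indices(T.shape[0], k=1)   # off-diagonal, upper triangle
        t = np.abs(T[iu]) > tol                 # true edges
        e = np.abs(E[iu]) > tol                 # detected edges
        tp += int(np.sum(t & e))
        tn += int(np.sum(~t & ~e))
        fp += int(np.sum(~t & e))
        fn += int(np.sum(t & ~e))
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0

# Tridiagonal truth on 4 nodes: 3 edges and 3 non-edges.
truth = 2.0 * np.eye(4) - 0.5 * (np.eye(4, k=1) + np.eye(4, k=-1))
complement = (np.ones((4, 4)) - np.eye(4)) * (truth == 0)

score_perfect = mcc([truth], [truth])
score_inverted = mcc([truth], [complement])
```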

Shown in Figure 5 are the MCC, normalized Frobenius error, and spectral norm error as functions of $\bar{\rho}_{1}$ and $\bar{\rho}_{2}$, where the $\bar{\rho}_{k}$ are constants giving ${\rho}_{k}=\bar{\rho}_{k}\sqrt{(\log p)/(nm_{k})}$. Note that $\bar{\rho}_{1}=\bar{\rho}_{2}=\bar{\rho}_{3}$ achieves near optimal results.

Having verified the single tuning parameter approach, hereafter we will cross-validate only $\bar{\rho}$. In supplement Section C.3, we provide experimental verification in a wide variety of experimental settings (including varying the relative size of the tensor dimensions $d_{k}$) that our bounds on the rate of convergence for the $\ell_{1}$ regularized model are tight. Figure 6 illustrates how increasing dimension $p$ and $K$ improves single sample performance. Shown are the average TeraLasso edge detection precision and recall values for different values of $K$ in the single and 5-sample regimes, all increasing to $1$ (perfect structure estimation) as $p$, $K$, and $n$ increase.

6.3 Nonconvex Regularization

Here the $\ell_{1}$ penalized TeraLasso is compared to TeraLasso with nonconvex regularization (8). Shown in Figure 7 are the MCC, normalized Frobenius error, and spectral norm error for estimating $K=2$ and $K=3$ Erdos-Renyi graphs as functions of regularization parameter ${\rho}$ for each of $\ell_{1}$, SCAD (103), and MCP (104) regularizers in a variety of configurations. Figure 8 shows similar results for $\Psi_{k}$ a variant of the spiked identity model of Loh et al. (2017). Observe that nonconvex regularization improves performance slightly, not only for structure estimation (MCC) but for the Frobenius norm error (due to the reduction in bias) as well. This improvement is increased in the spiked identity case.

7 NCEP Windspeed Data

The TeraLasso model is illustrated on a meteorological dataset. The US National Center for Environmental Prediction (NCEP) maintains records of average daily wind velocities in the lower troposphere, with daily readings beginning in 1948. The data is available online at ftp://ftp.cdc.noaa.gov/Datasets/ncep.reanalysis.dailyavgs/surface. Velocities are recorded globally, in a $144\times 73$ latitude-longitude grid with spacings of 2.5 degrees in each coordinate. Over bounded areas, the spacing is approximately a rectangular grid, suggesting a $K=2$ model (latitude vs. longitude) for the spatial covariance, and a $K=3$ model (latitude vs. longitude vs. time) for the full spatio-temporal covariance.

Consider the time series of daily-average wind speeds. Following Tsiligkaridis and Hero (2013), we regress out the mean for each day in the year via a 14-th order polynomial regression on the entire history from 1948-2015. We extract two $20\times 10$ spatial grids, one from eastern North America, and one from western North America (Figure 9). Figure 10 shows the TeraLasso estimates for latitude and longitude factors using time samples from January in $n$ years following 1948, for both the eastern and western grids. Observe the approximate AR structure, and the break in correlation (Figure 10 (b), longitude factor) in the Western Longitude factor. The location of this break corresponds to the high elevation line of the Rocky Mountains. In the supplement, we compare the TeraLasso estimator to the unstructured shrinkage estimator, the non-sparse Kronecker sum estimator (TeraLasso estimator with sparsity parameter $\rho=0$ ), and the Gemini sparse Kronecker product estimator of Zhou (2014). It is shown that the TeraLasso provides a significantly better fit to the data.

To illustrate the utility of the estimated precision matrices, we use them to construct a season classifier. NCEP windspeed records are taken from the 51-year span from 1948-2009. We estimate spatial precision matrices on $n$ consecutive days in January and June of a training year, respectively, and run anomaly detection on $m=30$-day sequences of observations in the remaining 50 testing years. We report average classifier performance by averaging over all 51 possible partitions of the 51-year data into 1 training and 50 testing years. The sequences are labeled as summer (June) and winter (January), and we compute the classification error rate for the winter vs. summer classifier obtained by choosing the season associated with the larger of the likelihood functions

[TABLE]
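The decision rule can be sketched as below. This is an illustration under stated assumptions, not the paper's exact expression: it uses the standard zero-mean Gaussian log-likelihood (the mean having been regressed out), and the function names and the synthetic precision matrices are hypothetical.

```python
import numpy as np

def gaussian_loglik(x, Omega):
    """Zero-mean Gaussian log-likelihood of the rows of x, up to a constant."""
    _, logdet = np.linalg.slogdet(Omega)
    quad = np.einsum('ni,ij,nj->', x, Omega, x)   # sum_i x_i^T Omega x_i
    return 0.5 * (x.shape[0] * logdet - quad)

def classify_season(x, Omega_winter, Omega_summer):
    """Pick the season whose estimated precision matrix gives the larger likelihood."""
    if gaussian_loglik(x, Omega_winter) >= gaussian_loglik(x, Omega_summer):
        return "winter"
    return "summer"

# Synthetic stand-ins for the two estimated precision matrices (p = 5).
Omega_winter = np.eye(5)            # unit variance
Omega_summer = 0.25 * np.eye(5)     # variance 4

rng = np.random.default_rng(0)
winter_like = rng.standard_normal((30, 5))          # m = 30 test observations
summer_like = 2.0 * rng.standard_normal((30, 5))
```

In the experiment the two precision matrices would be the TeraLasso estimates from the training year, and each row of `x` a vectorized spatial (or spatio-temporal) sample.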

We consider the $K=3$ spatial-temporal precision matrix for a spatial-temporal array of size $10\times 20\times T$, with the first ($10\times 10$) factor corresponding to the latitude axis of the spatial array, the second a $20\times 20$ factor corresponding to the longitude axis, and the third factor a $T\times T$ factor corresponding to a temporal axis of length $T$. The spatial-temporal array is created by concatenating $T$ temporally consecutive $10\times 20$ spatial samples. We use $\ell_{1}$ regularization.

Results for different sized temporal covariance extents ( $T=d_{3}$ ) are shown in Figure 11 for TeraLasso, with unregularized TeraLasso (ML Kronecker Sum) and maximum likelihood Kronecker product estimator (Werner et al., 2008; Tsiligkaridis et al., 2013) results shown for comparison. In this experiment, we use the ML Kronecker product estimator instead of the Gemini, as for this maximum-likelihood classification task the maximum-likelihood based approach performs significantly better than the factorwise objective approach of the Gemini estimators, which is not surprising as the Kronecker product is not a good fit for this data (Section C.4 of the supplement). Note the superior performance and increased single sample robustness of the proposed ML Kronecker Sum and TeraLasso estimates as compared to the Kronecker product estimate, confirming the better fit of TeraLasso. In each case, the nonmonotonic behavior of the Kronecker product curves is due partly to randomness associated with the small test sample size, and partly due to the fact that the Kronecker product in $K=3$ has overly strong coupling across tensor directions, giving large bias.

8 Conclusion

A factorized model, called the TeraLasso, is proposed for the precision matrix of tensor-valued data that uses Kronecker sum structure and sparsity to regularize the precision matrix estimate. An ISTA-like optimization algorithm is presented that scales to high dimensions. Statistical and algorithmic convergence are established for the TeraLasso that quantify performance gains relative to other structured and unstructured approaches. Numerical results demonstrate single-sample convergence as well as tightness of the bounds. Finally, an application to real tensor-valued ( $K=3$ ) meteorological data is considered, where the TeraLasso model is shown to fit the data well and enable improved single-sample performance for estimation and anomaly detection. Future work includes combining first moment tensor representation methods for mean estimation such as PARAFAC (Harshman and Lundy, 1994) with the second order TeraLasso method introduced in this paper for estimating the covariance.

9 Acknowledgement

The research reported in this paper was partially supported by US Army Research Office grant W911NF-15-1-0479, US Department of Energy grant DE-NA0002534, NSF grant DMS-1316731, and the Elizabeth Caroline Crosby Research Award from the Advance Program at the University of Michigan.

Appendix A Appendix outline

This supplement is organized as follows. Sections B-C focus on the implementation and numerical convergence of the TeraLasso algorithm and Sections D-H focus on theory and proofs of convergence. Section B presents the algorithm for TeraLasso with nonconvex regularization and describes additional properties of the TeraLasso algorithm, including a discussion of the choice of step size, decomposition of the gradient update, and proof of joint convexity of the objective. Section C presents additional numerical experiments, including convergence of the nonconvex algorithm, larger scale TG-ISTA convergence experiments, additional discussion comparing the fit of the TeraLasso model to the wind speed data, and a discussion of the geometric differences between the Gemini and TeraLasso objectives.

We then proceed to the convergence analysis. Section D describes properties of the Kronecker sum and the Kronecker sum subspace $\mathcal{K}_{\mathbf{p}}$ that are needed for the remainder of the discussion. Proofs of the main Frobenius norm theorem and of the spectral norm theorem are in Section E, with the concentration bounds proven in Section F. Section G proves the result on nonconvex regularization, and Section H presents and proves theorems on the geometric convergence of the TG-ISTA algorithm. Relevant properties and identities relating to the space $\mathcal{K}_{\mathbf{p}}$ spanned by Kronecker sum matrices are contained in Appendix I, and a discussion of the case where the diagonal elements of $\Omega$ are known is given in Appendix J.

Appendix B TeraLasso algorithm step size and numerical convergence proofs

B.1 Convergence of nonconvex regularization algorithm

The TG-ISTA implementation of the TeraLasso algorithm for nonconvex regularizers is shown in Algorithm 2. The primary differences from the $\ell_{1}$-regularized case are (a) the addition of the norm constraint, and (b) the use of the nonconvex regularizer in the gradient computation.

B.2 Choice of step size $\zeta_{t}$

Here we propose a method (24) for selecting the stepsize parameter $\zeta_{t}$ at each step $t$ that ensures convergence of the algorithm. We follow the approach of Beck and Teboulle (2009) and Guillot et al. (2012). Since $\Omega_{t}\succ 0$ and the positive definite cone is an open set, there will always exist a $\zeta_{t}$ small enough such that $\Omega_{t+1}\succ 0$ . We prove geometric convergence when $\zeta_{t}$ is chosen such that $\Omega_{t+1}\succ 0$ and

[TABLE]

where $\mathcal{Q}_{\zeta_{t}}$ is a quadratic approximation to $f$ given by

[TABLE]

At each iteration $t$ , we thus perform a line search to select an appropriate $\zeta_{t}$ . We first select an initial stepsize $\zeta_{t,0}$ and compute the update (19). If the resulting $\Omega_{t+1}$ is not positive definite or does not decrease the objective sufficiently according to (24), we decrease the stepsize $\zeta_{t}$ to $c\zeta_{t,0}$ for $c\in(0,1)$ and re-evaluate if the resulting $\Omega_{t+1}$ satisfies the conditions. This backtracking process is repeated (setting stepsize equal to $c^{j}\zeta_{t,0}$ where $j$ is incremented) until the resulting $\Omega_{t+1}$ satisfies the conditions. Since by construction $\Omega_{t}$ is positive definite, and the positive definite cone is an open set, there will be a step size small enough such that the conditions are satisfied. In practice, if after a set number of backtracking steps the conditions are still not satisfied, we can always take the safe step
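The backtracking procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the TG-ISTA implementation: it works on an unstructured $p\times p$ matrix rather than on the Kronecker sum factors, it assumes the smooth part is $f(\Omega)=-\log|\Omega|+\langle S,\Omega\rangle$, and it assumes the sufficient decrease test is the standard Beck and Teboulle condition $f(\Omega_{t+1})\leq\mathcal{Q}_{\zeta_{t}}(\Omega_{t+1},\Omega_{t})$. The function names and the default values of `c` and `max_backtracks` are hypothetical.

```python
import numpy as np

def soft_threshold_offdiag(M, thresh):
    """Soft-threshold the off-diagonal entries; leave the diagonal unpenalized."""
    out = np.sign(M) * np.maximum(np.abs(M) - thresh, 0.0)
    np.fill_diagonal(out, np.diag(M))
    return out

def smooth_part(Omega, S):
    """Assumed smooth term f(Omega) = -log det(Omega) + <S, Omega>."""
    _, logdet = np.linalg.slogdet(Omega)
    return -logdet + np.sum(S * Omega)

def is_pd(M):
    """Positive definiteness check via a Cholesky attempt."""
    try:
        np.linalg.cholesky(M)
        return True
    except np.linalg.LinAlgError:
        return False

def backtracking_step(Omega, S, rho, zeta0, c=0.5, max_backtracks=30):
    """Try step sizes zeta0, c*zeta0, c^2*zeta0, ... until both conditions hold."""
    grad = S - np.linalg.inv(Omega)
    f_old = smooth_part(Omega, S)
    zeta = zeta0
    for _ in range(max_backtracks):
        cand = soft_threshold_offdiag(Omega - zeta * grad, zeta * rho)
        if is_pd(cand):
            diff = cand - Omega
            # Quadratic approximation of f around Omega with curvature 1/zeta.
            quad = f_old + np.sum(grad * diff) + np.sum(diff ** 2) / (2.0 * zeta)
            if smooth_part(cand, S) <= quad:
                return cand, zeta
        zeta *= c
    # No acceptable step found: the caller would fall back to the safe step.
    return None, zeta
```

When the sufficient decrease test passes, the penalized objective cannot increase, which is why the accepted candidate can be used directly as $\Omega_{t+1}$.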

[TABLE]

As the safe stepsize often leads to slower convergence, we use the more aggressive Barzilai-Borwein step to set a starting $\zeta_{t,0}$ at each time. The Barzilai-Borwein stepsize presented in Barzilai and Borwein (1988) creates an approximation to the Hessian, in our case given by

[TABLE]

We derive the gradient $\nabla f(\Omega_{t})$ in the next section. The norms and inner products in (26) and (25) can be efficiently computed factorwise (using the $\Psi_{k}$ and $S_{k}$ only) using the formulas in Appendix I.1.
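To illustrate the Barzilai-Borwein idea used to initialize $\zeta_{t,0}$, the sketch below computes the generic Barzilai-Borwein ratio $\langle s,s\rangle/\langle s,y\rangle$ with $s=\Omega_{t}-\Omega_{t-1}$ and $y=\nabla f(\Omega_{t})-\nabla f(\Omega_{t-1})$. It is an assumption that this generic form matches the expression used in the paper; the sketch also forms full matrices and inverses for clarity, whereas the factorwise formulas of Appendix I.1 avoid this. The function names are hypothetical.

```python
import numpy as np

def grad_smooth(Omega, S):
    """Gradient of the assumed smooth term -log det(Omega) + <S, Omega>."""
    return S - np.linalg.inv(Omega)

def bb_step(Omega_prev, Omega_curr, S):
    """Generic Barzilai-Borwein step <s, s> / <s, y>; None if the curvature is not positive."""
    s = Omega_curr - Omega_prev
    y = grad_smooth(Omega_curr, S) - grad_smooth(Omega_prev, S)
    curvature = np.sum(s * y)
    if curvature <= 0:
        return None
    return np.sum(s * s) / curvature
```

For example, moving from $I_{3}$ to $2I_{3}$ gives $s=I_{3}$ and $y=I_{3}/2$, so the step is $3/1.5=2$, which lies between the inverse curvatures $1$ and $4$ of $-\log|\Omega|$ at the two endpoints.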

B.3 Generation of Kronecker Sum Random Tensors

Generating random tensors given a Kronecker sum precision matrix can be made efficient by exploiting the Kronecker sum eigenstructure. Algorithm 3 allows efficient generation of data following the TeraLasso model.
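The eigenstructure argument can be sketched for $K=2$ as follows. This is an illustrative sketch and not a transcription of Algorithm 3: it assumes zero-mean Gaussian data with precision $\Psi_{1}\oplus\Psi_{2}=\Psi_{1}\otimes I_{d_{2}}+I_{d_{1}}\otimes\Psi_{2}$ and row-major vectorization, and the function names are hypothetical. With $\Psi_{k}=U_{k}\Lambda_{k}U_{k}^{T}$, the eigenvectors of the Kronecker sum are $U_{1}\otimes U_{2}$ and its eigenvalues are the pairwise sums $\lambda_{1,i}+\lambda_{2,j}$, so white noise can be colored without ever forming a $p\times p$ matrix.

```python
import numpy as np

def color_noise(Z, Psi1, Psi2):
    """Map white noise Z of shape (n, d1, d2) to samples with precision Psi1 (+) Psi2."""
    lam1, U1 = np.linalg.eigh(Psi1)
    lam2, U2 = np.linalg.eigh(Psi2)
    # Eigenvalues of the Kronecker sum: all pairwise sums lam1[i] + lam2[j].
    L = lam1[:, None] + lam2[None, :]
    if np.any(L <= 0):
        raise ValueError("Kronecker sum is not positive definite")
    W = Z / np.sqrt(L)                      # scale in the eigenbasis
    # Rotate back: X_n = U1 @ W_n @ U2.T for each sample n.
    return np.einsum('ab,nbc,dc->nad', U1, W, U2)

def sample_kron_sum_gaussian(Psi1, Psi2, n, rng):
    """Draw n matrices of shape (d1, d2) with vec(X) ~ N(0, (Psi1 (+) Psi2)^{-1})."""
    Z = rng.standard_normal((n, Psi1.shape[0], Psi2.shape[0]))
    return color_noise(Z, Psi1, Psi2)
```

The cost is two small eigendecompositions plus $O(p(d_{1}+d_{2}))$ work per sample, compared with the $O(p^{3})$ factorization needed if the full precision matrix were formed.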

B.4 Detailed TeraLasso Algorithm

Algorithm 4 shows additional details of the implementation of Algorithm 1 in the main text.

B.5 Decomposition of Objective: Proof of Lemma 4

For simplicity of notation define $G_{t}$ to be the projection of $\Omega^{-1}$ onto the cone $\mathcal{K}_{\mathbf{p}}$ of positive definite Kronecker sum matrices:

[TABLE]

Using this notation and substituting in (17) from the main text, the objective (14) becomes

[TABLE]

Expanding out the Kronecker sums, for

[TABLE]

the Frobenius norm term in the objective (27) can be decomposed into a sum of a diagonal portion and a factor-wise sum of the off diagonal portions. This holds by Property 2 in Appendix A which states the off diagonal factors $\Psi_{k}^{-}$ have disjoint support in $\Omega$ . Thus,

[TABLE]

Substituting into the objective (27), we obtain

[TABLE]

This objective is decomposable into a sum of terms each involving either the diagonal $\Omega^{+}$ or one of the off diagonal factors $\Psi_{k}^{-}$ . Thus, we can solve for each portion of $\Omega$ independently, giving

[TABLE]

Since the diagonal $\mathrm{diag}(\Omega)$ is not regularized in (28), we have

[TABLE]

i.e.

[TABLE]

This means we can equivalently obtain the solution of the problem (28) by solving

[TABLE]

completing the proof.

∎

B.6 Proof of Joint Convexity

Our objective function is

[TABLE]

We have the following theorem. This theorem proves the joint convexity of the objective function (30) and the uniqueness of the minimizer $\widehat{\Omega}$ .

Theorem B.7.

The objective function (30) is jointly convex in $\{{\Psi}_{k}\}_{k=1}^{K}$ . Furthermore, define the set $\mathcal{A}=\{\{{\Psi}_{k}\}_{k=1}^{K}|Q(\{{\Psi}_{k}\}_{k=1}^{K})=Q^{*}\}$ where the global minimum $Q^{*}=\min_{\{{\Psi}_{k}\}_{k=1}^{K}}Q(\{{\Psi}_{k}\}_{k=1}^{K})$ . There exists a unique ${\Omega}_{*}\in\mathcal{K}_{\mathbf{p}}^{\sharp}$ , defined in (4), that achieves the minimum of $Q$ such that

[TABLE]

Proof B.8.

By definition,

[TABLE]

is an affine function of $\mathbf{z}=[\mathrm{vec}({\Psi}_{1});\dots;\mathrm{vec}({\Psi}_{K})]$ . Thus, since $\log|{A}|$ is a concave function on the space of positive definite matrices (Boyd and Vandenberghe, 2009), all the terms of $Q$ are convex since convex functions of affine functions are convex and the elementwise $\ell_{1}$ norm is convex. Hence $Q$ is jointly convex in $\{{\Psi}_{k}\}_{k=1}^{K}$ on $\mathcal{K}_{\mathbf{p}}^{\sharp}$ . Hence, every local minimum is also global. Furthermore, for positive $\rho_{k}$ at least one global minimum must exist since $|\cdot|_{1}$ has a global minimum at zero.

We show that a nonempty set of $\{{\Psi}_{k}\}_{k=1}^{K}$ such that $Q(\{{\Psi}_{k}\}_{k=1}^{K})$ is minimized maps to a unique ${\Omega}={\Psi}_{1}\oplus\dots\oplus{\Psi}_{K}$ . If only one point $\{{\Psi}_{k}\}_{k=1}^{K}$ exists that achieves the global minimum, then the statement is proved. Otherwise, suppose that two distinct points $\{{\Psi}_{k,1}\}_{k=1}^{K}$ and $\{{\Psi}_{k,2}\}_{k=1}^{K}$ achieve the global minimum $Q^{*}$ . Then, for all $k$ define

[TABLE]

By convexity, $Q(\{{\Psi}_{k,\alpha}\}_{k=1}^{K})=Q^{*}$ for all $\alpha\in[0,1]$ , i.e. $Q$ is constant along the specified affine line segment. This can only be true if (up to an additive constant) the first two terms of $Q$ are equal to the negative of the second two terms along the specified segment. Since

[TABLE]

is strictly convex and smooth on the positive definite cone (i.e. the second derivative along any line never vanishes) (Boyd and Vandenberghe, 2009) and the sum of the two elementwise $\ell_{1}$ norms along any affine combination of variables is at most piecewise linear when smooth, this cannot hold when ${\Omega}_{\alpha}={\Psi}_{1,\alpha}\oplus\cdots\oplus{\Psi}_{K,\alpha}$ varies with $\alpha$ . Hence, ${\Omega}_{\alpha}$ must be a constant ${\Omega}^{*}$ with respect to $\alpha$ . Thus, the minimizing ${\Omega}^{*}$ is unique and Theorem B.7 is established.

∎

Appendix C Additional experiments

C.1 Convergence of nonconvex regularization algorithm

Figure 12 illustrates the convergence of the nonconvex Algorithm 2 (experiment described more thoroughly in the main text).

C.2 Computational Complexity of TG-ISTA

In Section H, we show that TG-ISTA reaches the statistical error floor in

[TABLE]

iterations.

Each TG-ISTA iteration is also computationally efficient. Due to the representation (10), the TG-ISTA implementation of TeraLasso never needs to form the full $p\times p$ covariance. The memory footprint of the proposed implementation is $O(p+\sum_{k=1}^{K}d_{k}^{2})$ as opposed to the $O(p^{2})$ storage required by BiGLasso and GLasso. Since the training data itself requires $O(np)$ storage, the storage footprint of the TG-ISTA implementation of TeraLasso is scalable to large values of $p=\prod_{k=1}^{K}d_{k}$ when the $d_{k}/p$ decrease in $p$ , e.g. $d_{k}=p^{1/K}$ . The computational cost per iteration is dominated by the computation of the gradient, which is performed by doing $K$ eigendecompositions of size $d_{1},\dots,d_{K}$ respectively and then computing the projection of the inverse of the Kronecker sum of the resulting eigenvalues. The former step costs $O(\sum_{k=1}^{K}d_{k}^{3})$ , and the second step costs $O(pK)$ , giving a cost per iteration of $O\left(pK+\sum_{k=1}^{K}d_{k}^{3}\right)$ . For $K>1$ and $d_{k}/p\ll 1$ , this gives a dramatic improvement on the $O(p^{3})=O(\prod_{k=1}^{K}d_{k}^{3})$ cost per iteration of unstructured Graphical Lasso algorithms (Guillot et al., 2012; Hsieh et al., 2014). In addition, for $K\leq 3$ the cost per iteration is comparable to the $O(d_{1}^{3}+d_{2}^{3}+d_{3}^{3})$ cost per iteration of the most efficient ( $K=3$ ) Kronecker product GLasso methods such as Zhou (2014).

Figure 13 shows convergence speeds on various random ER graph estimation scenarios, with the BiGLasso of Kalaitzis et al. (2013) shown for comparison. Note that the BiGLasso algorithm only applies when the diagonal elements of $\Omega$ are known, so it cannot be considered to solve the general BiGLasso or TeraLasso objectives. Observe that TeraLasso’s ability to efficiently exploit the Kronecker sum structure to obtain computational and memory savings allows it to quickly converge to the optimal solution, while the alternating-minimization based BiGLasso algorithm is impractically slow. All computation was timed on a 4-core, 64 bit, 2.5GHz CPU system using Matlab 2016b.

C.3 Convergence rate verification

In this section, we verify that our bounds on the rate of convergence are tight in the case of $\ell_{1}$ regularization. We will hold $\|\Sigma_{0}\|_{2}$ and $s/p$ constant. We set $\rho_{k}$ as in Theorem 1. By Lemma D.9 in the supplement, this implies an “effective sample size” proportional to the inverse of the bound on $\|\widehat{\Omega}-\Omega_{0}\|_{F}^{2}/p$ :

[TABLE]

For each experiment below, we varied $K$ and $d_{2}$ over 6 scenarios. To ensure that the constants in the bound were minimally affected, we held $\Psi_{1}$ constant over all $(K,d_{2})$ scenarios, and let $\Psi_{3}=0$ and $d_{3}=d_{1}$ when $K=3$ . We let $d_{2}$ vary by powers of 2, i.e. $d_{2}(c_{d})=2^{c_{d}}d_{2,\mathrm{base}}$ where $d_{2,\mathrm{base}}$ is a constant, allowing us to create a fixed matrix $B$ and set $\Psi_{2}=I_{d_{2}/d_{2,\mathrm{base}}}\otimes B$ to ensure the eigenvalues of $\Psi_{2}$ and thus $\|\Sigma_{0}\|_{2}$ remain unaffected as $d_{2}$ ( $c_{d}$ ) changes.
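The reason the construction $\Psi_{2}=I\otimes B$ keeps the spectrum fixed can be checked numerically: $I\otimes B$ is block diagonal with copies of $B$, so its eigenvalues are those of $B$ repeated. The sketch below uses a hypothetical $2\times 2$ base matrix $B$ (not the one used in the experiments) and a hypothetical helper name.

```python
import numpy as np

def psi2_from_base(B, c_d):
    """Build Psi_2 = I_{2^c_d} (x) B, a block-diagonal stack of 2^c_d copies of B."""
    return np.kron(np.eye(2 ** c_d), B)

# Hypothetical base matrix with eigenvalues 1 and 3.
B = np.array([[2.0, -1.0], [-1.0, 2.0]])

for c_d in range(4):
    Psi2 = psi2_from_base(B, c_d)
    eigs = np.linalg.eigvalsh(Psi2)
    # The dimension doubles with c_d while the extreme eigenvalues stay fixed.
    print(Psi2.shape[0], eigs.min(), eigs.max())
```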

Results averaged over random training data realizations are shown in Figure 14 for ER ( $d_{k}/2$ edges per factor), random grid ( $d_{k}/2$ edges per factor), and AR-1 graphs (AR parameter $.5$ for both factors). Observe that in each case, the curves for all scenarios are very close despite the wide variation in dimensionality, indicating that our bound on the rate of convergence in Frobenius norm is tight.

C.4 Additional details for wind speed data experiments

For the wind speed data example in the main text, we first regressed out the mean for each day in the year via a 14-th order polynomial regression on the entire history from 1948-2015. As in the main text, we extracted two $20\times 10$ spatial grids, one from eastern North America, and one from Western North America, with the latter including an expansive high-elevation area and both Atlantic and Pacific oceans (Figure 9). We compare the TeraLasso estimator to the unstructured shrinkage estimator, the non-sparse Kronecker sum estimator (TeraLasso estimator with sparsity parameter $\rho=0$ ), and the Gemini sparse Kronecker product estimator of Zhou (2014). Figure 15 shows the estimated precision matrices trained on the eastern grid, using time samples from January in $n$ years following 1948. Note the graphical structure reflects approximate auto-regressive (AR) spatial and temporal structure in each dimension. The TeraLasso estimation is much more stable than the Kronecker product estimation for small sample size $n$ .

To quantify the fit of the estimated precision matrices to the observed wind data, we compare to an unstructured estimator in a higher sample regime. After training each estimated precision matrix (TeraLasso, Gemini, and ML Kronecker Product) on a 30-day summer interval from 1 year, as in the main experiment, we create a sample covariance $\widehat{S}_{\mathrm{test}}$ from the same 30-day summer intervals in the remaining 50 years. We evaluate the precision matrices estimated by TeraLasso, Gemini, and ML Kronecker product using a normalized Frobenius error metric:

[TABLE]

If this metric is small, the structured $\widehat{\Omega}$ is close to the unstructured $(\widehat{S}_{\mathrm{test}}+\delta I_{p})^{-1}$ , indicating a good fit to the data. The small ridge $\delta$ is included to ensure that the unstructured inverse estimator $(\widehat{S}_{\mathrm{test}}+\delta I_{p})^{-1}$ is well-conditioned, with the minimum taken over $\delta$ to present the most optimistic view of Gemini and the ML Kronecker product. The results for each precision matrix are TeraLasso: 0.0728, Gemini: 0.903, and ML Kronecker Product: 0.76, confirming the superior performance of the TeraLasso estimator.

C.5 Comparison between TeraLasso and Gemini (Kronecker product) log determinant geometry

In this section, we present further analysis of the relation of the performance of TeraLasso in this wind data setting to its inherently more robust eigenstructure.

Recall the $\ell_{1}$ TeraLasso objective

[TABLE]

where $m_{k}=p/d_{k}$ . The Gemini Kronecker product algorithm of Zhou (2014) uses a similar objective function to estimate the Kronecker product covariance, which can be shown to be equivalent to

[TABLE]

Observe that, for $K=2$ , the Gemini objective function (37) is the same as the TeraLasso objective function (36) except for the log determinant term. Figure 16 (a) compares the Kronecker product Gemini estimator to TeraLasso on data generated using the Kronecker sum precision matrix $\Psi_{1}\oplus\Psi_{2}$ , and again on data generated using the Kronecker product precision matrix $\Psi_{1}\otimes\Psi_{2}$ , where $\Psi_{1},\Psi_{2}$ are each $10\times 10$ random ER graphs (generated as in the main text) with 5 nonzero edges. In all cases, we used the theoretically dictated optimal $\ell_{1}$ penalty for TeraLasso from Theorem 1 in the main text and for Gemini from Theorem 3.1 in Zhou (2014). Note that both methods perform well in the single sample regime, even under model misspecification. This apparent symmetry is very different from the relation of the ML Kronecker sum (TeraLasso with zero penalty) and the ML Kronecker product (not directly related to Gemini), whose results on the same data are also shown in Figure 16 (b). In this case, the ML Kronecker product performs poorly in the single sample regime, whereas the ML Kronecker sum performs well in all regimes, surpassing the ML Kronecker product method in the low sample regime even when the data is generated under the Kronecker product model.

This seems to indicate that the Gemini estimator leverages some of the inherent stability of the ML Kronecker sum objective (TeraLasso) to solve the more unstable Kronecker product covariance estimation problem.

To further illuminate the connection between TeraLasso and Gemini, we now examine the relationship of the geometry of the differing log determinant terms. Let the eigenvalues of $\Psi_{k}$ be denoted as $\lambda_{k,1},\dots,\lambda_{k,d_{k}}$ , and suppose that $\Psi_{1}\oplus\dots\oplus\Psi_{K}\succ 0$ so we can assume all the $\lambda_{k,i}\geq 0$ . Using the properties of determinants and the additivity of the eigenvalues in a Kronecker sum we can write

[TABLE]

Observe that the partial derivative of the log determinant with respect to any one eigenvalue $\lambda_{k,i_{k}}$ is $\sum_{i_{1},\dots,i_{k-1},i_{k+1},\dots i_{K}}1/|\lambda_{1,i_{1}}+\dots+\lambda_{K,i_{K}}|\leq m_{k}/|\lambda_{k,i_{k}}|$ .
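The eigenvalue form of the Kronecker sum log determinant and the derivative bound stated above can be checked numerically. The sketch below assumes the log determinant equals $\sum_{i_{1},\dots,i_{K}}\log(\lambda_{1,i_{1}}+\dots+\lambda_{K,i_{K}})$, which follows from the additivity of eigenvalues in a Kronecker sum; the factor matrices and helper names are hypothetical, and the explicit $p\times p$ matrix is formed only for verification.

```python
import numpy as np
from functools import reduce

def kron_sum(mats):
    """Explicit Kronecker sum A_1 (+) ... (+) A_K (verification only)."""
    dims = [M.shape[0] for M in mats]
    p = int(np.prod(dims))
    out = np.zeros((p, p))
    for k, M in enumerate(mats):
        left = int(np.prod(dims[:k]))
        right = int(np.prod(dims[k + 1:]))
        out += np.kron(np.kron(np.eye(left), M), np.eye(right))
    return out

def eig_sum_tensor(lams):
    """Tensor with entries lam_1[i_1] + ... + lam_K[i_K]."""
    return reduce(np.add.outer, lams)

def logdet_kron_sum(lams):
    """Log determinant from factor eigenvalues only."""
    return np.sum(np.log(eig_sum_tensor(lams)))

def dlogdet_dlam(lams, k):
    """Partial derivatives with respect to each eigenvalue of factor k."""
    T = eig_sum_tensor(lams)
    other_axes = tuple(a for a in range(T.ndim) if a != k)
    return np.sum(1.0 / T, axis=other_axes)

# Hypothetical positive definite factors with K = 3.
Psis = [np.array([[2.0, -1.0], [-1.0, 2.0]]),
        np.array([[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]),
        np.array([[1.5, 0.5], [0.5, 1.5]])]
lams = [np.linalg.eigvalsh(P) for P in Psis]
```

Each derivative is a sum of $m_{k}$ terms of the form $1/(\lambda_{k,i_{k}}+\text{other eigenvalues})$, so it is bounded by $m_{k}/\lambda_{k,i_{k}}$ whenever the other eigenvalues are nonnegative.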

Correspondingly, the log determinant of a Kronecker product is

[TABLE]

Observe that the partial derivative of the log determinant with respect to any one eigenvalue $\lambda_{k,i_{k}}$ is $m_{k}/|\lambda_{k,i_{k}}|$ .

Thus, the geometry of the Kronecker sum log determinant term is significantly flatter than the Kronecker product log determinant, especially for larger $K$ , indicating that the Kronecker sum estimator (TeraLasso) will enjoy more flexibility when matching the sample covariances than a Kronecker product method will.

A parallel interpretation can be obtained by recalling that the Kronecker sum of two sparse graphs is significantly sparser than the Kronecker product of the same two graphs, as discussed in the introduction of the main text.
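The sparsity comparison can be made concrete with a small count. The sketch below uses two hypothetical $4\times 4$ tridiagonal (chain graph) factors and counts the off-diagonal nonzeros of their Kronecker sum and Kronecker product; the helper names are illustrative only.

```python
import numpy as np

def chain_factor(d):
    """Tridiagonal d x d matrix: 2 on the diagonal, -1 on the first off-diagonals."""
    return 2.0 * np.eye(d) - np.eye(d, k=1) - np.eye(d, k=-1)

def offdiag_nnz(M):
    """Number of nonzero off-diagonal entries."""
    return int(np.count_nonzero(M - np.diag(np.diag(M))))

d1, d2 = 4, 4
Psi1, Psi2 = chain_factor(d1), chain_factor(d2)

kron_sum_12 = np.kron(Psi1, np.eye(d2)) + np.kron(np.eye(d1), Psi2)
kron_prod_12 = np.kron(Psi1, Psi2)

nnz_sum = offdiag_nnz(kron_sum_12)     # m_1 * 6 + m_2 * 6 = 48
nnz_prod = offdiag_nnz(kron_prod_12)   # 10 * 10 - 16 = 84
```

The Kronecker sum count grows additively in the factor edge counts, while the Kronecker product count grows multiplicatively, so the gap widens as the factors grow.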

Appendix D Identifiable Parameterization of $\mathcal{K}_{\mathbf{p}}$

Observe that for any scalar $c$

[TABLE]

and thus the trace of each factor is non-identifiable, and we can write

[TABLE]

where $c_{k}$ are any scalars such that $\sum_{k=1}^{K}c_{k}=0$ .
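The trace ambiguity is easy to verify numerically for $K=2$: shifting one factor by $cI$ and the other by $-cI$ leaves the Kronecker sum unchanged. The sketch below uses hypothetical random symmetric factors and the convention $A_{1}\oplus A_{2}=A_{1}\otimes I_{d_{2}}+I_{d_{1}}\otimes A_{2}$.

```python
import numpy as np

def kron_sum2(A1, A2):
    """Two-factor Kronecker sum A1 (+) A2."""
    d1, d2 = A1.shape[0], A2.shape[0]
    return np.kron(A1, np.eye(d2)) + np.kron(np.eye(d1), A2)

rng = np.random.default_rng(0)
A1 = rng.standard_normal((3, 3)); A1 = A1 + A1.T
A2 = rng.standard_normal((4, 4)); A2 = A2 + A2.T
c = 1.7  # arbitrary scalar shift

B = kron_sum2(A1, A2)
B_shifted = kron_sum2(A1 + c * np.eye(3), A2 - c * np.eye(4))
same = np.allclose(B, B_shifted)   # the two factorizations give the same matrix
```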

The following lemma addresses this trace ambiguity, and creates an orthogonal, identifiable decomposition of $\Omega$ into factors.

Based on the original parameterization

[TABLE]

we know that the number of degrees of freedom in $B$ is much smaller than the number of elements $p^{2}$ . We thus seek a lower-dimensional parameterization of $B$ . The Kronecker sum parameterization is not identifiable on the diagonals, so we seek a representation of $B$ that is identifiable. In the main text, we noted that $\mathrm{diag}(B)+\mathrm{offd}(A_{1})\oplus\dots\oplus\mathrm{offd}(A_{K})$ is identifiable (where $\mathrm{offd}(A)=A-\mathrm{diag}(A)$ ), but $\mathrm{diag}(B)$ cannot be a parameter of the model since not all diagonal vectors can be expressed as a Kronecker sum. Hence while this diagonal-based decomposition is useful for stating identifiable factorwise error bounds, it does not truly serve as a parameterization. We show in Lemma D.9 that the space $\mathcal{K}_{\mathbf{p}}$ is linearly, identifiably, and orthogonally parameterized by the quantities $\left(\tau_{B}\in\mathbb{R},\left\{\widetilde{A}_{k}\in\{A\in\mathbb{R}^{d_{k}\times d_{k}}|{\rm tr}(A)\equiv 0\}\right\}_{k=1}^{K}\right)$ . Specifically,

Lemma D.9.

Let $B\in\mathcal{K}_{\mathbf{p}}$ and $B=A_{1}\oplus\dots\oplus A_{K}\in\mathcal{K}_{\mathbf{p}}$ . Then $B$ can be identifiably written as

[TABLE]

where ${\rm tr}(\widetilde{A}_{k})\equiv 0$ and the identifiable parameters $(\tau_{B},\{\widetilde{A}_{k}\}_{k=1}^{K})$ can be computed as

[TABLE]

By orthogonality, the Frobenius norm can be decomposed as

[TABLE]

noting that

[TABLE]

Proof D.10.

Part I: Identifiable Parameterization. Let $B\in{\mathcal{K}}_{\mathbf{p}}$ . By definition, there exist $A_{1},\dots,A_{K}$ such that

[TABLE]

where $\tau_{k}={\rm tr}(A_{k})/d_{k}$ . Observe that ${\rm tr}(A_{k}-\tau_{k}I_{d_{k}})=0$ by construction, so we can set $\widetilde{A}_{k}=A_{k}-\tau_{k}I_{d_{k}}$ , creating

[TABLE]

Note that in this representation, ${\rm tr}(\widetilde{A}_{1}\oplus\dots\oplus\widetilde{A}_{K})=0$ , so letting $\tau_{B}={\rm tr}(B)/p$ ,

[TABLE]

and (39) in the Lemma results. It is easy to verify any $B$ expressible in the form (39) is in ${\mathcal{K}}_{\mathbf{p}}$ .

Thus, $(\tau_{B},\{\widetilde{A}_{k}\}_{k=1}^{K})$ parameterizes ${\mathcal{K}}_{\mathbf{p}}$ . It remains to show that this parameterization is identifiable.

Part II: Orthogonal Parameterization. We will show that under the linear parameterization of $\mathcal{K}_{\mathbf{p}}$ by $(\tau_{B},\{\widetilde{A}_{k}\}_{k=1}^{K})$ , each of the $K+1$ components is linearly independent of the others.

To see this, we compute the inner products between the components:

[TABLE]

for all $k\neq\ell$ . We have recalled that by definition, ${\rm tr}(\widetilde{A}_{k})\equiv 0$ for all $k$ . Since all the inner products are identically zero, the components are orthogonal, thus they are linearly independent. Hence, by the definition of linear independence, this linear parameterization $(\tau_{B},\{\widetilde{A}_{k}\}_{k=1}^{K})$ is uniquely determined by $B\in\mathcal{K}_{\mathbf{p}}$ (i.e. it is identifiable).

Part III: Decomposition of Frobenius norm. Using the identifiability and orthogonality of this parameterization, we can find a direct factorwise decomposition of the Frobenius norm on $\mathcal{K}_{\mathbf{p}}$ .

By orthogonality (cross term inner products equal to zero)

[TABLE]

This completes the first decomposition, representing the squared Frobenius norm as weighted sum of the squared Frobenius norms on each component.

For convenience, we also observe that given any $B\in\mathcal{K}_{\mathbf{p}}$ with identifiable parameterization

[TABLE]

we can absorb the scaled identity into the Kronecker sum and still bound the Frobenius norm decomposition. Specifically, observe that

[TABLE]

Substituting this into (41),

[TABLE]

where the last term follows because ${\rm tr}(\widetilde{A}_{k})\equiv 0$ implies that $\langle I_{d_{k}},\widetilde{A}_{k}\rangle\equiv 0$ .

Observe that

[TABLE]

hence Lemma D.9 is proved.

∎
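The construction in the proof can be sketched numerically. The code below follows the steps stated in Part I ($\tau_{k}={\rm tr}(A_{k})/d_{k}$, $\widetilde{A}_{k}=A_{k}-\tau_{k}I_{d_{k}}$, $\tau_{B}={\rm tr}(B)/p$) and checks the orthogonal decomposition $\|B\|_{F}^{2}=p\tau_{B}^{2}+\sum_{k}m_{k}\|\widetilde{A}_{k}\|_{F}^{2}$, which is what the vanishing cross terms of Part II imply. The identity $\tau_{B}=\sum_{k}\tau_{k}$ used in the code is derived from ${\rm tr}(B)=\sum_{k}m_{k}{\rm tr}(A_{k})$; the helper names and the random test factors are hypothetical.

```python
import numpy as np

def kron_sum(mats):
    """Explicit Kronecker sum A_1 (+) ... (+) A_K."""
    dims = [M.shape[0] for M in mats]
    p = int(np.prod(dims))
    out = np.zeros((p, p))
    for k, M in enumerate(mats):
        left = int(np.prod(dims[:k]))
        right = int(np.prod(dims[k + 1:]))
        out += np.kron(np.kron(np.eye(left), M), np.eye(right))
    return out

def identifiable_params(factors):
    """Return (tau_B, [A~_k]) with tr(A~_k) = 0 and B = tau_B I_p + A~_1 (+) ... (+) A~_K."""
    taus = [np.trace(A) / A.shape[0] for A in factors]
    tilde = [A - t * np.eye(A.shape[0]) for A, t in zip(factors, taus)]
    tau_B = sum(taus)   # equals tr(B) / p
    return tau_B, tilde

rng = np.random.default_rng(1)
factors = []
for d in (2, 3, 4):
    A = rng.standard_normal((d, d))
    factors.append(A + A.T)

B = kron_sum(factors)
p = B.shape[0]
tau_B, tilde = identifiable_params(factors)
frob_sq = p * tau_B ** 2 + sum((p / A.shape[0]) * np.sum(A ** 2) for A in tilde)
```

Because the trace shifts cancel inside $\tau_{B}$, applying `identifiable_params` to any other factorization of the same $B$ returns the same parameters.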

The identifiable parameterization of $\mathcal{K}_{\mathbf{p}}$ in Lemma D.9 will provide a way to bound the spectral norm relative to the Frobenius norm. This is used to form the spectral norm bound in Theorem 2.

The following lemma is also used in the proof of Theorem 1 (cf. Proposition E.26).

Lemma D.11 (Spectral Norm Bound).

For all $B\in\mathcal{K}_{\mathbf{p}}$ ,

[TABLE]

Proof D.12.

Using the identifiable parameterization of $B$

[TABLE]

and the triangle inequality, we have

[TABLE]

∎

D.1 Inner Product in $\mathcal{K}_{\mathbf{p}}$

Lemma D.13 (Kronecker sum inner Products).

Suppose $B\in\mathbb{R}^{p\times p}$ . Then for any $A_{k}\in\mathbb{R}^{d_{k}\times d_{k}}$ , $k=1,\dots,K$ ,

[TABLE]

Proof D.14.

[TABLE]

where we have used the definition of the submatrix notation $B(i,i|k)$ and the matrices $B_{k}=\frac{1}{m_{k}}\sum_{i=1}^{m_{k}}B(i,i|k)$ . See Appendix I for the notation being used here. ∎
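For $K=2$ the factorwise inner product can be sketched as follows. It is an assumption of this sketch that the averaged submatrices $B_{k}$ coincide with normalized partial traces of $B$ under row-major indexing, and that the identity takes the form $\langle B,A_{1}\oplus A_{2}\rangle=\sum_{k}m_{k}\langle B_{k},A_{k}\rangle$; the submatrix notation itself is defined in Appendix I and the function name is hypothetical.

```python
import numpy as np

def factor_averages(B, d1, d2):
    """Return B_1 (d1 x d1) and B_2 (d2 x d2) as normalized partial traces of B."""
    B4 = B.reshape(d1, d2, d1, d2)
    B1 = np.einsum('ajbj->ab', B4) / d2   # average over the m_1 = d2 diagonal blocks
    B2 = np.einsum('iaib->ab', B4) / d1   # average over the m_2 = d1 diagonal blocks
    return B1, B2

rng = np.random.default_rng(2)
d1, d2 = 3, 4
B = rng.standard_normal((d1 * d2, d1 * d2))
A1 = rng.standard_normal((d1, d1))
A2 = rng.standard_normal((d2, d2))

B1, B2 = factor_averages(B, d1, d2)
lhs = np.sum(B * (np.kron(A1, np.eye(d2)) + np.kron(np.eye(d1), A2)))
rhs = d2 * np.sum(B1 * A1) + d1 * np.sum(B2 * A2)
```

The right-hand side only touches $d_{k}\times d_{k}$ matrices, which is what allows norms and inner products to be evaluated factorwise.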

Appendix E Proof of Theorems 1 and 2 ( $\ell 1$ regularized case)

Let $\Omega_{0}$ be the true value of the precision matrix $\Omega$ . Since $\Omega,\Omega_{0}\in\mathcal{K}_{\mathbf{p}}$ and $\mathcal{K}_{\mathbf{p}}$ is convex, $\Delta_{\Omega}={\Omega}-\Omega_{0}\in\mathcal{K}_{\mathbf{p}}$ and we can decompose $\Delta_{\Omega}$ into diagonal and Kronecker sum off diagonal components:

[TABLE]

where $\mathrm{diag}(\Delta_{\Omega})=\mathrm{diag}({\Omega}-\Omega_{0})$ and $\mathrm{offd}(\Delta_{\Psi,k})=\mathrm{offd}({\Psi}_{k}-{\Psi}_{0,k})$ . Recall that the $\mathrm{diag}(\Delta_{\Omega})$ and $\mathrm{offd}(\Delta_{\Psi,k})$ terms are all identifiable given $\Delta_{\Omega}\in\mathcal{K}_{\mathbf{p}}$ . Similarly, we can write

[TABLE]

Let $I(\cdot)$ be the indicator function. For an index set $\mathcal{A}$ and a matrix $M=[m_{ij}]$ , define the operator $\mathcal{P}_{\mathcal{A}}(M)\equiv[m_{ij}I((i,j)\in\mathcal{A})]$ that projects $M$ onto the set $\mathcal{A}$ . Let $\Delta_{k,S}=\mathcal{P}_{\mathcal{S}_{k}}(\mathrm{offd}(\Delta_{\Psi,k}))$ be the projection of $\mathrm{offd}(\Delta_{\Psi,k})$ onto the true sparsity pattern of $\Psi_{k}$ . Let ${\mathcal{S}}_{k}^{c}$ be the complement of $\mathcal{S}_{k}$ , and $\Delta_{k,S^{c}}=\mathcal{P}_{{\mathcal{S}}^{c}_{k}}(\mathrm{offd}(\Delta_{\Psi,k}))$ . Furthermore, let

[TABLE]

be the projection of $\Delta_{\Omega}$ onto the sparsity set $\mathcal{S}$ and its complement. Recall neither $\mathcal{S}$ nor $\mathcal{S}^{c}$ includes the diagonal.
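The projection operator $\mathcal{P}_{\mathcal{A}}$ is an entrywise mask, which the following sketch makes explicit for a single factor. The support pattern and the matrix are hypothetical; the sketch only illustrates that the projections onto $\mathcal{S}_{k}$ and $\mathcal{S}_{k}^{c}$ (both excluding the diagonal) split the off-diagonal part and its elementwise $\ell_{1}$ norm.

```python
import numpy as np

def offd(M):
    """Off-diagonal part M - diag(M)."""
    return M - np.diag(np.diag(M))

def project(M, mask):
    """P_A(M): keep entries whose index is in the set A (boolean mask), zero elsewhere."""
    return np.where(mask, M, 0.0)

d = 4
# Hypothetical true factor with a chain-graph sparsity pattern.
Psi0 = 2.0 * np.eye(d) - np.eye(d, k=1) - np.eye(d, k=-1)
off_mask = ~np.eye(d, dtype=bool)
S_mask = (Psi0 != 0) & off_mask        # true off-diagonal support
Sc_mask = (Psi0 == 0) & off_mask       # its complement, diagonal excluded

rng = np.random.default_rng(3)
Delta = rng.standard_normal((d, d))
Delta = Delta + Delta.T

Delta_S = project(offd(Delta), S_mask)
Delta_Sc = project(offd(Delta), Sc_mask)
```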

We now provide a deterministic bound on the difference in the penalty terms.

Lemma E.15.

Denote by

[TABLE]

Then

[TABLE]

Proof of Lemma E.15. By the decomposability of the $\ell_{1}$ norm and the reverse triangle inequality $|A+B|_{1}\geq|A|_{1}-|B|_{1}$ , we have

[TABLE]

since $\Psi_{k,0}$ is assumed to follow sparsity pattern $\mathcal{S}_{k}$ by (A1). ∎

Let $\mathcal{A}_{0}$ be the event that for some constant $C_{0}$ ,

[TABLE]

and for each $k=1,\dots,K$ , denote by $\mathcal{A}_{k}$ the event such that

[TABLE]

holds for some absolute constant $C_{0}$ which is chosen such that the probability statement in Lemma E.16 holds:

Lemma E.16.

Let $\mathcal{A}=\cap_{k=0}^{K}\mathcal{A}_{k}$ as in (46), (45). Then $\mathbb{P}(\mathcal{A})\geq 1-2(K+1)\exp(-c\log p)$ .

Lemma E.16 is proved in Section F. Using the definition of event $\mathcal{A}$ , in Section E.2 we prove the following lemma.

Lemma E.17.

Denote by $\delta_{n,k}=C_{1}\left\lVert\Sigma_{0}\right\rVert_{2}\sqrt{\frac{\log p}{nm_{k}}}$ . Then on event $\mathcal{A}$ the following holds: for all $\Delta_{\Omega}$ as in (42)

[TABLE]

where $C_{0}$ are some absolute constants.

We then have the following lemma, which we prove in Section E.3.

Lemma E.18.

On event $\mathcal{A}$ , we have for $\Delta_{\Omega}\in\mathcal{K}_{\mathbf{p}}$ ,

[TABLE]

where $C_{1}$ is an absolute constant.

E.1 Proof of Theorem 1

Let

[TABLE]

be the difference between the objective function (10) at $\Omega_{0}+\Delta_{\Omega}$ and at $\Omega_{0}$ . Clearly $\widehat{\Delta}_{\Omega}=\widehat{\Omega}-\Omega_{0}$ minimizes $G(\Delta_{\Omega})$ , which is a convex function with a unique minimizer on $\mathcal{K}_{\mathbf{p}}^{\sharp}$ (cf. Theorem B.7). Define

[TABLE]

where for some large enough absolute constant $C$ to be specified,

[TABLE]

In particular, we set $C>9(\max_{k}\frac{1}{\varepsilon_{k}}\vee C_{1})$ for $C_{1}$ as in Lemma E.18.

Proposition E.19 follows from Zhou et al. (2010).

Proposition E.19.

If $G(\Delta)>0$ for all $\Delta\in\mathcal{T}_{n}$ as defined in (49), then $G(\Delta)>0$ for all $\Delta$ in

[TABLE]

for $r_{n,\mathbf{p}}$ (E.1). Hence if $G(\Delta)>0$ for all $\Delta\in\mathcal{T}_{n}$ , then $G(\Delta)>0$ for all $\Delta\in\mathcal{T}_{n}\cup\mathcal{V}_{n}$ .

Proof E.20.

By contradiction, suppose $G(\Delta^{\prime})\leq 0$ for some $\Delta^{\prime}\in\mathcal{V}_{n}$ . Let $\Delta_{0}=\frac{Mr_{n,\mathbf{p}}}{\|\Delta^{\prime}\|_{F}}\Delta^{\prime}$ . Then $\Delta_{0}=\theta\mathbf{0}+(1-\theta)\Delta^{\prime}$ , where $0<1-\theta=\frac{Mr_{n,\mathbf{p}}}{\|\Delta^{\prime}\|_{F}}<1$ by definition of $\Delta_{0}$ . Hence $\Delta_{0}\in\mathcal{T}_{n}$ since by the convexity of the positive definite cone $\Omega_{0}+\Delta_{0}\succ 0$ because $\Omega_{0}\succ 0$ and $\Omega_{0}+\Delta^{\prime}\succ 0$ . By the convexity of $G(\Delta)$ , we have that $G(\Delta_{0})\leq\theta G(\mathbf{0})+(1-\theta)G(\Delta^{\prime})\leq 0$ , contradicting our assumption that $G(\Delta_{0})>0$ for $\Delta_{0}\in\mathcal{T}_{n}$ . ∎

Proposition E.21.

Suppose $G(\Delta_{\Omega})>0$ for all $\Delta_{\Omega}\in\mathcal{T}_{n}$ . We then have that

[TABLE]

Proof E.22.

By definition, $G(0)=0$ , so $G(\widehat{\Delta}_{\Omega})\leq G(0)=0$ . Thus if $G(\Delta_{\Omega})>0$ on $\mathcal{T}_{n}$ , then by Proposition E.19 (section D.1), $\widehat{\Delta}_{\Omega}\notin\mathcal{T}_{n}\cup\mathcal{V}_{n}$ where $\mathcal{V}_{n}$ is defined therein. The proposition results. ∎

Lemma E.23.

Under (A1) - (A3), for all $\Delta\in{\mathcal{T}_{n}}$ for which $r_{n,\mathbf{p}}=o\left(\sqrt{\frac{\min_{k}m_{k}}{K+1}}\right)$ ,

[TABLE]

The proof is in Section E.4.

By Proposition E.21, it remains to show that $G(\Delta_{\Omega})>0$ on $\mathcal{T}_{n}$ under event $\mathcal{A}$ . We show this indeed holds.

Lemma E.24.

On event $\mathcal{A}$ , we have $G(\Delta)>0$ for all $\Delta\in\mathcal{T}_{n}$ .

Proof E.25.

Throughout this proof, we assume that event $\mathcal{A}$ holds. By Lemma E.23, if $r_{n,\mathbf{p}}\leq\sqrt{\min_{k}m_{k}/(K+1)}$ , we can write (48) using the objective (10),

[TABLE]

We next bound the inner product term under event $\mathcal{A}$ . Substituting the bound of Lemma E.17 and (43) into (E.25), under event $\mathcal{A}$ , we have by choice of $\rho_{k}=\delta_{n,k}/\varepsilon_{k}$ where $0<\varepsilon_{k}<1$ for all $k$ ,

[TABLE]

For the diagonal part, we have by Lemma E.18

[TABLE]

we have for all $\Delta_{\Omega}\in\mathcal{T}_{n}$ , and $C^{\prime\prime}=\max_{k}(\frac{2}{\varepsilon_{k}})\vee\sqrt{2}C_{1}$ , and for $K\geq 1$ ,

[TABLE]

which holds for all $\Delta_{\Omega}\in\mathcal{T}_{n}$ , where we use the following bounds for all $K\geq 1$ :

[TABLE]

and

[TABLE]

where $M=\frac{1}{2\phi_{\min}^{2}(\Sigma_{0})}$ , which holds so long as $C$ is chosen to be large enough in

[TABLE]

For example, we set $C>9C^{\prime\prime}=9(\max_{k}(\frac{2}{\varepsilon_{k}})\vee\sqrt{2}C_{1})$ .

Theorem 1 follows from Proposition E.21 immediately. ∎

E.2 Proof of Lemma E.17

Assume that the event $\mathcal{A}$ of Lemma E.16 holds. Using the definition of $\Delta_{\Omega}$ (42), the projection operator $\mathrm{Proj}_{\widetilde{K}_{\mathbf{p}}}(\cdot)$ , and letting $\tau_{\Sigma}=(K-1)\frac{{\rm tr}(\widehat{S}-\Sigma_{0})}{p}$ , we have

[TABLE]

where we have used the fact that $\mathrm{offd}(\Delta_{\Psi,1})\oplus\dots\oplus\mathrm{offd}(\Delta_{\Psi,K})$ is zero along the diagonal and thus has zero inner product with $I_{p}$ . Substituting Lemma E.18 and the definitions of subevents under $\mathcal{A}$ , we have by (E.2) and Lemma D.13,

[TABLE]

∎

E.3 Proof of Lemma E.18: Bound on Inner Product for Diagonal

Let $\widetilde{\Delta}_{\Omega}=\Delta_{\Omega}-\tau_{\Omega}I_{p}$ . Recall the identifiable parameterization of $\Delta_{\Omega}$ (Lemma D.9)

[TABLE]

where $\tau_{\Omega}={\rm tr}(\Delta_{\Omega})/p$ and $\widetilde{\Delta}_{\Psi,k}$ are given in the lemma. We then have ${\rm tr}(\widetilde{\Delta}_{\Psi,j})=0$ and

[TABLE]

by orthogonality of the decomposition. By Lemma D.13, we can write

[TABLE]

Moreover, under $\mathcal{A}_{0}$ , we have

[TABLE]

Summing these terms together, we have

[TABLE]

where in (E.3), we have used the following inequality in view of (54):

[TABLE]

∎

E.4 Proof of Lemma E.23

We first state Proposition E.26.

Proposition E.26.

Under (A1)-(A3), for all $\Delta\in{\mathcal{T}_{n}}$ ,

[TABLE]

so that $\Omega_{0}+v\Delta\succ 0,\forall v\in I\supset[0,1]$ , where $I$ is an open interval containing $[0,1]$ .

Proof E.27.

We first show that (56) holds for $\Delta\in{\mathcal{T}_{n}}$ . Indeed, by Lemma D.11, we have for all $\Delta\in\mathcal{T}_{n}$

[TABLE]

so long as

[TABLE]

where $\kappa(\Sigma_{0})=\phi_{\max}(\Sigma_{0})/\phi_{\min}(\Sigma_{0})$ is the condition number of $\Sigma_{0}$ .

Next, it is sufficient to show that $\Omega_{0}+(1+\varepsilon)\Delta\succ 0$ and $\Omega_{0}-\varepsilon\Delta\succ 0$ for some $1>\varepsilon>0$ . Indeed, for $\varepsilon<1$ ,

[TABLE]

by the definition of $\mathcal{T}_{n}$ and (56).

Thus we have that $\log|\Omega_{0}+v\Delta|$ is infinitely differentiable on the open interval $I\supset[0,1]$ of $v$ . This allows us to use Taylor’s formula with integral remainder to prove Lemma E.23, drawn from Rothman et al. (2008).

Let us use $A$ as a shorthand for

[TABLE]

where $\mathrm{vec}(\Delta)\in\mathbb{R}^{p^{2}}$ is $\Delta_{p\times p}$ vectorized. Now, the Taylor expansion gives

[TABLE]

The last inequality holds because $\nabla_{\Omega}\log|\Omega|=\Omega^{-1}$ and $\Omega_{0}^{-1}=\Sigma_{0}$ .

We now bound $A$ , following arguments from Zhou et al. (2011) and Rothman et al. (2008).

[TABLE]

Now, suppose that

[TABLE]

where, by (56), we have for all $\Delta\in\mathcal{T}_{n}$ ,

[TABLE]

so long as the condition in (A3) holds, namely,

[TABLE]

Hence,

[TABLE]

Thus, substituting into (57), the lemma is proved. ∎

E.5 Proof of Theorem 2: Factorwise and Spectral Norm Bounds

Proof E.28.

Part I: Factor-wise bound. From the proof of Theorem 1, we know that under event $\mathcal{A}$ ,

[TABLE]

Furthermore, since the identifiable parameterizations of $\widehat{\Omega},\Omega_{0}$ are of the form (40) (by construction in Lemma D.9)

[TABLE]

we have that the identifiable parameterization of $\Delta_{\Omega}$ is

[TABLE]

where $\tau_{\Delta}=\widehat{\tau}-\tau_{0}$ , $\widetilde{\Delta}_{k}=\widetilde{\Psi}_{k}-\widetilde{\Psi}_{0,k}$ . Observe that ${\rm tr}(\widetilde{\Delta}_{k})={\rm tr}(\widetilde{\Psi}_{k})-{\rm tr}(\widetilde{\Psi}_{0,k})=0$ .

By Lemma D.9 then,

[TABLE]

Thus, the estimation error on the underlying parameters is bounded by (58)

[TABLE]

or, dividing both sides by $p$

[TABLE]

Recall that $s=\sum_{k=1}^{K}m_{k}s_{k}$ , so $\frac{s}{p}=\sum_{k=1}^{K}\frac{s_{k}}{d_{k}}$ . Substituting into (60)

[TABLE]

From this, it can be seen that the bound converges as the $m_{k}$ increase with constant $K$ . To put the bound in the form stated in the theorem, note that since $\tau_{\Delta}I_{p}+(\widetilde{\Delta}_{1}^{+}\oplus\dots\oplus\widetilde{\Delta}_{K}^{+})$

[TABLE]

Part II: Spectral norm bound. The factor-wise bound immediately implies the bound on the spectral norm $\|\Delta_{\Omega}\|_{2}$ of the error under event $\mathcal{A}$ . We recall the identifiable representation (59)

[TABLE]

By Property 3a in Appendix I and the fact that the spectral norm is upper bounded by the Frobenius norm,

[TABLE]

where in the second line, we have used the fact that for $a_{k}$ elements of $\mathbf{a}\in\mathbb{R}^{K}$ the norm relation $\|\mathbf{a}\|_{1}\leq\sqrt{K}\|\mathbf{a}\|_{2}$ implies $(\sum_{k=1}^{K}|a_{k}|)\leq\sqrt{K}\sqrt{\sum_{k=1}^{K}a_{k}^{2}}$ . ∎
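As a quick sanity check of the two elementary facts used above, the following sketch verifies numerically that $s/p=\sum_{k}s_{k}/d_{k}$ when $s=\sum_{k}m_{k}s_{k}$ and $m_{k}=p/d_{k}$ , and that $\|\mathbf{a}\|_{1}\leq\sqrt{K}\|\mathbf{a}\|_{2}$ . The dimensions `d`, the sparsity counts `s_k`, and the vector `a` are hypothetical values chosen only for illustration.

```python
import math

# Hypothetical factor dimensions d_k and per-factor sparsity counts s_k.
d = [4, 5, 6]
s_k = [3, 2, 7]
K = len(d)

p = math.prod(d)                                  # p = prod_k d_k
m = [p // dk for dk in d]                         # m_k = p / d_k
s = sum(mk * sk for mk, sk in zip(m, s_k))        # s = sum_k m_k s_k

lhs = s / p
rhs = sum(sk / dk for sk, dk in zip(s_k, d))      # sum_k s_k / d_k

# Norm relation ||a||_1 <= sqrt(K) ||a||_2 for an arbitrary a in R^K.
a = [0.3, -1.2, 2.5]
l1 = sum(abs(x) for x in a)
l2 = math.sqrt(sum(x * x for x in a))
```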

Appendix F Proof of Lemma E.16: Subgaussian Concentration

We first state the following concentration result, proved in Section F.1. Recall that $m_{k}=p/d_{k}$ .

Lemma F.29 (Subgaussian Concentration).

Suppose that $\log p\ll m_{k}n$ for all $k$ . Then, with probability at least $1-2\exp(-c^{\prime}\log p)$ ,

[TABLE]

for all $\Delta\in\mathbb{R}^{d_{k}\times d_{k}}$ , where $c^{\prime}$ is a constant depending on $C$ given in the proof.

We can now prove Lemma E.16.

Proof F.30.

By Lemma F.29 we have that event $\mathcal{A}_{k}$ (46), i.e. the event that

[TABLE]

holds with probability at least $1-2\exp(-c^{\prime}\log p)$ .

Note that $\mathbb{E}[\mathrm{tr}(\widehat{S})]=\mathrm{tr}(\Sigma_{0})$ . Viewing $\frac{1}{p}\mathrm{tr}(\Sigma_{0})$ as a $1\times 1$ covariance factor since $\frac{1}{p}{\rm tr}(\widehat{S})=\frac{1}{pn}\sum_{i=1}^{n}\mathrm{vec}(X_{i})^{T}\mathrm{vec}(X_{i})$ , we can invoke the proof of Lemma F.29 and show that with probability at least $1-2\exp(-c^{\prime}\log p)$ the event $\mathcal{A}_{0}$ (45) will hold. Recall that $\mathcal{A}=\mathcal{A}_{0}\cap\mathcal{A}_{1}\cap\dots\cap\mathcal{A}_{K}$ . By the union bound, we have $\mathbb{P}(\mathcal{A})\geq 1-2(K+1)\exp(-c\log p)$ .∎

F.1 Proof of Lemma F.29

Define a $K$ -way generalization of the invertible Pitsianis-Van Loan type (Van Loan and Pitsianis, 1993) rearrangement operator $\mathcal{R}_{k}(\cdot)$ , which maps $p\times p$ matrices to $d_{k}^{2}\times m_{k}^{2}$ matrices. For a matrix $M\in\mathbb{R}^{p\times p}$ we set

[TABLE]

where we use the $M(i,j|k)\in\mathbb{R}^{d_{k}\times d_{k}}$ subblock notation (see Section 2 in the main text). Using this notation, we have the following concentration result.

Lemma F.31.

Let $\mathbf{u}\in S^{d_{k}^{2}-1}$ and $\mathbf{f}=\mathrm{vec}(I_{m_{k}})$ . Assume that $\mathbf{x}_{t}={\Sigma}_{0}^{1/2}\mathbf{z}_{t}$ where $\mathbf{z}_{t}$ has independent entries $z_{t,f}$ such that $\mathbb{E}z_{t,f}=0$ , $\mathbb{E}z_{t,f}^{2}=1$ , and $\|z_{t,f}\|_{\psi_{2}}\leq K$ . Let ${\Delta}_{n}={{\widehat{S}}}-{\Sigma}_{0}$ . Then for all $0\leq\frac{\epsilon}{\sqrt{m_{k}}}<\frac{1}{2}$ :

[TABLE]

where $c$ is an absolute constant and $\|\cdot\|_{\psi_{2}}$ is the subgaussian norm.

Proof F.32.

We prove the lemma for $k=1$ . The proof for the remaining $k$ follows similarly.

By the definition (63) of the permutation operator $\mathcal{R}_{1}$ and letting $\mathbf{x}_{t}(i)=[x_{t,(i-1)m_{1}+1},\dots,x_{t,im_{1}}]$ ,

[TABLE]

Hence,

[TABLE]

where ${M}={\Sigma}_{0}^{1/2}({U}\otimes{I}_{m_{1}}){\Sigma}_{0}^{1/2}$ , ${U}=\mathrm{vec}^{-1}_{d_{1},d_{1}}(\mathbf{u})$ .
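The step above rests on rewriting a blockwise sum as a single quadratic form. The following sketch, under hypothetical sizes `d1` and `m1`, checks numerically that $\sum_{i,j}U_{ij}\langle\mathbf{x}(i),\mathbf{x}(j)\rangle=\mathbf{x}^{T}(U\otimes I_{m_{1}})\mathbf{x}$ for the chunking $\mathbf{x}(i)$ defined in the proof, together with the two norm identities for $U\otimes I_{m_{1}}$ used in the Hanson-Wright step. It illustrates the algebra only and does not reproduce the rearrangement operator itself.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, m1 = 3, 4                                   # hypothetical sizes, p = d1 * m1
x = rng.standard_normal(d1 * m1)

# U = vec^{-1}(u) with u on the unit sphere, so ||U||_F = 1.
U = rng.standard_normal((d1, d1))
U /= np.linalg.norm(U, "fro")

# Split x into d1 consecutive chunks x(1), ..., x(d1), each of length m1.
chunks = x.reshape(d1, m1)

# Blockwise form: sum_{i,j} U_ij <x(i), x(j)>.
blockwise = float(np.sum(U * (chunks @ chunks.T)))

# Quadratic form: x^T (U kron I_{m1}) x.
UI = np.kron(U, np.eye(m1))
kron_form = float(x @ UI @ x)

spec_UI, spec_U = np.linalg.norm(UI, 2), np.linalg.norm(U, 2)
fro_sq_UI = np.linalg.norm(UI, "fro") ** 2      # should equal m1 * ||U||_F^2 = m1
```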

Thus, by the Hanson-Wright inequality (Rudelson et al., 2013),

[TABLE]

since $\|U\otimes{I}_{m_{1}}\|_{2}=\|U\|_{2}\leq 1$ and $\|U\otimes{I}_{m_{1}}\|_{F}^{2}=\|U\|_{F}^{2}\|{I}_{m_{1}}\|_{F}^{2}=m_{1}$ . Substituting $\epsilon=\frac{\tau}{\sqrt{m_{1}}\|\Sigma_{0}\|_{2}}$

[TABLE]

for all $\frac{\epsilon^{2}n}{K^{4}}\leq\frac{\epsilon n\sqrt{m_{1}}}{K^{2}}$ , i.e. $\epsilon\leq K^{2}\sqrt{m_{1}}$ , which holds whenever $\epsilon\leq\frac{\sqrt{m_{1}}}{2}$ since $K^{2}>\frac{1}{2}$ by definition.

∎

We can now prove Lemma F.29.

Proof F.33.

Consider the inner product $\langle{\Delta},{S}_{k}-{\Sigma}_{0}^{(k)}\rangle$ , where ${\Delta}$ is an arbitrary $d_{k}\times d_{k}$ matrix. Let

[TABLE]

By the definition of the factor covariances $S_{k}$ and the rearrangement operator $\mathcal{R}_{k}$ , it can be seen that

[TABLE]

and that similarly by the definition of the factor covariances $\Sigma_{0}^{(k)}$

[TABLE]

Hence,

[TABLE]

by the linearity of the rearrangement operator, the definition of the inner product, and the definition of the unit vector $\mathbf{e}_{i}$ as the $i$ -th column of the $d_{k}^{2}\times d_{k}^{2}$ identity matrix.

We can apply Lemma F.31 and take a union bound over $i=1,\dots,d_{k}^{2}$ . By Lemma F.31,

[TABLE]

for $0\leq\frac{\epsilon}{\sqrt{m_{k}}}\leq\frac{1}{2}$ . Taking the union bound over all $i$ , we obtain

[TABLE]

Setting $\epsilon=C\sqrt{\frac{\log p}{n}}$ for large enough $C$ and recalling that $m_{k}=p/d_{k}$ , with probability at least $1-2\exp(-c^{\prime}\log p)$ we have

[TABLE]

where we assume $\log p\leq\frac{nm_{k}}{4C^{2}}$ and let $c^{\prime}=\frac{cC^{2}}{K^{4}}-2$ . Hence, by (72)

[TABLE]

with probability at least $1-2\exp(-c^{\prime}\log p)$ . The first inequality follows from the triangle inequality and the last inequality from the definition of $\mathbf{h}=\mathrm{vec}(\Delta)$ and $|\cdot|_{1}$ . ∎

Appendix G Nonconvex Regularizers: Proof of Theorem 3

Recall that the support sets $\mathcal{S},\mathcal{S}_{k}$ are the sets of nonzero elements of $\Omega_{0}$ and $\Psi_{k,0}$ , respectively. Define $\mathcal{B}$ to be the set of matrices in $\mathcal{K}_{\mathbf{p}}$ with support contained in $\mathcal{S}$ , that is

[TABLE]

The set $\mathcal{B}$ is the set of Kronecker sum matrices following the true sparsity pattern of the Kronecker sum $\Omega_{0}=\Psi_{1,0}\oplus\dots\oplus\Psi_{K,0}$ .

Note that $\mathcal{B}$ is a linear subspace of $\mathbb{R}^{p\times p}$ since $\mathcal{K}_{\mathbf{p}}$ is a linear subspace and the intersection of two linear subspaces is a linear subspace. Hence the (L2 norm) projection $\mathrm{Proj}_{\mathcal{B}}:\mathbb{R}^{p\times p}\rightarrow\mathcal{B}$ onto $\mathcal{B}$ is given by

[TABLE]

where $\mathrm{Proj}_{\mathcal{S}}$ is the linear projection operator projecting $\mathbb{R}^{p\times p}$ onto matrices in $\mathbb{R}^{p\times p}$ with sparsity pattern $\mathcal{S}$ , and $\mathrm{Proj}_{\mathcal{K}_{\mathbf{p}}}$ is the projection onto $\mathcal{K}_{\mathbf{p}}$ defined in Section 2 of the main text. Note that since the sparsity pattern $\mathcal{S}$ is the sparsity pattern of a Kronecker sum matrix in $\mathcal{K}_{\mathbf{p}}$ , projection onto $\mathcal{S}$ does not change the Kronecker structure.

By reshaping we obtain the representation

[TABLE]

where $\mathcal{P}_{\mathcal{B}}\in\mathbb{R}^{p^{2}\times p^{2}}$ is the projection matrix associated with the linear subspace $\mathcal{B}$ . Recall that $\mathrm{vec}(\cdot)$ is the vectorization operator, and the projection matrix in linear algebra is $UU^{T}$ where $U$ is an orthonormal basis for the subspace.
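To make the $UU^{T}$ construction concrete, the following minimal sketch builds the projection matrix onto the vectorized Kronecker sum subspace for small hypothetical dimensions. It assumes $\mathcal{K}_{\mathbf{p}}$ is the span of all Kronecker sums of square factors, and it obtains an orthonormal basis from an SVD of a spanning set rather than from the closed-form projection of the main text. The helper names `kron_sum` and `subspace_projector` are ours.

```python
import numpy as np

def kron_sum(factors):
    """Kronecker sum Psi_1 (+) ... (+) Psi_K = sum_k I (x) Psi_k (x) I."""
    dims = [f.shape[0] for f in factors]
    p = int(np.prod(dims))
    out = np.zeros((p, p))
    for k, f in enumerate(factors):
        left = np.eye(int(np.prod(dims[:k])))
        right = np.eye(int(np.prod(dims[k + 1:])))
        out += np.kron(np.kron(left, f), right)
    return out

def subspace_projector(dims):
    """Projection matrix U U^T onto vec(K_p), from an SVD of a spanning set."""
    spanning = []
    for k, dk in enumerate(dims):
        for i in range(dk):
            for j in range(dk):
                factors = [np.zeros((d, d)) for d in dims]
                factors[k][i, j] = 1.0          # unit matrix E_ij in factor k
                spanning.append(kron_sum(factors).ravel(order="F"))
    span = np.stack(spanning, axis=1)
    u, sv, _ = np.linalg.svd(span, full_matrices=False)
    basis = u[:, sv > 1e-10 * sv[0]]            # orthonormal basis of the span
    return basis @ basis.T

dims = [2, 3]                                   # hypothetical d_1, d_2
P = subspace_projector(dims)
rng = np.random.default_rng(0)
omega = kron_sum([rng.standard_normal((d, d)) for d in dims])
vec_omega = omega.ravel(order="F")
```

The same recipe gives $\mathcal{P}_{\mathcal{B}}$ if the spanning set is restricted to the unit matrices supported on $\mathcal{S}$ .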

We first summarize the proof of Theorem 3.

Proof plan: The proof concept is to apply the primal-dual witness technique of Loh et al. (2017) to our sparse Kronecker sum precision matrix estimator. Since the nonconvex graphical lasso proof in Loh et al. (2017) relied on the set of $\mathcal{S}$ sparse matrices being a linear subspace of $\mathbb{R}^{p\times p}$ , we can simply replace the sparse subspace in their proof with our sparse Kronecker sum subspace $\mathcal{B}$ and proceed in a similar fashion. The primal-dual witness technique can be briefly summarized as

(i) Prove the regularized objective function (8) is strictly convex over the constraint set, so that any zero subgradient point is the unique global minimizer.

(ii) Construct a zero subgradient point of the oracle estimator objective function using Brouwer’s theorem.

(iii) Prove this zero subgradient point $\widehat{\Omega}_{\mathrm{oracle}}$ converges to the true $\Omega_{0}$ .

(iv) Prove that the zero subgradient point of the oracle objective is also a zero subgradient point of the full objective function (8), hence it is the unique global minimizer and converges to $\Omega_{0}$ .

Proceeding with the full proof, we first have the following lemma.

Lemma G.34.

Suppose $g_{\rho}$ is $\mu$ -amenable. Then for $\kappa=\sqrt{\frac{2}{\mu}}$ , the objective function (8) is strictly convex over the constraint set.

Proof G.35.

Recall that

[TABLE]

which is a deterministic quantity not depending on the data. Hence, for $\|\Omega\|_{2}\leq\sqrt{1/\mu}$ , the minimum eigenvalue satisfies

[TABLE]

This implies that $-\log|\Omega|+\langle\widehat{S},\Omega\rangle-\frac{\mu}{2}\|\Omega\|_{F}^{2}$ is convex for $\|\Omega\|_{2}\leq\sqrt{1/\mu}$ . Furthermore, by $\mu$ -amenability, $\sum_{k=1}^{K}m_{k}\sum_{i\neq j}g_{\rho}({[{\Psi}_{k}]_{ij}})+\frac{\mu}{2}\|\Omega\|_{F}^{2}$ is convex for $\Omega\in\mathcal{K}_{\mathbf{p}}$ . Therefore, since $\mathcal{K}_{\mathbf{p}}$ is a linear subspace, the complete objective (8) is convex for $\|\Omega\|_{2}\leq\sqrt{1/\mu}$ and $\Omega\in\mathcal{K}_{\mathbf{p}}$ . Since it is convex over $\mathcal{K}_{\mathbf{p}}$ , it is convex over $\mathcal{K}_{\mathbf{p}}^{\sharp}$ as well, since $\mathcal{K}_{\mathbf{p}}^{\sharp}$ is the intersection of $\mathcal{K}_{\mathbf{p}}$ and the convex positive definite cone. ∎

Since the objective is convex, a point in the subspace $\mathcal{K}_{\mathbf{p}}$ with zero subgradient will be the unique global minimum. Our first step will be to construct such a zero subgradient point.

We will first construct the (unique) oracle estimate where the oracle gives the support set of $\Omega_{0}$ . We will then show that this oracle estimate is also a zero-subgradient point of the objective (8) and therefore its unique global minimizer.

Using the $\mathcal{B}$ notation, we can write the oracle estimate as

[TABLE]

Our goal will be to construct a map $F:\mathcal{B}\rightarrow\mathcal{B}$ such that (a) $\Delta$ is a fixed point of $F$ if and only if $\Omega_{0}+\Delta$ is a fixed point of the oracle estimate (75), (b) $F$ maps the intersection $\mathcal{B}\cap\mathbb{B}_{\infty}(r)$ of $\mathcal{B}$ and the radius- $r$ $\ell_{\infty}$ -ball centered at the origin to itself for some $r$ , and (c) this $r$ is such that $\Omega=\Omega_{0}+\Delta\succ 0$ , for all $\Delta\in\mathcal{B}\cap\mathbb{B}_{\infty}(r)$ . Then by Brouwer’s fixed point theorem we can show that $F$ must have a fixed point $\Delta_{*}$ in that ball. By construction (a) above, this fixed point $\Delta_{*}$ will correspond to a fixed point $\Omega_{0}+\Delta_{*}$ in the oracle estimator objective, hence the oracle estimate will have $\ell_{\infty}$ -ball error less than $r$ .

For $F$ , we will choose a Newton method step (gradient step preconditioned by inverse Hessian). Denote the pseudoinverse of a matrix $A$ as $A^{\dagger}$ . We now write the map $F:\mathcal{B}\rightarrow\mathcal{B}$ given by

[TABLE]

where $\Delta_{S}\in\mathcal{B}$ , and we let $\Gamma$ be the Hessian of the objective function within $\mathcal{B}$ (with $\mathcal{P}_{\mathcal{B}}=UU^{T}$ as above, where the columns of $U$ form an orthonormal basis for the subspace $\mathcal{B}$ , we have $\Gamma=UU^{T}(\Sigma_{0}\otimes\Sigma_{0})UU^{T}$ and hence $\Gamma^{\dagger}=U\left(U^{T}(\Sigma_{0}\otimes\Sigma_{0})U\right)^{-1}U^{T}$ since $\Sigma_{0}$ is positive definite):

[TABLE]

The quantity $\Sigma_{0}\otimes\Sigma_{0}$ is included as it is the Hessian of the objective function (74). The pseudoinverse is needed since $\mathcal{P}_{\mathcal{B}}$ is low rank, making the Hessian within $\mathcal{B}$ low rank.
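The closed form for $\Gamma^{\dagger}$ can be checked numerically. The sketch below uses a generic low-dimensional subspace with orthonormal basis `U` as a stand-in for $\mathcal{B}$ and a random positive definite `sigma0`; both are hypothetical choices. It compares $U(U^{T}(\Sigma_{0}\otimes\Sigma_{0})U)^{-1}U^{T}$ with a numerically computed pseudoinverse, and confirms that $\Gamma^{\dagger}\Gamma$ acts as the identity on vectors in the subspace.

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 9, 4                     # ambient dimension p^2 = 9, subspace dimension 4

# Orthonormal basis of a generic subspace standing in for vec(B).
U, _ = np.linalg.qr(rng.standard_normal((n, r)))
P = U @ U.T

# Positive definite Sigma_0 (3 x 3), so Sigma_0 kron Sigma_0 is 9 x 9.
G = rng.standard_normal((3, 3))
sigma0 = G @ G.T + 3.0 * np.eye(3)
H = np.kron(sigma0, sigma0)

gamma = P @ H @ P                                           # Hessian within the subspace
gamma_pinv_formula = U @ np.linalg.inv(U.T @ H @ U) @ U.T
gamma_pinv_numeric = np.linalg.pinv(gamma, rcond=1e-10)     # explicit cutoff for the rank-4 matrix

delta = U @ rng.standard_normal(r)                          # a vector inside the subspace
recovered = gamma_pinv_formula @ gamma @ delta
```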

Clearly if $\mathrm{Proj}_{\mathcal{B}}(\widehat{S}-(\Omega_{0}+\Delta_{S})^{-1})=0$ , $F(\Delta_{S})=\Delta_{S}$ and vice versa, hence $\Delta_{S}$ is a fixed point of $F$ if and only if $\Omega_{0}+\Delta_{S}$ is a fixed point of the oracle objective (75). Now

[TABLE]

since $\Delta_{S}$ has at most $d$ nonzero entries per row. Hence the matrix $\Omega_{0}+\Delta_{S}$ is invertible and positive definite whenever $dr<\lambda_{\min}(\Omega_{0})$ , making $F$ a continuous map on $\mathbb{B}_{\infty}(r)\cap\mathcal{B}$ and satisfying condition (c).
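The chain $\|\Delta_{S}\|_{2}\leq\|\Delta_{S}\|_{\infty}\leq dr$ and the resulting positive definiteness can be illustrated on a small example. In this sketch we read $\|\cdot\|_{\infty}$ as the maximum absolute row sum and $\mathbb{B}_{\infty}(r)$ as the entrywise ball of radius $r$ ; the matrix size, the degree `d`, the radius `r`, and the choice $\Omega_{0}=I$ are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
p, d, r = 30, 4, 0.05           # hypothetical size, row degree bound, entrywise radius

# Symmetric tridiagonal perturbation: at most 3 <= d nonzeros per row, entries in [-r, r].
delta = np.zeros((p, p))
for i in range(p):
    for j in range(i, min(i + 2, p)):
        delta[i, j] = delta[j, i] = rng.uniform(-r, r)

spectral = np.linalg.norm(delta, 2)         # largest singular value
row_sum = np.linalg.norm(delta, np.inf)     # maximum absolute row sum

# With Omega_0 = I we have lambda_min(Omega_0) = 1 > d * r, so Omega_0 + delta stays PD.
omega0 = np.eye(p)
min_eig = np.linalg.eigvalsh(omega0 + delta).min()
```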

Define the constants $\kappa_{\Gamma}=\|\Gamma^{\dagger}\|_{\infty}$ and $\kappa_{\Sigma}=\|\Sigma_{0}\|_{\infty}$ . Treating these as constants amounts to assuming that the Hessian is well-conditioned in the $\infty$ -norm sense, which is possible since $\Sigma_{0}$ has eigenvalues bounded from above and below. We now show the following lemma by verifying the remaining condition (b) on $F$ and applying Brouwer’s fixed point theorem. Several relevant quantities are summarized in Table 2 for convenience.

Lemma G.36.

Let $r=2C_{0}\kappa_{\Gamma}\|\Sigma_{0}\|_{2}(K+1)\sqrt{\frac{\log p}{n\min_{k}m_{k}}}$ where $C_{0}$ is a constant depending only on the subgaussian parameter of the data and

[TABLE]

Assume the sample size satisfies $n\min_{k}m_{k}\geq\kappa_{\Gamma}^{2}\log p$ . Then under event $\mathcal{A}$ as in Theorem 1 there exists $\widehat{\Omega}_{\mathrm{oracle}}\in\mathcal{B}$ such that

[TABLE]

Proof G.37.

First, note that $\Gamma^{\dagger}\Gamma\mathrm{vec}(\Delta)=\mathrm{vec}(\Delta)$ for any $\Delta\in\mathcal{B}$ , since $\Gamma$ is the projection of the positive definite matrix $\Sigma_{0}\otimes\Sigma_{0}$ onto the low rank subspace $\mathcal{B}$ .

Suppose $\Delta_{S}\in\mathbb{B}_{\infty}(r)$ . Then

[TABLE]

hence

[TABLE]

by the definition of $\kappa_{\Gamma}$ and the triangle inequality.

The first term of (76) can be bounded via the concentration inequalities used for the $\ell_{1}$ case. Specifically, note that

[TABLE]

where we have used $\tau_{\Sigma}=\frac{{\rm tr}(\widehat{S})-{\rm tr}(\Sigma_{0})}{p}$ . Now recall that under event $\mathcal{A}_{k}$ , defined in (46) above,

[TABLE]

and under event $\mathcal{A}_{0}$ , defined above in (45),

[TABLE]

Hence under event $\mathcal{A}=\bigcap_{k=0}^{K}\mathcal{A}_{k}$ ,

[TABLE]

Finally, recall that by Lemma E.16 event $\mathcal{A}$ holds with probability $\geq 1-2(K+1)\exp(-c\log p)$ .

Moving on to the second term of (76), we apply the matrix expansion

[TABLE]

and note that (since $\Delta_{S}\in\mathcal{B}$ implies $\mathcal{P}_{\mathcal{B}}\mathrm{vec}(\Delta_{S})=\mathrm{vec}(\Delta_{S})$ )

[TABLE]

where we have used the fact that for symmetric matrices $A,B$ , $\mathrm{vec}(ABA)=(A\otimes A)\mathrm{vec}(B)$ .
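The vectorization identity is easy to confirm numerically. The following sketch uses random symmetric matrices of a hypothetical size and the column-stacking convention for $\mathrm{vec}(\cdot)$ .

```python
import numpy as np

rng = np.random.default_rng(3)
p = 4                                           # hypothetical dimension

def random_symmetric(size):
    g = rng.standard_normal((size, size))
    return g + g.T

def vec(mat):
    """Column-stacking vectorization."""
    return mat.ravel(order="F")

A = random_symmetric(p)
B = random_symmetric(p)

lhs = vec(A @ B @ A)                            # vec(ABA)
rhs = np.kron(A, A) @ vec(B)                    # (A kron A) vec(B)
```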

We then obtain

[TABLE]

We have used $\mathrm{vec}^{-1}(\cdot)$ to denote the inverse of the vectorization operator.

Via the triangle inequality and the linearity of the vectorization and projection operators,

[TABLE]

Now we can apply Hölder’s inequality to obtain

[TABLE]

Then, using the fact that $\|\Delta\|_{2}\leq\|\Delta\|_{\infty}\leq dr$ and substituting back into (80), we have

[TABLE]

Since our assumption implies that $2\kappa_{\Sigma}^{3}dr^{2}\leq r$ , we therefore have that

[TABLE]

under event $\mathcal{A}$ . Since $F(\mathbb{B}_{\infty}(r)\cap\mathcal{B})\subseteq\mathbb{B}_{\infty}(r)\cap\mathcal{B}$ , by Brouwer’s fixed point theorem (Ortega and Rheinboldt, 1970), $F$ must have a fixed point $\Delta_{S}^{*}$ . Recalling that $\Delta^{*}_{S},\Omega_{0}\in\mathcal{B}$ , we choose $\widehat{\Omega}_{\mathrm{oracle}}=\Omega_{0}+\Delta^{*}_{S}$ . Hence by construction $\|\widehat{\Omega}_{\mathrm{oracle}}-\Omega_{0}\|_{\max}\leq r$ and $\|\widehat{\Omega}_{\mathrm{oracle}}-\Omega_{0}\|_{2}\leq dr$ since both matrices have degree bounded by $d$ .

The last equality follows since $\Delta_{S}^{*}$ is the fixed point of $F$ , i.e. $F(\Delta_{S}^{*})=\Delta_{S}^{*}$ , which can only occur if

[TABLE]

∎

Using this lemma it remains to show that $\widehat{\Omega}_{\mathrm{oracle}}$ satisfies the constraints and is a zero-subgradient point of the complete objective (8), and hence is the unique global optimum.

Define $\mathcal{L}_{n}(\Omega)$ to be the objective function (8) less the regularization terms, i.e.

[TABLE]

Lemma G.38.

The oracle estimate $\widehat{\Omega}_{\mathrm{oracle}}$ will be a zero-subgradient point of the global objective (8) if the inequalities

[TABLE]

and

[TABLE]

hold, where

[TABLE]

We have denoted $\nabla_{\mathcal{K}_{\mathbf{p}}}f=\mathcal{P}_{\mathcal{K}_{\mathbf{p}}}\nabla f$ and $\nabla^{2}_{\mathcal{K}_{\mathbf{p}}}f=\mathcal{P}_{\mathcal{K}_{\mathbf{p}}}(\nabla^{2}f)\mathcal{P}_{\mathcal{K}_{\mathbf{p}}}$ to be the gradient and Hessian respectively of $f$ projected onto the subspace $\mathcal{K}_{\mathbf{p}}$ ( $\mathcal{P}_{\mathcal{K}_{\mathbf{p}}}$ is the projection matrix onto $\mathcal{K}_{\mathbf{p}}$ ).

Proof G.39.

In this proof, for simplicity we write $q_{\rho}(\widehat{\Omega})$ to indicate $q_{\rho}(t)=g_{\rho}(t)-\rho|t|$ applied elementwise to the offdiagonal elements of $\widehat{\Omega}$ :

[TABLE]

Observe that by construction $\nabla_{\mathcal{K}_{\mathbf{p}}}q_{\rho}({\Omega})=\nabla_{\mathbb{R}^{p\times p}}q_{\rho}({\Omega})=\nabla q_{\rho}(\Omega)$ for any ${\Omega}\in\mathcal{K}_{\mathbf{p}}$ .

For the objective (8), the zero subgradient condition is given by

[TABLE]

where $\widehat{z}=\partial|\widehat{\Omega}|_{1,\mathrm{off}}$ is an element of the subgradient of the off-diagonal $\ell_{1}$ norm at $\widehat{\Omega}$ . Adding and subtracting $\nabla_{\mathcal{K}_{\mathbf{p}}}\mathcal{L}_{n}(\Omega_{0})$ gives

[TABLE]

By the fundamental theorem of calculus we have (for $\widehat{\Omega}=\widehat{\Omega}_{\mathrm{oracle}}$ ) that $\nabla_{\mathcal{K}_{\mathbf{p}}}\mathcal{L}_{n}(\widehat{\Omega}_{\mathrm{oracle}})-\nabla_{\mathcal{K}_{\mathbf{p}}}\mathcal{L}_{n}(\Omega_{0})=\widehat{Q}\mathrm{vec}(\widehat{\Omega}_{\mathrm{oracle}}-\Omega_{0})$ , hence

[TABLE]

Rewriting in block form gives

[TABLE]

where $\widehat{Q}_{SS}$ is the block of $\widehat{Q}$ corresponding to the elements in $\mathcal{S}$ along both axes, $\widehat{Q}_{S^{c}S^{c}}$ is the block of $\widehat{Q}$ corresponding to the elements in the complement of $\mathcal{S}$ , etc. After some algebra we obtain a solution

[TABLE]

since $\nabla q_{\rho}(0)=0$ by definition. Now from Lemma G.36, under event $\mathcal{A}$

[TABLE]

and observe that $\rho\gamma>r$ since we have assumed that $n\min_{k}m_{k}\geq c_{0}d^{2}\log p$ for some $c_{0}$ large enough. By our assumption that $|[\Omega_{0}]_{ij}|\geq\rho\gamma+r$ for all $i,j$ , we then have (again under event $\mathcal{A}$ )

[TABLE]

Therefore, using condition (f) of the definition of a $(\mu,\gamma)$ regularizer, $-\nabla q_{\rho}(\widehat{\Omega}_{\mathrm{oracle}})_{S}+\rho\widehat{z}_{S}=0$ and

[TABLE]

where we have applied the assumed inequalities. Since $\|\widehat{z}_{S^{c}}\|_{\infty}\leq 1$ , it is a feasible subgradient and therefore $\widehat{\Omega}_{\mathrm{oracle}}$ is a zero subgradient point of the global objective function (8).

∎

We now show the inequalities (81), (82) assumed by Lemma G.38 hold under event $\mathcal{A}$ . Note that

[TABLE]

and thus by (78), under event $\mathcal{A}$ equation (81) holds with $\rho=\frac{r}{\kappa_{\Gamma}}.$

It remains to show (82) holds with $\rho=r$ . We will first bound

[TABLE]

and then show that the expression on the left hand side of (82) is close to this quantity.

First, by the definition of the infinity norm and $\mathcal{P}_{\mathcal{K}_{\mathbf{p}}}$ it can be shown that

[TABLE]

where we have used the expression (99) for the elements of the projected matrix and the fact that an average of a set of elements of $A$ cannot have magnitude larger than $\|A\|_{\max}$ . Noting that $(\nabla_{\mathcal{K}_{\mathbf{p}}}^{2}\mathcal{L}_{n}(\Omega_{0}))_{SS}^{{\dagger}}=(\Gamma^{{\dagger}})_{SS}$ ,

[TABLE]

since $\|(\nabla_{\mathcal{K}_{\mathbf{p}}}^{2}\mathcal{L}_{n}(\Omega_{0}))_{S^{c}S}\|_{\infty}\leq\|\nabla_{\mathcal{K}_{\mathbf{p}}}^{2}\mathcal{L}_{n}(\Omega_{0})\|_{\infty}=\left\|\mathcal{P}_{\mathcal{K}_{\mathbf{p}}}\left(\nabla^{2}\mathcal{L}_{n}(\Omega_{0})\right)\mathcal{P}_{\mathcal{K}_{\mathbf{p}}}\right\|_{\infty}\leq\|\mathcal{P}_{\mathcal{K}_{\mathbf{p}}}\|_{\infty}^{2}\|\nabla^{2}\mathcal{L}_{n}(\Omega_{0})\|_{\infty}=\|\mathcal{P}_{\mathcal{K}_{\mathbf{p}}}\|_{\infty}^{2}\|\Sigma_{0}\otimes\Sigma_{0}\|_{\infty}=\|\mathcal{P}_{\mathcal{K}_{\mathbf{p}}}\|_{\infty}^{2}\|\Sigma_{0}\|_{\infty}^{2}\leq 4K^{2}\kappa_{\Sigma}^{2}$ , and (81) holds under event $\mathcal{A}$ with $\rho=\frac{r}{\kappa_{\Gamma}}$ .

We now relate the bound in (84) to that required to show (82). Note that

[TABLE]

where we have defined

[TABLE]

Now, again invoking (81),

[TABLE]

The infinity norm of $\Xi$ can be bounded as

[TABLE]

where we have set

[TABLE]

First note that by (83)

[TABLE]

and $\|(\widehat{Q}_{SS})^{\dagger}\|_{\infty}=O(1+\delta_{2})$ by the definition of $\delta_{2}$ .

Substituting into (87) gives

[TABLE]

We bound $\delta_{1}$ and $\delta_{2}$ in the following lemma, proved in Section G.1.

Lemma G.40.

Under the conditions of Lemma G.36,

[TABLE]

Applying Lemma G.40 to (88), we obtain

[TABLE]

and substituting into (86)

[TABLE]

since $dr=o(1)$ by our assumption that $n\min_{k}m_{k}\geq c_{0}d^{2}\log p$ for some $c_{0}$ large enough.

Therefore, substituting into (85) we obtain

[TABLE]

proving the desired condition (82) holds. Hence the conditions of Lemma G.38 hold under event $\mathcal{A}$ , and $\widehat{\Omega}_{\mathrm{oracle}}$ is the unique global minimizer of the complete objective (8).

The Frobenius and spectral norm bounds follow from the identities

[TABLE]

and

[TABLE]

where the latter identity follows by symmetry of $\Omega$ .

G.1 Proof of Lemma G.40: Bound on $\delta_{1},\delta_{2}$

Proof G.41.

Consider that

[TABLE]

Hence, since $\|\mathcal{P}_{\mathcal{K}_{\mathbf{p}}}\|_{\infty}\leq 2K$ by (83),

[TABLE]

By Lemma G.36, for $t\in[0,1]$ ,

[TABLE]

We make use of the following matrix inequalities (Loh et al., 2017). For any invertible $A,B\in\mathbb{R}^{p\times p}$ and matrix norm $\|\cdot\|$ ,

[TABLE]

if $\|A^{-1}\|\|A-B\|\leq 1/2$ . For any $A$ and $B$ matrices of equal dimension we have

[TABLE]

Applying (89) we get

[TABLE]

since $\|\Omega_{0}^{-1}\|_{\infty}=\|\Sigma_{0}\|_{\infty}$ is bounded by $\kappa_{\Sigma}$ . Applying (90) to this yields

[TABLE]

which gives

[TABLE]

and

[TABLE]

Finally, recall that the projection matrix onto $\mathcal{K}_{\mathbf{p}}$ can be written as $UU^{T}$ with $U^{T}U=I$ so

[TABLE]

By the matrix expansion (79) we then have

[TABLE]

We can then use the bound (91) to obtain

[TABLE]

since $dr=o(1)$ .

∎

Appendix H Numerical Convergence of TG-ISTA

The following theorem shows that the iterates of the TG-ISTA implementation of TeraLasso converge geometrically to the global minimum:

Theorem H.42.

Let $\rho_{k}\geq 0$ for all $k$ and let $\Omega_{\mathrm{init}}$ be the initialization of the TG-ISTA implementation of TeraLasso (Algorithm 4). Let

[TABLE]

and assume $\zeta_{t}\leq a^{2}$ for all $t$ . Suppose further that $\Omega^{*}$ is the global optimum. Then

[TABLE]

Furthermore, the step size $\zeta_{t}$ which yields an optimal worst-case contraction bound $s(\zeta_{t})$ is $\zeta=\frac{2}{a^{-2}+b^{-2}}$ . The corresponding optimal worst-case contraction bound is

[TABLE]

Our proof uses results on the structure of the Kronecker sum subspace to extend to our subspace restricted setting the methodology that Guillot et al. (2012) used to derive the unstructured GLasso convergence rates.

We decompose the claims of Theorem H.42 into the following two theorems which we prove separately.

Theorem H.43.

Assume that the iterates $\Omega_{t}$ of Algorithm 4 satisfy $aI\preceq\Omega_{t}\preceq bI$ , for all $t$ , for some fixed constants $0<a<b<\infty$ . Suppose further that $\Omega^{*}$ is the global optimum. If $\zeta_{t}\leq a^{2}$ for all $t$ , then

[TABLE]

Furthermore, the step size $\zeta_{t}$ which yields an optimal worst-case contraction bound $s(\zeta_{t})$ is $\zeta=\frac{2}{a^{-2}+b^{-2}}$ . The corresponding optimal worst-case contraction bound is

[TABLE]

Theorem H.44.

Let $\rho_{k}\geq 0$ for all $k$ and let $\Omega_{\mathrm{init}}$ be the initialization of the TG-ISTA implementation of TeraLasso (Algorithm 4). Let

[TABLE]

and assume $\zeta_{t}\leq a^{2}$ for all $t$ . Then the iterates $\Omega_{t}$ of Algorithm 4 satisfy $aI\preceq\Omega_{t}\preceq bI$ for all $t$ .

Observe that by Theorem H.44, the worst case contraction factor (93)

[TABLE]

scales at most as $s(\zeta)=O(1-\frac{2}{1+K^{2}})$ for $\|\Omega^{*}\|_{2},\|\Sigma_{0}\|_{2}$ of fixed order, since $\|S_{k}\|_{2}\sim\|\Sigma_{0}\|_{2}$ with high probability.

Let $T$ be the number of iterations required for $\|\Omega_{T}-\Omega^{*}\|_{F}\leq\|\Omega^{*}-\widehat{\Omega}\|_{F}$ to hold, i.e. for the optimization error to be smaller than the statistical error. By Theorem 1, we require

[TABLE]

Using worst case contraction factor $s(\zeta)$ , (94) will hold for $T$ such that (with high probability)

[TABLE]

Taking the logarithm of both sides and using $s(\zeta)=O(1-\frac{2}{1+K^{2}})$ , we have that the optimization error is guaranteed to be no larger than the statistical error after $T$ iterations, where

[TABLE]

H.1 Proof of Theorem H.43

For convenience, define the Kronecker sum shrinkage operator as

[TABLE]

for $A=A^{(1)}\oplus\dots\oplus A^{(K)}\in\mathcal{K}_{\mathbf{p}}$ and $\mathbf{\rho}=[\rho_{1},\dots,\rho_{K}]$ with all $\rho_{k}\geq 0$ . Note that $\mathrm{shrink}^{-}_{\rho}(A)=\arg\min_{\Omega\in\mathcal{K}_{\mathbf{p}}}\left\{\frac{1}{2}\left\|\Omega-A\right\|_{F}^{2}+\sum_{k=1}^{K}m_{k}\rho_{k}|{\Psi}_{k}|_{1,{\rm off}}\right\}$ . Since $\sum_{k=1}^{K}m_{k}\rho_{k}|{\Psi}_{k}|_{1,{\rm off}}$ is a convex function on $\mathcal{K}_{\mathbf{p}}$ , and since $\mathcal{K}_{\mathbf{p}}$ is a linear subspace, $\mathrm{shrink}^{-}_{\rho}(\cdot)$ is a proximal operator by definition.
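A minimal sketch of such an operator, consistent with the proximal characterization above, soft-thresholds the off-diagonal entries of each factor $A^{(k)}$ at level $\rho_{k}$ and leaves the diagonals untouched. The reasoning is that each off-diagonal entry of $\Psi_{k}$ appears $m_{k}$ times in $\Omega$ , so the weight $m_{k}$ cancels in the per-entry problem. The function names and the example factors are ours, and this is an illustration under that reading rather than a transcription of definition (95).

```python
import numpy as np

def soft_threshold_offdiag(a, t):
    """Soft-threshold off-diagonal entries of a at level t; keep the diagonal."""
    out = np.sign(a) * np.maximum(np.abs(a) - t, 0.0)
    out[np.diag_indices_from(a)] = np.diag(a)
    return out

def kron_sum_shrink(factors, rho):
    """Factor-wise off-diagonal shrinkage of A = A^(1) (+) ... (+) A^(K)."""
    return [soft_threshold_offdiag(a_k, rho_k) for a_k, rho_k in zip(factors, rho)]

# Hypothetical two-factor example.
A1 = np.array([[1.0, 0.7],
               [0.7, 2.0]])
A2 = np.array([[3.0, -0.1, 0.4],
               [-0.1, 3.0, 0.0],
               [0.4, 0.0, 3.0]])
S1, S2 = kron_sum_shrink([A1, A2], rho=[0.2, 0.3])
```

Here the entry $0.7$ of the first factor shrinks to $0.5$ , while in the second factor $-0.1$ is set to zero and $0.4$ shrinks to $0.1$ .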

Recall that we can write the TG-ISTA update (27) using this Kronecker sum shrinkage operator as

[TABLE]

where $\widehat{S}$ is the sample covariance (3) and $\widetilde{S}=\mathrm{Proj}_{\mathcal{K}_{\mathbf{p}}}(\widehat{S})$ is its projection onto $\mathcal{K}_{\mathbf{p}}$ (5.2).

By convexity in $\mathcal{K}_{\mathbf{p}}$ and Theorem B.7, the optimal point $\Omega^{*}_{\rho}$ is a fixed point of the ISTA iteration (Combettes and Wajs (2005), Prop 3.1). Thus,

[TABLE]

Since proximal operators are nonexpansive (Combettes and Wajs, 2005), we have

[TABLE]

For $\gamma>0$ define $h_{\gamma}:\mathcal{K}_{\mathbf{p}}^{\sharp}\rightarrow\mathcal{K}_{\mathbf{p}}^{\sharp}$ by

[TABLE]

Since $\partial\Omega^{-1}/\partial\Omega=-\Omega^{-1}\otimes\Omega^{-1}$ ,

[TABLE]

where $P$ is the projection matrix that projects $\mathrm{vec}(\Omega)$ onto the vectorized subspace $\mathcal{K}_{\mathbf{p}}$ . Thus, we have the Jacobian (valid for all $\Omega\in\mathcal{K}_{\mathbf{p}}^{\sharp}$ )

[TABLE]

Recall that if $h:U\subset\mathbb{R}^{n}\rightarrow\mathbb{R}^{m}$ is a differentiable mapping on a convex set $U$ with Jacobian $J_{h}(\cdot)$ , then for any $x,y\in U$ ,

[TABLE]

Thus, letting $Z_{t,c}=\mathrm{vec}(c\Omega_{t}+(1-c)\Omega^{*}_{\rho})$ , for $c\in[0,1]$ we have

[TABLE]

By Weyl’s inequality, $\lambda_{\max}(Z_{t,c})\leq\max\{\|\Omega_{t}\|,\|\Omega_{\rho}^{*}\|\}$ and

[TABLE]

Furthermore, note that for any $Y$ and projection matrix $P$

[TABLE]

We then have

[TABLE]

where the latter inequality comes from (Guillot et al., 2012). Thus,

[TABLE]

as desired. Algorithm 4 will then converge if $s(\zeta_{t})\in(0,1)$ for all $t$ . The minimum of $s(\zeta)$ occurs at $\zeta=\frac{2}{a^{-2}+b^{-2}}$ , completing the proof of Theorem H.43. ∎
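The optimal step size can be checked with a simple grid search. The sketch below assumes the contraction bound has the form $s(\zeta)=\max\{|1-\zeta/a^{2}|,|1-\zeta/b^{2}|\}$ , as in Guillot et al. (2012); this form is our assumption, since the display defining $s(\zeta)$ is not reproduced here. The values of `a` and `b` are hypothetical.

```python
import numpy as np

def contraction(zeta, a, b):
    """Assumed worst-case contraction bound s(zeta)."""
    return max(abs(1.0 - zeta / a**2), abs(1.0 - zeta / b**2))

a, b = 0.5, 2.0                                     # hypothetical eigenvalue bounds
zeta_star = 2.0 / (a**-2 + b**-2)                   # claimed minimizer of s

grid = np.linspace(1e-4, 2.0 * a**2, 20001)
values = np.array([contraction(z, a, b) for z in grid])
zeta_grid = grid[np.argmin(values)]                 # numerical minimizer on the grid

s_star = contraction(zeta_star, a, b)
s_closed_form = 1.0 - 2.0 / (1.0 + b**2 / a**2)     # value at the balanced point
```

Under this assumed form, the two absolute values are equal at $\zeta=\frac{2}{a^{-2}+b^{-2}}$ , which is why the minimum occurs there.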

H.2 Proof of Theorem H.44

We first prove the following properties of the Kronecker sum projection operator.

Lemma H.45.

For any $A\in\mathbb{R}^{p\times p}$ and orthogonal matrices $U_{k}\in\mathbb{R}^{d_{k}\times d_{k}}$ , let $U=U_{1}\otimes\dots\otimes U_{K}\in\mathcal{K}_{\mathbf{p}}$ . Then

[TABLE]

Furthermore, if the eigendecomposition of $A$ is of the form $A=(U_{1}\otimes\dots\otimes U_{K})\Lambda(U_{1}\otimes\dots\otimes U_{K})^{T}$ with $\Lambda=\mathrm{diag}(\lambda_{1},\dots,\lambda_{p})$ , we have

[TABLE]

and

[TABLE]

Proof H.46.

Recall

[TABLE]

since $U^{T}AU=\Lambda$ and the Frobenius norm is unitarily invariant. Now, note that for any matrix $B=B_{1}\oplus\dots\oplus B_{K}\in\mathcal{K}_{\mathbf{p}}$ ,

[TABLE]

since $U_{k}^{T}I_{d_{k}}U_{k}=I_{d_{k}}$ . Since $U^{T}BU\in\mathcal{K}_{\mathbf{p}}$ , the constraint $B\in\mathcal{K}_{\mathbf{p}}$ can be moved to $C=U^{T}BU$ , giving

[TABLE]

If $A=(U_{1}\otimes\dots\otimes U_{K})\Lambda(U_{1}\otimes\dots\otimes U_{K})^{T}$ , then $U^{T}AU=\Lambda$ , completing the first part of the proof. As shown in Lemma I.55, $\mathrm{Proj}_{\mathcal{K}_{\mathbf{p}}}(\Lambda)$ is a diagonal matrix whose entries are weighted averages of the diagonal elements $\lambda_{i}$ . Hence

[TABLE]

Since $\mathrm{Proj}_{\mathcal{K}_{\mathbf{p}}}(\Lambda)$ gives the eigenvalues of $\mathrm{Proj}_{\mathcal{K}_{\mathbf{p}}}(A)$ by the orthogonality of $U$ , this completes the proof. ∎

Lemma H.47.

Let $0<a<b$ be given positive constants and let $\zeta_{t}>0$ . Assume $aI\preceq\Omega_{t}\preceq bI$ . Then for

[TABLE]

we have

[TABLE]

Proof H.48.

Let $U\Gamma U^{T}=\Omega_{t}$ be the eigendecomposition of $\Omega_{t}$ , where $\Gamma=\mathrm{diag}(\gamma_{1},\dots,\gamma_{p})$ . Then all $b\geq\gamma_{i}\geq a>0$ . Since $\Omega_{t}\in\mathcal{K}_{\mathbf{p}}$ , by the eigendecomposition property in Appendix I we have $U=U_{1}\otimes\dots\otimes U_{K}$ and $\Gamma\in\mathcal{K}_{\mathbf{p}}$ , letting us apply Lemma H.45:

[TABLE]

where we set $\widetilde{\Omega}_{t+1/2}=U(\Gamma+\zeta_{t}\Gamma^{-1}-\zeta_{t}(U^{T}\widehat{S}U))U^{T}$ and recall the linearity of the projection operator $\mathrm{Proj}_{\mathcal{K}_{\mathbf{p}}}(\cdot)$ (Lemma I.55). By Weyl’s inequality,

[TABLE]

By Lemma H.45,

[TABLE]

Note that the only extremum of the function $f(x)=x+\frac{\zeta_{t}}{x}$ over $a\leq x\leq b$ is a global minimum at $x=\sqrt{\zeta_{t}}$ . Hence

[TABLE]

By our assumption, $a\leq\gamma_{i}\leq b$ for all $i$ . Thus

[TABLE]

as desired, completing the proof. ∎

We then have the following lemma.

Lemma H.49.

For $A\in\mathcal{K}_{\mathbf{p}}^{\sharp}$ and $\mathbf{\epsilon}=[\epsilon_{1},\dots,\epsilon_{K}]$ with $\epsilon_{k}\geq 0$ :

[TABLE]

Proof H.50.

Since by definition (95)

[TABLE]

we can use the fact that the eigenvalues of a Kronecker sum are the sums of the eigenvalues to show

[TABLE]

We have used the fact that $A$ is positive definite since it is in $\mathcal{K}_{\mathbf{p}}^{\sharp}$ .
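The eigenvalue fact used here can be confirmed numerically for two factors. The sketch below forms $A_{1}\oplus A_{2}=A_{1}\otimes I+I\otimes A_{2}$ for random symmetric factors of hypothetical sizes and compares its spectrum with all pairwise sums of the factor eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(4)

def random_symmetric(d):
    g = rng.standard_normal((d, d))
    return (g + g.T) / 2.0

d1, d2 = 3, 4                                   # hypothetical factor sizes
A1, A2 = random_symmetric(d1), random_symmetric(d2)

# Two-factor Kronecker sum A1 (+) A2.
ksum = np.kron(A1, np.eye(d2)) + np.kron(np.eye(d1), A2)

ev1 = np.linalg.eigvalsh(A1)                    # ascending order
ev2 = np.linalg.eigvalsh(A2)
pairwise_sums = np.sort((ev1[:, None] + ev2[None, :]).ravel())
spectrum = np.linalg.eigvalsh(ksum)
```

In particular, the smallest eigenvalue of the Kronecker sum is the sum of the smallest factor eigenvalues.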

Via Weyl’s inequality and the proof of Lemma 6 in (Guillot et al., 2012),

[TABLE]

Hence,

[TABLE]

∎

H.2.1 Proof of Theorem H.44

To prove the lower inequality in Theorem H.44, we show the following.

Lemma H.51.

Let $\rho=[\rho_{1},\dots,\rho_{K}]$ with all $\rho_{i}>0$ . Define

[TABLE]

and let $\alpha=\frac{1}{\|\widehat{S}\|_{2}+\chi}<b^{\prime}$ . Assume $\alpha I\preceq\Omega_{t}\preceq b^{\prime}I$ . Then $\alpha I\preceq\Omega_{t+1}$ for every $0<\zeta_{t}<\alpha^{2}$ .

Proof H.52.

Since $\zeta_{t}<\alpha^{2}$ , $\sqrt{\zeta_{t}}\notin[\alpha,b^{\prime}]$ , and $\min\left(\alpha+\frac{\zeta_{t}}{\alpha},b^{\prime}+\frac{\zeta_{t}}{b^{\prime}}\right)=\alpha+\frac{\zeta_{t}}{\alpha}$ . Lemma H.47 then implies that

[TABLE]

By Lemma H.49,

[TABLE]

Hence, since $\zeta_{t}>0$ , $\lambda_{\min}(\Omega_{t+1})\geq\alpha$ whenever

[TABLE]

∎

The upper bound in Theorem H.44 results from the following lemma.

Lemma H.53.

Let $\chi$ be as in Lemma H.51 and let $\alpha=\frac{1}{\|\widehat{S}\|_{2}+\chi}$ . Let $\zeta_{t}\leq\alpha^{2}$ for all $t$ . We then have $\Omega_{t}\preceq b^{\prime}I$ for all $t$ when $b^{\prime}=\|\Omega_{\rho}^{*}\|_{2}+\|\Omega_{0}-\Omega_{\rho}^{*}\|_{F}$ .

Proof H.54.

By Lemma H.51, $\alpha I\preceq\Omega_{t}$ for every $t$ . Since $\Omega_{t}\rightarrow\Omega_{\rho}^{*}$ , by strong convexity $\alpha I\preceq\Omega_{\rho}^{*}$ . Hence $a=\min\{\lambda_{\min}(\Omega_{t}),\lambda_{\min}(\Omega^{*}_{\rho})\}\geq\alpha$ . For $b>a$ and $\zeta_{t}\leq\alpha^{2}$ ,

[TABLE]

Hence, by Theorem H.42 $\|\Omega_{t}-\Omega^{*}_{\rho}\|_{F}\leq\|\Omega_{t-1}-\Omega^{*}_{\rho}\|_{F}\leq\|\Omega_{0}-\Omega^{*}_{\rho}\|_{F}$ . Thus

[TABLE]

so

[TABLE]

∎

This completes the proof of Theorem H.44. ∎

Appendix I Useful Properties of the Kronecker Sum and $\mathcal{K}_{\mathbf{p}}$

I.1 Basic Properties

As the properties of Kronecker sums are not always widely known, we have compiled a list of some fundamental algebraic relations we use.

1. Sum or difference of Kronecker sums (Laub, 2005):

[TABLE]

2. Factor-wise disjoint off-diagonal support (Laub, 2005). By construction, if for any $k$ and $i\neq j$

[TABLE]

then for all $\ell\neq k$

[TABLE]

Thus,

[TABLE]

3. Eigendecomposition: If $A_{k}=U_{k}\Lambda_{k}U_{k}^{T}$ are the eigendecompositions of the factors, then (Laub, 2005)

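$A_{1}\oplus\dots\oplus A_{K}=(U_{1}\otimes\dots\otimes U_{K})(\Lambda_{1}\oplus\dots\oplus\Lambda_{K})(U_{1}\otimes\dots\otimes U_{K})^{T}$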

is the eigendecomposition of $A_{1}\oplus\dots\oplus A_{K}$ . Some resulting identities useful for doing numerical calculations are as follows:

(a) L2 norm:

[TABLE]

(b) Determinant:

[TABLE]

(c) Matrix powers (e.g. inverse, inverse square root):

[TABLE]

Since the $\Lambda_{k}$ are diagonal, this calculation is memory and computation efficient.
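The identities above can be sanity-checked numerically. The following is a minimal NumPy sketch, not the paper's implementation: the helper `kron_sum`, the factor sizes and the random factors are illustrative assumptions, and the dense $p\times p$ matrices are formed only for verification. It computes the spectral norm, log-determinant, inverse and inverse square root of a Kronecker sum from the factor eigendecompositions alone.

```python
import numpy as np
from functools import reduce

def kron_sum(factors):
    """Dense Kronecker sum: sum over k of I (x) A_k (x) I."""
    dims = [A.shape[0] for A in factors]
    p = int(np.prod(dims))
    total = np.zeros((p, p))
    for k, A in enumerate(factors):
        left = np.eye(int(np.prod(dims[:k])))
        right = np.eye(int(np.prod(dims[k + 1:])))
        total += np.kron(np.kron(left, A), right)
    return total

rng = np.random.default_rng(0)
factors = []
for d in (2, 3, 4):  # hypothetical factor dimensions d_1, d_2, d_3
    M = rng.standard_normal((d, d))
    factors.append(M @ M.T + np.eye(d))  # symmetric positive definite factor

# Factor-wise eigendecompositions A_k = U_k diag(lam_k) U_k^T.
decomps = [np.linalg.eigh(A) for A in factors]
lams = [lam for lam, _ in decomps]
U = reduce(np.kron, [Uk for _, Uk in decomps])  # U_1 (x) ... (x) U_K

# Diagonal of Lambda_1 (+) ... (+) Lambda_K: every sum of one eigenvalue per factor,
# ordered consistently with the columns of U.
eigs = reduce(lambda acc, lam: np.add.outer(acc, lam).ravel(), lams)

spectral_norm = np.max(np.abs(eigs))
log_det = np.sum(np.log(eigs))            # valid here because all eigs > 0
Omega_inv = (U / eigs) @ U.T              # U diag(1 / eigs) U^T
Omega_inv_sqrt = (U / np.sqrt(eigs)) @ U.T

Omega = kron_sum(factors)                 # dense reference, for checking only
```

Only the $K$ small eigendecompositions and the length-$p$ vector `eigs` are needed for the norm and determinant; the dense `U` above is a convenience for the check and would be applied factor by factor in a memory-conscious implementation.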

I.2 Eigenstructure of $\Omega\in\mathcal{K}_{\mathbf{p}}$

Kronecker sum matrices $\Omega\in\mathcal{K}_{\mathbf{p}}$ have Kronecker product eigenvectors with linearly related eigenvalues, in contrast to the multiplicatively related eigenvalues of the Kronecker product. For simplicity, we illustrate the $K=2$ case, but the result generalizes to the full tensor case. Suppose that $\Psi_{1}=U_{1}\Lambda_{1}U_{1}^{T}$ and $\Psi_{2}=U_{2}\Lambda_{2}U_{2}^{T}$ are the eigendecompositions of $\Psi_{1}$ and $\Psi_{2}$. Then by Laub (2005), if $\Omega=\Psi_{1}\oplus\Psi_{2}$, the eigendecomposition of $\Omega$ is

$\Omega=(U_{1}\otimes U_{2})(\Lambda_{1}\oplus\Lambda_{2})(U_{1}\otimes U_{2})^{T}.$

Thus, the eigenvectors of the Kronecker sum are the Kronecker products of the eigenvectors of each factor. This “block” structure is evident in the inverse Kronecker sum example in Section 1 of the main text. The structure of $\Omega^{-1}$ is discussed further in Canuto et al. (2014).

This eigenstructure representation parallels the eigenvector structure of the Kronecker product. Specifically, when $\Omega=\Psi_{1}\otimes\Psi_{2}$,

$\Omega=(U_{1}\otimes U_{2})(\Lambda_{1}\otimes\Lambda_{2})(U_{1}\otimes U_{2})^{T}.$

Hence, use of the Kronecker sum model can be viewed as replacing the nonconvex, multiplicative eigenvalue structure of the Kronecker product with the convex, linear eigenvalue structure of the Kronecker sum. This additive structure results in relatively more stable estimation of the precision matrix. As the tensor dimension $K$ increases, this structural stability of the Kronecker sum as compared to the Kronecker product becomes more dominant ($K$-term sums instead of $K$-order products).
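One way to see the contrast between $K$-term sums and $K$-order products is to compare condition numbers. The sketch below is an illustration added here, not an argument from the paper: it assumes $K$ identical factors with a hypothetical spectrum `lam`, and the function name `condition_numbers` is likewise an assumption.

```python
import numpy as np
from functools import reduce

def condition_numbers(lam, K):
    """Condition numbers of the K-fold Kronecker sum and Kronecker product
    of a factor with positive eigenvalues `lam`."""
    lam = np.asarray(lam, dtype=float)
    # Eigenvalues of the Kronecker sum: all sums of one eigenvalue per factor.
    sum_eigs = reduce(lambda acc, l: np.add.outer(acc, l).ravel(), [lam] * K)
    # Eigenvalues of the Kronecker product: all products of one eigenvalue per factor.
    prod_eigs = reduce(lambda acc, l: np.multiply.outer(acc, l).ravel(), [lam] * K)
    return sum_eigs.max() / sum_eigs.min(), prod_eigs.max() / prod_eigs.min()

lam = [0.5, 1.0, 2.0]  # hypothetical factor spectrum, condition number 4
table = {K: condition_numbers(lam, K) for K in (1, 2, 3, 4)}
```

With identical factors the Kronecker sum keeps the factor's condition number for every $K$, whereas the Kronecker product's condition number is the factor's raised to the power $K$.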

I.3 Projection onto $\mathcal{K}_{\mathbf{p}}$

We first introduce a submatrix notation. Fix a $k$, and choose $i,j\in\{1,\dots,m_{k}\}$. Let $E_{1}\in\mathbb{R}^{\prod_{\ell=1}^{k-1}d_{\ell}\times\prod_{\ell=1}^{k-1}d_{\ell}}$ and $E_{2}\in\mathbb{R}^{\prod_{\ell=k+1}^{K}d_{\ell}\times\prod_{\ell=k+1}^{K}d_{\ell}}$ be such that $[E_{1}\otimes E_{2}]_{ij}=1$ with all other elements zero. Observe that $E_{1}\otimes E_{2}\in\mathbb{R}^{m_{k}\times m_{k}}$. For any matrix $A\in\mathbb{R}^{p\times p}$, let $A(i,j|k)\in\mathbb{R}^{d_{k}\times d_{k}}$ be the submatrix of $A$ defined via

[TABLE]

The submatrix $A(i,j|k)$ is defined for all $i,j\in\{1,\dots,m_{k}\}$ and $k=1,\dots,K$. When $A$ is a covariance matrix associated with a tensor $X$, this subblock corresponds to the covariance matrix between the $i$th and $j$th slices of $X$ along the $k$th dimension.

We can now express the projection operator $\mathrm{Proj}_{\mathcal{K}_{\mathbf{p}}}(A)$ in closed form:

Lemma I.55 (Projection onto $\mathcal{K}_{\mathbf{p}}$ ).

For any $A\in\mathbb{R}^{p\times p}$,

[TABLE]

where

[TABLE]

Since the submatrix operator $A(i,i|k)$ is clearly linear, $\mathrm{Proj}_{\mathcal{K}_{\mathbf{p}}}(\cdot)$ is a linear operator.

Proof I.56.

Since $\mathcal{K}_{\mathbf{p}}$ is a linear subspace, projection can be found via inner products. Specifically, recall that if a subspace $\mathcal{A}$ is spanned by an orthonormal basis $U$ , then

[TABLE]

Since $\mathcal{K}_{\mathbf{p}}$ is the space of Kronecker sums, the off diagonal elements are independent and do not overlap across factors. The diagonal portion is more difficult as each factor overlaps on the same entries, creating an overdetermined system. We can create an alternate parameterization of $\mathcal{K}_{\mathbf{p}}$ :

[TABLE]

where we constrain ${\rm tr}(\bar{A}_{k})=0$ . Each of the $K+1$ terms in this sum is now orthogonal to all other terms since by construction

[TABLE]

for $\ell\neq k$ and all possible $\bar{A}_{k}$ , $\tau_{A}$ . Thus, we can form bases for the $\bar{A}_{k}$ and $\tau_{A}$ independently. To find the $\bar{A}_{k}$ it suffices to project $A$ onto a basis for $\bar{A}_{k}$ . We can divide this projection into two steps. In the first step, we ignore the constraint on ${\rm tr}(\bar{A}_{k})$ and create the orthonormal basis

[TABLE]

for all $i,j=1,\dots,d_{k}$. Recall that in a projection of $\mathbf{x}$, the coefficient of a basis component $\mathbf{u}$ is given by $\mathbf{u}^{T}\mathbf{x}=\langle\mathbf{u},\mathbf{x}\rangle$. We can thus apply this elementwise to the projection of $A$. Hence projecting $A$ onto these basis components yields a matrix $B\sqrt{m_{k}}\in\mathbb{R}^{d_{k}\times d_{k}}$ where

[TABLE]

To enforce the ${\rm tr}(\bar{A}_{k})=0$ constraint, we project away from $B$ the one-dimensional subspace spanned by $I_{d_{k}}$ . This projection is given by

[TABLE]

where by construction

[TABLE]

Equation (98) completes the projection onto a basis for $\bar{A}_{k}$ , so we can expand the projection $\sqrt{m_{k}}B$ back into the original space. This yields a $\bar{A}_{k}$ of the form

[TABLE]

Finally, for $\tau_{A}$ we can compute

[TABLE]

Combining all these together and substituting into (97) allows us to define the projection in terms of matrices $\widetilde{A}_{k}$ , where we split the $\tau_{A}I_{p}$ term evenly across the other $K$ factors. Specifically

[TABLE]

where

[TABLE]

An equivalent representation using factorwise averages is

[TABLE]

where

[TABLE]

Moving the trace corrections to a last term and putting the result in terms of the $A_{k}$ yields the lemma.

In Algorithm 4 we use an efficient method of computing this projected inverse in our setting by exploiting the eigendecomposition identities in Section I.2. ∎
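The construction in the proof can be sketched in NumPy as follows. This is an assumption-laden illustration, not Algorithm 4: it follows the orthogonal parameterization described above (trace-free factors $\bar{A}_{k}$ plus $\tau_{A}I_{p}$), computes each factor from the average of the $m_{k}$ diagonal blocks $A(i,i|k)$, and uses hypothetical names (`project_onto_Kp`, `kron_sum`) and a dense $p\times p$ input.

```python
import numpy as np

def kron_sum(factors):
    """Dense Kronecker sum: sum over k of I (x) A_k (x) I."""
    dims = [F.shape[0] for F in factors]
    p = int(np.prod(dims))
    total = np.zeros((p, p))
    for k, F in enumerate(factors):
        left = np.eye(int(np.prod(dims[:k])))
        right = np.eye(int(np.prod(dims[k + 1:])))
        total += np.kron(np.kron(left, F), right)
    return total

def project_onto_Kp(A, dims):
    """Frobenius-orthogonal projection of a p x p matrix onto the Kronecker sums."""
    K = len(dims)
    p = int(np.prod(dims))
    # Axes 0..K-1 index the row multi-index, axes K..2K-1 the column multi-index.
    T = A.reshape(tuple(dims) + tuple(dims))
    tau = np.trace(A) / p                      # coefficient of the identity term
    trace_free = []
    for k, d in enumerate(dims):
        rows = list(range(K))
        cols = list(range(K))
        cols[k] = K                            # keep mode k free, trace out the rest
        m_k = p // d
        B = np.einsum(T, rows + cols, [k, K]) / m_k   # average of the A(i,i|k)
        trace_free.append(B - tau * np.eye(d))        # tr(B) / d equals tau
    return kron_sum(trace_free) + tau * np.eye(p)

rng = np.random.default_rng(1)
dims = [2, 3, 2]                               # hypothetical d_1, d_2, d_3 (p = 12)
M = rng.standard_normal((12, 12))
A = M + M.T                                    # arbitrary symmetric input
P = project_onto_Kp(A, dims)

# A random element S of the subspace, used to check orthogonality of the residual.
sym = [rng.standard_normal((d, d)) for d in dims]
S = kron_sum([F + F.T for F in sym])
```

Under these assumptions, projecting twice changes nothing, an exact Kronecker sum is returned unchanged, and the residual `A - P` has zero Frobenius inner product with any Kronecker sum such as `S`.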

Appendix J Known diagonal elements (correlation matrix form)

In the case where the diagonal $\mathrm{diag}(\Omega_{0})$ of the precision matrix is known a priori, the estimation problem becomes easier. For simplicity, we consider the case that $\Omega_{0}$ is in the form of a correlation matrix, i.e. $\mathrm{diag}(\Omega_{0})={I}_{p}$, noting that this setting was the original focus of Kalaitzis et al. (2013).

Note that since the diagonal elements are known, we do not need to estimate them and indeed can set all the $\mathrm{diag}(\Psi_{k})=\frac{1}{K}I_{d_{k}}$. Revisiting the proof of Theorem 1 yields the following corollary, which gives strong $O(\sqrt{(K+1)s\frac{\log p}{n\min_{k}m_{k}}})$ convergence in the case of $\ell_{1}$ regularization. This replacement of the $\sqrt{p+s}$ term in the rate of Theorem 1 with a $\sqrt{s}$ term guarantees single sample convergence in the sparse setting when $\min_{k}m_{k}\gg s$.

Corollary J.57.

Suppose the conditions of Theorem 1 hold, and that $\mathrm{diag}(\Omega_{0})={I}_{p}$ is known. Then under event $\mathcal{A}$,

[TABLE]

Furthermore, event $\mathcal{A}$ holds with probability at least $1-2(K+1)\exp(-c\log p)$ .

Proof J.58.

Dropping the diagonal term from the proof of Lemma E.24, we have that the $\sqrt{p}$ dependence vanishes, and on event $\mathcal{A}$ , we have $G(\Delta)>0$ for all $\Delta\in\mathcal{T}_{n}$ where

[TABLE]

and

[TABLE]

The rest of the proof follows by substituting this new value of $r_{n,\mathbf{p}}$ into the proof of Theorem 1.

Appendix K SCAD and MCP regularizers

The SCAD penalty (Fan and Li, 2001) with parameter $a>2$ (giving $\mu=1/(a-1)$ ) is given by

[TABLE]

which is linear (like the $\ell_{1}$ norm) for small $|t|$, constant for large $|t|$, and has a transition between the two regimes for moderate $|t|$.

The MCP penalty (Zhang et al., 2010) with parameter $a>0$ (giving $\mu=1/a$ ) is given by

[TABLE]

giving a smoother transition between the approximately linear region and the constant region ($|t|>\rho a$).
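For concreteness, the two penalties can be sketched in NumPy. The formulas below are the standard textbook forms of SCAD (Fan and Li, 2001) and MCP, assumed here to match the displays above; the function names, the argument name `rho` for the regularization parameter, and the default values of `a` are illustrative choices.

```python
import numpy as np

def scad_penalty(t, rho, a=3.7):
    """Standard SCAD penalty with regularization parameter rho and a > 2."""
    t = np.abs(np.asarray(t, dtype=float))
    linear = rho * t                                              # |t| <= rho
    quad = (2 * a * rho * t - t**2 - rho**2) / (2 * (a - 1))      # rho < |t| <= a*rho
    const = (a + 1) * rho**2 / 2                                  # |t| > a*rho
    return np.where(t <= rho, linear, np.where(t <= a * rho, quad, const))

def mcp_penalty(t, rho, a=3.0):
    """Standard MCP penalty with regularization parameter rho and a > 0."""
    t = np.abs(np.asarray(t, dtype=float))
    quad = rho * t - t**2 / (2 * a)                               # |t| <= a*rho
    const = a * rho**2 / 2                                        # |t| > a*rho
    return np.where(t <= a * rho, quad, const)

t = np.linspace(-5.0, 5.0, 1001)
scad_vals = scad_penalty(t, rho=1.0)
mcp_vals = mcp_penalty(t, rho=1.0)
```

In both cases the second derivative on the transition region is $-\mu$, with $\mu=1/(a-1)$ for SCAD and $\mu=1/a$ for MCP, which is the curvature that the $\mu$ values quoted above refer to.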

Bibliography


Allen, G. I. and Tibshirani, R. (2010) Transposable regularized covariance models with an application to missing data imputation. The Annals of Applied Statistics, 4, 764–790.
Andrianov, S. N. (1997) A matrix representation of Lie algebraic methods for design of nonlinear beam lines. In AIP Conference Proceedings, vol. 391, 355–360. AIP.
Augustin, N. H., Musio, M., von Wilpert, K., Kublin, E., Wood, S. N. and Schumacher, M. (2009) Modeling spatiotemporal forest health monitoring data. Journal of the American Statistical Association, 104, 899–911.
Banerjee, O., El Ghaoui, L. and d’Aspremont, A. (2008) Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. The Journal of Machine Learning Research, 9, 485–516.
Barzilai, J. and Borwein, J. M. (1988) Two-point step size gradient methods. IMA Journal of Numerical Analysis, 8, 141–148.
Beck, A. and Teboulle, M. (2009) A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2, 183–202.
Beckermann, B., Kressner, D. and Tobler, C. (2013) An error analysis of Galerkin projection methods for linear systems with tensor product structure. SIAM Journal on Numerical Analysis, 51, 3307–3326.
Boyd, S. and Vandenberghe, L. (2009) Convex Optimization. Cambridge University Press.
