Recursive blocked algorithms for linear systems with Kronecker product structure

Minhong Chen; Daniel Kressner

arXiv:1905.09539·math.NA·May 24, 2019

TL;DR

This paper extends recursive blocked algorithms to higher-dimensional Sylvester-like equations, enabling efficient solutions for PDE discretizations and economic models, outperforming existing methods.

Contribution

It introduces a novel recursive algorithm that handles higher-dimensional Kronecker-structured equations more efficiently than previous approaches.

Findings

01. Algorithm outperforms existing Sylvester solvers

02. Efficiently solves PDE discretizations with separable coefficients

03. Applicable to macroeconomic model approximations

Abstract

Recursive blocked algorithms have proven to be highly efficient at the numerical solution of the Sylvester matrix equation and its generalizations. In this work, we show that these algorithms extend in a seamless fashion to higher-dimensional variants of generalized Sylvester matrix equations, as they arise from the discretization of PDEs with separable coefficients or the approximation of certain models in macroeconomics. By combining recursions with a mechanism for merging dimensions, an efficient algorithm is derived that outperforms existing approaches based on Sylvester solvers.

Equations

A_{1} X + X A_{2}^{T} = B,

A_{1,11} X_{1} + X_{1} A_{2}^{T}

A_{1,22} X_{2} + X_{2} A_{2}^{T}

A X = B,

A = A_{d} \otimes I_{n_{d-1}} \otimes \dots \otimes I_{n_{1}} + I_{n_{d}} \otimes A_{d-1} \otimes I_{n_{d-2}} \otimes \dots \otimes I_{n_{1}} + \dots + I_{n_{d}} \otimes \dots \otimes I_{n_{2}} \otimes A_{1},

A = I_{n_{d}} \otimes I_{n_{d-1}} \otimes \dots \otimes I_{n_{2}} \otimes A_{1} + A_{d} \otimes A_{d-1} \otimes \dots \otimes A_{2} \otimes C,

A = A_{d} \otimes \dots \otimes A_{2} \otimes C - \lambda I_{n_{1} \dots n_{d}} .

X_{(\mu)} (i_{\mu}, j) = X (i_{1}, \dots, i_{d}),

j = i (i_{1}, \dots, i_{\mu-1}, i_{\mu+1}, \dots, i_{d}) := 1 + \sum_{\substack{\nu=1 \\ \nu\neq\mu}}^{d} (i_{\nu} - 1) \prod_{\substack{\eta=1 \\ \eta\neq\mu}}^{\nu-1} n_{\eta} .

X \times_{1} A_{1} + X \times_{2} A_{2} + \dots + X \times_{d} A_{d} = B .

X_{1} \times_{\mu} A_{\mu,11} + \sum_{\substack{\nu=1 \\ \nu\neq\mu}}^{d} X_{1} \times_{\nu} A_{\nu}

X_{2} \times_{\mu} A_{\mu,22} + \sum_{\substack{\nu=1 \\ \nu\neq\mu}}^{d} X_{2} \times_{\nu} A_{\nu}

X_{1}

X_{2}

\mathsf{comp}(n)=O\big(n^{d+1}\big)+(2^{d})^{\log_{2}(n/n_{\min})}\mathsf{comp}(n_{\min})=O\big(n^{d+1}\big)+\frac{n^{d}}{n^{d}_{\min}}\mathsf{comp}(n_{\min}).

\mathsf{comp}(n)=O\big(n^{d+1}+n^{d}_{\min}n^{d}\big).

A_{1}^{\prime} = I_{n_{2}} \otimes A_{1} + A_{2} \otimes I_{n_{1}}

X^{\prime} \times_{1} A_{1}^{\prime} + X^{\prime} \times_{3} A_{3} + \dots + X^{\prime} \times_{d} A_{d} = B^{\prime},

A_{1}^{\prime} X^{\prime} + X^{\prime} A_{3}^{T} = B^{\prime},

\overline{\mathsf{comp}}_{d}(n_{\min})=O\big(n_{\min}^{d+2}\big)+n_{\min}\overline{\mathsf{comp}}_{d-1}(n_{\min})=O\big(n_{\min}^{d+2}\big)+n^{d-3}_{\min}\,\overline{\mathsf{comp}}_{3}(n_{\min}).

O\big(n^{d+1}+n_{\min}^{2}n^{d}\big)

A_{1}={\scriptsize\left[\begin{array}[]{cccc}\times&\times&\times&\times\\ 0&\times&\times&\times\\ 0&\times&\times&\times\\ 0&0&0&\times\end{array}\right]},\quad A_{2}={\scriptsize\left[\begin{array}[]{ccc}\times&\times&\times\\ 0&\times&\times\\ 0&\times&\times\end{array}\right]},\quad A_{1}^{\prime}={\scriptsize\left[\begin{array}[]{cccc|cccc|cccc}\times&\times&\times&\times&\times&0&0&0&\times&0&0&0\\ 0&\times&\times&\times&0&\times&0&0&0&\times&0&0\\ 0&\times&\times&\times&0&0&\times&0&0&0&\times&0\\ 0&0&0&\times&0&0&0&\times&0&0&0&\times\\ \hline\cr 0&0&0&0&\times&\times&\times&\times&\times&0&0&0\\ 0&0&0&0&0&\times&\times&\times&0&\times&0&0\\ 0&0&0&0&0&\times&\times&\times&0&0&\times&0\\ 0&0&0&0&0&0&0&\times&0&0&0&\times\\ \hline\cr 0&0&0&0&\times&0&0&0&\times&\times&\times&\times\\ 0&0&0&0&0&\times&0&0&0&\times&\times&\times\\ 0&0&0&0&0&0&\times&0&0&\times&\times&\times\\ 0&0&0&0&0&0&0&\times&0&0&0&\times\end{array}\right]}.

P^{T}A_{1}^{\prime}P={\scriptsize\left[\begin{array}[]{cccc|cc|cccc|cc}\times&\times&\times&\times&\times&\times&0&0&0&0&0&0\\ 0&\times&\times&\times&0&0&\times&\times&0&0&0&0\\ 0&\times&\times&\times&0&0&0&0&\times&\times&0&0\\ 0&0&0&\times&0&0&0&0&0&0&\times&\times\\ \hline\cr 0&0&0&0&\times&\times&\times&0&\times&0&\times&0\\ 0&0&0&0&\times&\times&0&\times&0&\times&0&\times\\ \hline\cr 0&0&0&0&0&0&\times&\times&\times&0&\times&0\\ 0&0&0&0&0&0&\times&\times&0&\times&0&\times\\ 0&0&0&0&0&0&\times&0&\times&\times&\times&0\\ 0&0&0&0&0&0&0&\times&\times&\times&0&\times\\ \hline\cr 0&0&0&0&0&0&0&0&0&0&\times&\times\\ 0&0&0&0&0&0&0&0&0&0&\times&\times\end{array}\right].}

\begin{array}[]{l}(I\otimes A_{1}+A_{2}\otimes I)X+XA_{3}^{T}=B,\quad(I\otimes A_{1}+A_{2}\otimes I)X+X(I\otimes A_{3}+A_{4}\otimes I)^{T}=B\\ (I\otimes A_{1}+A_{2}\otimes I)X+X(I\otimes I\otimes A_{3}+I\otimes A_{4}\otimes I+A_{5}\otimes I\otimes I)^{T}=B\end{array}

X \times_{1} A_{1} + X \times_{1} C \times_{2} A_{2} \times_{3} A_{3} \dots \times_{d} A_{d} = B .

X_{1} \times_{1} A_{1,11} + X_{1} \times_{1} C_{11} \times_{2} A_{2} \times_{3} \dots \times_{d} A_{d}

X_{2} \times_{1} A_{1,22} + X_{2} \times_{1} C_{22} \times_{2} A_{2} \times_{3} \dots \times_{d} A_{d}

X_{1} \times_{1} A_{1} + X_{1} \times_{1} C \times_{2} \dots \times_{\mu} A_{\mu,11} \times_{\mu+1} \dots \times_{d} A_{d}

X_{2} \times_{1} A_{1} + X_{2} \times_{1} C \times_{2} \dots \times_{\mu} A_{\mu,22} \times_{\mu+1} \dots \times_{d} A_{d}

A_{d-1}^{\prime} = A_{d} \otimes A_{d-1}

X^{\prime} \times_{1} A_{1} + X^{\prime} \times_{1} C \times_{2} A_{2} \times_{3} \dots \times_{d} A_{d-1}^{\prime} = B^{\prime},

A_{d-1}={\scriptsize\left[\begin{array}[]{cccc}\times&\times&\times&\times\\ 0&\times&\times&\times\\ 0&\times&\times&\times\\ 0&0&0&\times\end{array}\right]},\quad A_{d}={\scriptsize\left[\begin{array}[]{ccc}\times&\times&\times\\ 0&\times&\times\\ 0&\times&\times\end{array}\right]},\quad A_{d-1}^{\prime}={\scriptsize\left[\begin{array}[]{cccc|cccc|cccc}\times&\times&\times&\times&\times&\times&\times&\times&\times&\times&\times&\times\\ 0&\times&\times&\times&0&\times&\times&\times&0&\times&\times&\times\\ 0&\times&\times&\times&0&\times&\times&\times&0&\times&\times&\times\\ 0&0&0&\times&0&0&0&\times&0&0&0&\times\\ \hline\cr 0&0&0&0&\times&\times&\times&\times&\times&\times&\times&\times\\ 0&0&0&0&0&\times&\times&\times&0&\times&\times&\times\\ 0&0&0&0&0&\times&\times&\times&0&\times&\times&\times\\ 0&0&0&0&0&0&0&\times&0&0&0&\times\\ \hline\cr 0&0&0&0&\times&\times&\times&\times&\times&\times&\times&\times\\ 0&0&0&0&0&\times&\times&\times&0&\times&\times&\times\\ 0&0&0&0&0&\times&\times&\times&0&\times&\times&\times\\ 0&0&0&0&0&0&0&\times&0&0&0&\times\end{array}\right]},

P^{T}A_{d-1}^{\prime}P={\scriptsize\left[\begin{array}[]{cccc|cc|cccc|cc}\times&\times&\times&\times&\times&\times&\times&\times&\times&\times&\times&\times\\ 0&\times&\times&\times&0&0&\times&\times&\times&\times&\times&\times\\ 0&\times&\times&\times&0&0&\times&\times&\times&\times&\times&\times\\ 0&0&0&\times&0&0&0&0&0&0&\times&\times\\ \hline\cr 0&0&0&0&\times&\times&\times&\times&\times&\times&\times&\times\\ 0&0&0&0&\times&\times&\times&\times&\times&\times&\times&\times\\ \hline\cr 0&0&0&0&0&0&\times&\times&\times&\times&\times&\times\\ 0&0&0&0&0&0&\times&\times&\times&\times&\times&\times\\ 0&0&0&0&0&0&\times&\times&\times&\times&\times&\times\\ 0&0&0&0&0&0&\times&\times&\times&\times&\times&\times\\ \hline\cr 0&0&0&0&0&0&0&0&0&0&\times&\times\\ 0&0&0&0&0&0&0&0&0&0&\times&\times\end{array}\right].}

Taxonomy

Topics: Matrix Theory and Algorithms · Tensor decomposition and applications · Model Reduction and Neural Networks

Full text

Recursive blocked algorithms for linear systems with Kronecker product structure

Minhong Chen (Department of Mathematics, Zhejiang Sci-Tech University, Hangzhou, 310029, Zhejiang, P.R. China, [email protected]). The work of this author was supported by the National Natural Science Foundation of China (Grant No. 11801513).

Daniel Kressner (Institute of Mathematics, EPF Lausanne, 1015 Lausanne, Switzerland, [email protected]).

1 Introduction

In computations with matrices, recursive blocked algorithms offer an elegant way to arrive at implementations that benefit from increased data locality and efficiently utilize highly tuned kernels. See [7] for a survey and [22] for a more recent testimony of this principle. These algorithms have proven particularly effective for solving Sylvester equations, that is, matrix equations of the form

$$A_{1}X+XA_{2}^{T}=B,\tag{1.1}$$

where $A_{1}\in{\mathbb{R}}^{n_{1}\times n_{1}}$, $A_{2}\in{\mathbb{R}}^{n_{2}\times n_{2}}$, $B\in{\mathbb{R}}^{n_{1}\times n_{2}}$ are given and $X\in{\mathbb{R}}^{n_{1}\times n_{2}}$ is unknown. In the Bartels-Stewart algorithm [2], the matrices $A_{1}$ and $A_{2}$ are first reduced to block upper triangular form by real Schur decompositions. The reduced problem is then solved by a variant of backward substitution. Both stages of the algorithm require $O(n^{3})$ operations, with $n=\max\{n_{1},n_{2}\}$. Because it consists entirely of level 2 BLAS operations, the backward substitution step performs quite poorly. To avoid this, Jonsson and Kågström [12, 13] have proposed recursive algorithms for triangular Sylvester and related matrix equations. The recursive algorithm for solving (1.1) with upper quasi-triangular $A_{1},A_{2}$ starts with partitioning the matrix of larger size. Assuming $n_{1}\geq n_{2}$, let $A_{1}=\begin{pmatrix}A_{1,11}&A_{1,12}\\ 0&A_{1,22}\end{pmatrix}$ with $A_{1,11}\in{\mathbb{R}}^{k\times k}$ such that $k\approx n_{1}/2$ and partition $X=\begin{pmatrix}X_{1}\\ X_{2}\end{pmatrix}$, $B=\begin{pmatrix}B_{1}\\ B_{2}\end{pmatrix}$ correspondingly. Then (1.1) becomes equivalent to

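$$A_{1,11}X_{1}+X_{1}A_{2}^{T}=B_{1}-A_{1,12}X_{2},\tag{1.2a}$$

$$A_{1,22}X_{2}+X_{2}A_{2}^{T}=B_{2}.\tag{1.2b}$$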

First the Sylvester equation (1.2b) is solved recursively, then the right-hand side of (1.2a) is updated, and finally (1.2a) is solved recursively. Apart from the solution of small-sized Sylvester equations at the lowest recursion level, the entire algorithm consists of matrix-matrix multiplications $A_{1,12}X_{2}$ and thus attains high performance by leveraging level 3 BLAS. As emphasized in [7, 22], recursive algorithms are less sensitive to parameter tuning than blocked algorithms.
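The recursion just described can be sketched in a few lines of NumPy. This is only an illustration under simplifying assumptions: the coefficients are taken to be upper triangular (not quasi-triangular), so any split point is admissible; the name `rec_sylv` and the default `n_min` are hypothetical; and the base case solves the small Kronecker system densely instead of calling a tuned kernel.

```python
import numpy as np

def rec_sylv(A1, A2, B, n_min=4):
    """Solve A1 X + X A2^T = B for upper triangular A1, A2 (illustrative sketch)."""
    n1, n2 = B.shape
    if max(n1, n2) <= n_min:
        # Base case: assemble I (x) A1 + A2 (x) I and solve for the column-major vec(X).
        K = np.kron(np.eye(n2), A1) + np.kron(A2, np.eye(n1))
        x = np.linalg.solve(K, B.flatten(order="F"))
        return x.reshape((n1, n2), order="F")
    if n1 >= n2:
        # Split A1 and the rows of X, B: solve for X2 first, update, then solve for X1.
        k = n1 // 2
        X2 = rec_sylv(A1[k:, k:], A2, B[k:, :], n_min)
        X1 = rec_sylv(A1[:k, :k], A2, B[:k, :] - A1[:k, k:] @ X2, n_min)
        return np.vstack([X1, X2])
    # Otherwise split A2 and the columns of X, B in the same fashion.
    k = n2 // 2
    X2 = rec_sylv(A1, A2[k:, k:], B[:, k:], n_min)
    X1 = rec_sylv(A1, A2[:k, :k], B[:, :k] - X2 @ A2[:k, k:].T, n_min)
    return np.hstack([X1, X2])
```

Apart from the base case, all work is done by the two matrix-matrix products used to update the right-hand sides, which is what makes the recursive formulation rich in level 3 BLAS operations.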

The described algorithm extends to generalized and coupled Sylvester equations, such as $A_{1}XM_{1}+M_{2}XA_{2}^{T}=B$; see [13, 23]. Interestingly, a numerically stable recursive formulation of Hammarling’s method [11] for solving stable Lyapunov equations remains an open problem [16].

In this paper, we propose several new extensions that address high-dimensional variants of Sylvester equations. More specifically, we aim at computing a tensor ${\textbf{X}}\in{\mathbb{R}}^{n_{1}\times n_{2}\times\cdots\times n_{d}}$ satisfying the linear equation

$${\mathcal{A}}({\textbf{X}})={\textbf{B}},\tag{1.3}$$

where ${\mathcal{A}}:{\mathbb{R}}^{n_{1}\times\cdots\times n_{d}}\to{\mathbb{R}}^{n_{1}\times\cdots\times n_{d}}$ is a linear operator and ${\textbf{B}}\in{\mathbb{R}}^{n_{1}\times n_{2}\times\cdots\times n_{d}}$. For $d=2$, this formulation includes the Sylvester equation (1.1) and its generalizations mentioned above as special cases. For example, for (1.1) the matrix representation of ${\mathcal{A}}$ is given by $A=A_{2}\otimes I_{n_{1}}+I_{n_{2}}\otimes A_{1}$.

The operator ${\mathcal{A}}$ needs to be of a very particular form such that (1.3) is amenable to the techniques discussed in this work. Motivated by their relevance in applications, we focus on two classes of operators.

Linear systems with Laplace-like structure.

In Section 2, we consider discrete Laplace-like operators ${\mathcal{A}}$ having the matrix representation

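$$A=A_{d}\otimes I_{n_{d-1}}\otimes\cdots\otimes I_{n_{1}}+I_{n_{d}}\otimes A_{d-1}\otimes I_{n_{d-2}}\otimes\cdots\otimes I_{n_{1}}+\cdots+I_{n_{d}}\otimes\cdots\otimes I_{n_{2}}\otimes A_{1},\tag{1.4}$$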

with $A_{\mu}\in{\mathbb{R}}^{n_{\mu}\times n_{\mu}}$ , $\mu=1,\ldots,d$ . Using the vectorization of tensors, (1.3) can equivalently be written as $A\operatorname{vec}({\textbf{X}})=\operatorname{vec}({\textbf{B}})$ . Discrete Laplace-like operators arise from the structured discretization of $d$ -dimensional PDEs with separable coefficients on tensorized domains. For more general PDEs, matrices of the form (1.4) can sometimes be used to construct effective preconditioners; see [24, 25] for examples. Other applications of (1.4) arise from Markov chain models [5, 26] used, e.g., for simulating interconnected systems.

Generalized Sylvester equations with Kronecker structure.

Section 3 is concerned with the second class of operators ${\mathcal{A}}$ considered in this work, which have a matrix representation of the form

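$$A=I_{n_{d}}\otimes I_{n_{d-1}}\otimes\cdots\otimes I_{n_{2}}\otimes A_{1}+A_{d}\otimes A_{d-1}\otimes\cdots\otimes A_{2}\otimes C,\tag{1.5}$$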

with $A_{\mu}\in{\mathbb{R}}^{n_{\mu}\times n_{\mu}}$ for $\mu=1,\ldots,d$ and $C\in{\mathbb{R}}^{n_{1}\times n_{1}}$. For $d=2$, the linear system (1.3) now becomes equivalent to the generalized Sylvester equation $A_{1}X+CXA_{2}^{T}=B$. For $d>2$, we can view (1.3) equivalently as a generalized Sylvester equation with coefficients that feature Kronecker structure. If $A_{1}=-\lambda I$ for some $\lambda\in{\mathbb{R}}$ then

$$A=A_{d}\otimes\cdots\otimes A_{2}\otimes C-\lambda I_{n_{1}\cdots n_{d}}.\tag{1.6}$$

Linear systems featuring such shifted Kronecker products have been discussed in [19]. The more general case (1.5) arises from approximations of discrete time DSGE models [3], which play a central role in macroeconomics.

Recent work on the solution of linear tensor equations (1.3) has focused on the development of highly efficient approximate and iterative solvers that assume and exploit low-rank tensor structure in the right-hand side and the solution; see [9, 10] for overviews. In some cases, these developments can be combined with the methods developed in this work, which do not assume any such structure. For example, if the tensor Krylov subspace method [17] is applied to (1.4) for large-scale coefficients $A_{\mu}$ then our method can be used to solve the smaller-sized linear systems occurring in the method. As far as we know, all existing direct non-iterative solvers for linear tensor equations combine the Bartels-Stewart method for (generalized) Sylvester equations with a recursive traversal of the dimension. Instances of this approach can be found in [18, 27] for (1.4), in [14] for (1.5), and in [19] for (1.6). For $d\geq 3$ , we are not aware of any work on (recursive) blocked methods that would allow for the effective use of level 3 BLAS.

2 A recursive blocked algorithm for Laplace-like equations

Let us first recall two basic operations for tensors from [15]. The $\mu$ th matricization of a tensor ${\textbf{X}}\in{\mathbb{R}}^{n_{1}\times\cdots\times n_{d}}$ is the matrix $X_{(\mu)}\in{\mathbb{R}}^{n_{\mu}\times(n_{1}\cdots n_{\mu-1}n_{\mu+1}\cdots n_{d})}$ obtained by mapping the $\mu$ th index to the rows and all other indices to the columns:

$$X_{(\mu)}(i_{\mu},j)={\textbf{X}}(i_{1},\ldots,i_{d}),$$

with the column index $j$ defined via the index map

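$$j=i(i_{1},\ldots,i_{\mu-1},i_{\mu+1},\ldots,i_{d}):=1+\sum_{\substack{\nu=1\\ \nu\neq\mu}}^{d}(i_{\nu}-1)\prod_{\substack{\eta=1\\ \eta\neq\mu}}^{\nu-1}n_{\eta}.$$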

The $\mu$-mode matrix multiplication of ${\textbf{X}}$ with a matrix $A\in{\mathbb{R}}^{m\times n_{\mu}}$ is the tensor ${\textbf{Y}}={\textbf{X}}\times_{\mu}A$ satisfying $Y_{(\mu)}=AX_{(\mu)}$. This allows us to rewrite (1.3)–(1.4) as

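$${\textbf{X}}\times_{1}A_{1}+{\textbf{X}}\times_{2}A_{2}+\cdots+{\textbf{X}}\times_{d}A_{d}={\textbf{B}}.\tag{2.2}$$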

It is well known that this equation has a unique solution if and only if $\lambda_{1}+\cdots+\lambda_{d}\not=0$ for any eigenvalues $\lambda_{1}$ of $A_{1}$ , $\lambda_{2}$ of $A_{2}$ , etc. In the following, we will assume that this condition is satisfied.
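To make the two tensor operations concrete, the following NumPy sketch implements the matricization and the $\mu$-mode product and checks numerically that the tensor form of the equation agrees with the Kronecker-sum matrix (1.4) applied to the column-major vectorization. The helper names (`unfold`, `mode_mult`, `laplace_apply`, `laplace_matrix`) are hypothetical, modes are indexed from 0, and the random test data are an assumption of this example only.

```python
import numpy as np

def unfold(X, mu):
    """mu-th matricization: mode mu indexes rows, remaining modes (first fastest) the columns."""
    return np.moveaxis(X, mu, 0).reshape((X.shape[mu], -1), order="F")

def mode_mult(X, A, mu):
    """Return X x_mu A, i.e. the tensor Y with Y_(mu) = A @ X_(mu)."""
    Y = np.tensordot(A, X, axes=(1, mu))  # the contracted mode becomes axis 0
    return np.moveaxis(Y, 0, mu)          # move it back to position mu

def laplace_apply(X, As):
    """Left-hand side X x_1 A_1 + ... + X x_d A_d."""
    return sum(mode_mult(X, A, mu) for mu, A in enumerate(As))

def laplace_matrix(As):
    """Assemble the Kronecker-sum matrix of (1.4) acting on column-major vec(X)."""
    total = 0.0
    for mu, A in enumerate(As):
        term = np.eye(1)
        for nu, A_nu in enumerate(As):
            factor = A if nu == mu else np.eye(A_nu.shape[0])
            term = np.kron(factor, term)  # later modes are placed further to the left
        total = total + term
    return total

rng = np.random.default_rng(0)
sizes = (3, 4, 2)
As = [rng.standard_normal((n, n)) for n in sizes]
X = rng.standard_normal(sizes)
lhs_tensor = laplace_apply(X, As).flatten(order="F")
lhs_matrix = laplace_matrix(As) @ X.flatten(order="F")
print(np.allclose(lhs_tensor, lhs_matrix))
```

Using `order="F"` throughout matches the index map above, in which the first index varies fastest.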

Algorithm 1 describes our general framework for solving (2.2). Using real Schur decompositions [8, Sec. 7.4], the coefficient matrices are first transformed to reduced form. More specifically, for each $\mu=1,\ldots,d$ an orthogonal matrix $U_{\mu}$ is computed such that $\tilde{A}_{\mu}:=U_{\mu}^{T}A_{\mu}U_{\mu}$ is in upper quasi-triangular form, that is, $\tilde{A}_{\mu}$ is an upper block triangular matrix with $1\times 1$ blocks containing its real eigenvalues and $2\times 2$ blocks containing its complex eigenvalues in conjugate pairs. The right-hand side and the solution tensor need to be transformed accordingly by $\mu$ -mode matrix multiplications. For the rest of this section, we focus on line 3 of Algorithm 1, that is, the solution of the tensor equation with the reduced coefficients.
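The transformation steps of this framework can be sketched as follows. The sketch assumes that Schur factors $U_{\mu}$ and triangular $T_{\mu}$ with $A_{\mu}=U_{\mu}T_{\mu}U_{\mu}^{T}$ are already available (in practice they would come from a Schur routine; in the test they are manufactured from a QR factorization). The names `kron_sum` and `solve_with_factors` are hypothetical, and the reduced equation is solved by a dense stand-in rather than by the recursive solver developed below.

```python
import numpy as np

def mode_mult(X, A, mu):
    """Return X x_mu A (mu is 0-based)."""
    return np.moveaxis(np.tensordot(A, X, axes=(1, mu)), 0, mu)

def kron_sum(As):
    """Kronecker-sum matrix of (1.4) for the column-major vectorization."""
    total = 0.0
    for mu, A in enumerate(As):
        term = np.eye(1)
        for nu, A_nu in enumerate(As):
            factor = A if nu == mu else np.eye(A_nu.shape[0])
            term = np.kron(factor, term)
        total = total + term
    return total

def solve_with_factors(Us, Ts, B):
    """Solve sum_mu X x_mu A_mu = B, given A_mu = U_mu T_mu U_mu^T with orthogonal U_mu."""
    # Step 1: transform the right-hand side, B~ = B x_1 U_1^T ... x_d U_d^T.
    Bt = B
    for mu, U in enumerate(Us):
        Bt = mode_mult(Bt, U.T, mu)
    # Step 2: solve the reduced equation with coefficients T_mu (dense stand-in).
    xt = np.linalg.solve(kron_sum(Ts), Bt.flatten(order="F"))
    X = xt.reshape(B.shape, order="F")
    # Step 3: transform back, X = X~ x_1 U_1 ... x_d U_d.
    for mu, U in enumerate(Us):
        X = mode_mult(X, U, mu)
    return X
```

The back-transformation works because $U_{\mu}^{T}A_{\mu}U_{\mu}=T_{\mu}$ in each mode, so multiplying the original equation by $U_{\mu}^{T}$ in every mode yields exactly the reduced equation for the transformed unknown.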

2.1 Recursion

By Algorithm 1, we may assume that $A_{1}\in{\mathbb{R}}^{n_{1}\times n_{1}},\ldots,A_{d}\in{\mathbb{R}}^{n_{d}\times n_{d}}$ are already in upper quasi-triangular form. Choose $\mu$ such that $n_{\mu}=\max_{\nu}n_{\nu}$ and $k$ such that $k\approx n_{\mu}/2$ and $A_{\mu}(k+1,k)=0$ . Partitioning $A_{\mu}=\begin{pmatrix}A_{\mu,11}&A_{\mu,12}\\ 0&A_{\mu,22}\end{pmatrix}$ with $A_{\mu,11}\in{\mathbb{R}}^{k\times k}$ , equation (2.2) becomes equivalent to

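$${\textbf{X}}_{1}\times_{\mu}A_{\mu,11}+\sum_{\substack{\nu=1\\ \nu\neq\mu}}^{d}{\textbf{X}}_{1}\times_{\nu}A_{\nu}={\textbf{B}}_{1}-{\textbf{X}}_{2}\times_{\mu}A_{\mu,12},\tag{2.3a}$$

$${\textbf{X}}_{2}\times_{\mu}A_{\mu,22}+\sum_{\substack{\nu=1\\ \nu\neq\mu}}^{d}{\textbf{X}}_{2}\times_{\nu}A_{\nu}={\textbf{B}}_{2},\tag{2.3b}$$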

where

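$${\textbf{X}}_{1}={\textbf{X}}(:,\ldots,:,1{:}k,:,\ldots,:),\qquad{\textbf{X}}_{2}={\textbf{X}}(:,\ldots,:,k+1{:}n_{\mu},:,\ldots,:),$$

with the index range placed in the $\mu$th mode,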

and ${\textbf{B}}_{1},{\textbf{B}}_{2}$ are defined analogously. Noting that (2.3b) and (2.3a) are again equations with Laplace-like operators, they can be solved recursively. The recursion is stopped once the maximal size is below a user-specified block size $n_{\min}\geq 2$ . These considerations lead to Algorithm 2.
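A minimal NumPy sketch of this recursion is given below. It is not the authors' implementation: it assumes upper triangular coefficients (so that every split point $k$ is admissible), the names `rec_laplace` and `kron_sum` are hypothetical, and the base case uses a general dense solve where a triangular backward substitution would suffice.

```python
import numpy as np

def mode_mult(X, A, mu):
    """Return X x_mu A (mu is 0-based)."""
    return np.moveaxis(np.tensordot(A, X, axes=(1, mu)), 0, mu)

def kron_sum(As):
    """Kronecker-sum matrix of (1.4) for the column-major vectorization."""
    total = 0.0
    for mu, A in enumerate(As):
        term = np.eye(1)
        for nu, A_nu in enumerate(As):
            factor = A if nu == mu else np.eye(A_nu.shape[0])
            term = np.kron(factor, term)
        total = total + term
    return total

def rec_laplace(As, B, n_min=2):
    """Solve sum_mu X x_mu A_mu = B for upper triangular A_mu (illustrative sketch)."""
    sizes = B.shape
    mu = int(np.argmax(sizes))              # split the largest mode
    if sizes[mu] <= n_min:
        # Base case: assemble the small system and solve it directly.
        x = np.linalg.solve(kron_sum(As), B.flatten(order="F"))
        return x.reshape(sizes, order="F")
    k = sizes[mu] // 2
    A = As[mu]
    lo = [slice(None)] * B.ndim
    hi = [slice(None)] * B.ndim
    lo[mu], hi[mu] = slice(0, k), slice(k, None)
    # Solve the equation for X2, cf. (2.3b).
    As2 = list(As)
    As2[mu] = A[k:, k:]
    X2 = rec_laplace(As2, B[tuple(hi)], n_min)
    # Update the right-hand side with X2 x_mu A_{mu,12} and solve for X1, cf. (2.3a).
    As1 = list(As)
    As1[mu] = A[:k, :k]
    rhs = B[tuple(lo)] - mode_mult(X2, A[:k, k:], mu)
    X1 = rec_laplace(As1, rhs, n_min)
    return np.concatenate([X1, X2], axis=mu)
```

For quasi-triangular coefficients, `k` would additionally have to be chosen so that the split does not cut through a $2\times 2$ diagonal block, as required by the condition $A_{\mu}(k+1,k)=0$.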

Let $\mathsf{comp}(n)$ denote the complexity of Algorithm 2 for even $n=n_{1}=\cdots=n_{d}$ . On the top level of recursion Algorithm 2 is applied to one $n\times\cdots\times n$ tensor, on the second level to two $n/2\times n\times\cdots\times n$ tensors, on the third level to four $n/2\times n/2\times n\times\cdots\times n$ tensors, and so on. Under the slightly simplified assumption that the multiplication of an $n/2\times n/2$ quasi-triangular matrix with a vector requires $n^{2}/4$ floating point operations (flops), each level of the first $d$ recursions requires a total of $n^{d+1}/4$ flops to execute the matrix-matrix multiplications in line 6 of Algorithm 2. After $d$ recursions of Algorithm 2, $n$ has been reduced to $n/2$ in each mode and, therefore, $\mathsf{comp}(n)=dn^{d+1}/4+2^{d}\mathsf{comp}(n/2).$ Assuming that $n/n_{\min}$ is a power of two, we obtain

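$$\mathsf{comp}(n)=O\big(n^{d+1}\big)+(2^{d})^{\log_{2}(n/n_{\min})}\mathsf{comp}(n_{\min})=O\big(n^{d+1}\big)+\frac{n^{d}}{n^{d}_{\min}}\mathsf{comp}(n_{\min}).\tag{2.5}$$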

Once the maximal size of the tensor is $n_{\min}$ or below, line 2 of Algorithm 2 assembles the matrix $A$ defined in (1.4) and solves the block triangular linear system $A\operatorname{vec}({\textbf{X}})=\operatorname{vec}({\textbf{B}})$ by backward substitution. This requires $O\big{(}(n_{\min})^{2d}\big{)}$ flops and therefore

$$\mathsf{comp}(n)=O\big(n^{d+1}+n^{d}_{\min}n^{d}\big).\tag{2.6}$$

This compares favorably with the $O(n^{2d})$ operations needed by backward substitution applied to the assembled full triangular linear system. The complexity estimate (2.6) also reflects the critical role played by the solution of the small systems in line 2. On the one hand, the operation count suggests choosing $n_{\min}$ as small as possible, say, $n_{\min}=2$. On the other hand, it has been observed for $d=2$ in [12] that a small value of $n_{\min}$ creates significant overhead and requires very well tuned kernels. In the following section, we describe a technique that alleviates this difficulty.
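For readers who want the intermediate step behind (2.5), the recurrence $\mathsf{comp}(n)=dn^{d+1}/4+2^{d}\mathsf{comp}(n/2)$ can be unrolled as follows. This is a routine expansion under the same assumption as in the text ($n/n_{\min}$ a power of two); the symbol $L$ for the number of halvings per mode is shorthand introduced here.

```latex
\begin{align*}
\mathsf{comp}(n)
 &= \sum_{j=0}^{L-1} (2^{d})^{j}\,\frac{d}{4}\Big(\frac{n}{2^{j}}\Big)^{d+1}
    + (2^{d})^{L}\,\mathsf{comp}(n_{\min}),
    \qquad L := \log_{2}(n/n_{\min}),\\
 &= \frac{d\,n^{d+1}}{4}\sum_{j=0}^{L-1} 2^{-j}
    + \frac{n^{d}}{n_{\min}^{d}}\,\mathsf{comp}(n_{\min})
 \;\le\; \frac{d\,n^{d+1}}{2} + \frac{n^{d}}{n_{\min}^{d}}\,\mathsf{comp}(n_{\min}).
\end{align*}
```

Inserting $\mathsf{comp}(n_{\min})=O\big(n_{\min}^{2d}\big)$ turns the last term into $O\big(n_{\min}^{d}n^{d}\big)$, which gives (2.6).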

2.2 Merging dimensions: triangular case

To avoid the critical dependence on $n_{\min}$ observed in (2.6) we replace line 2 of Algorithm 2 by the following procedure. Once $n_{1}n_{2}\leq n_{\min}^{2}$ , the matrix

$$A_{1}^{\prime}=I_{n_{2}}\otimes A_{1}+A_{2}\otimes I_{n_{1}}\tag{2.7}$$

is formed explicitly. For the moment, let us suppose that $A_{1}$ and $A_{2}$ are upper triangular. This can be achieved by computing complex instead of real Schur decompositions in Algorithm 1, leading to a triangular tensor equation with complex coefficients. Because of roundoff error, the computed solution to the original equation will now have a (small) imaginary part. This can be safely set to zero [20].

The matrix $A^{\prime}_{1}$ inherits the triangular structure from $A_{1},A_{2}$ and the $d$ -dimensional tensor equation (2.2) is equivalent to the $(d-1)$ -dimensional equation

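$${\textbf{X}}^{\prime}\times_{1}A_{1}^{\prime}+{\textbf{X}}^{\prime}\times_{3}A_{3}+\cdots+{\textbf{X}}^{\prime}\times_{d}A_{d}={\textbf{B}}^{\prime},\tag{2.8}$$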

with reshaped ${\textbf{X}}^{\prime},{\textbf{B}}^{\prime}\in{\mathbb{C}}^{n_{1}n_{2}\times n_{3}\times\cdots\times n_{d}}$. This equation is solved recursively. A major advantage of this approach is that it reduces the order $d$. For $d=3$, the system (2.8) becomes the triangular Sylvester equation

$$A_{1}^{\prime}X^{\prime}+X^{\prime}A_{3}^{T}=B^{\prime},$$

to which the efficient solvers described in Section 1 can be applied. Note that $A_{3}^{T}$ now refers to the complex transpose of $A_{3}\in{\mathbb{C}}^{n_{3}\times n_{3}}$ . Algorithm 3 summarizes the proposed procedure.
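The merging step for $d=3$ can be illustrated with a short NumPy sketch. It assumes real upper triangular coefficients (the text works with complex triangular ones), the name `solve_merged_3d` is hypothetical, and the resulting Sylvester equation is solved by a plain column-by-column backward substitution in place of the tuned recursive solvers mentioned above.

```python
import numpy as np

def mode_mult(X, A, mu):
    """Return X x_mu A (mu is 0-based)."""
    return np.moveaxis(np.tensordot(A, X, axes=(1, mu)), 0, mu)

def solve_merged_3d(A1, A2, A3, B):
    """Solve X x_1 A1 + X x_2 A2 + X x_3 A3 = B by merging modes 1 and 2 (sketch)."""
    n1, n2, n3 = B.shape
    # Merged coefficient as in (2.7); it is upper triangular if A1 and A2 are.
    A1p = np.kron(np.eye(n2), A1) + np.kron(A2, np.eye(n1))
    # Column-major reshape merges the first two modes into one of size n1 * n2.
    Bp = B.reshape((n1 * n2, n3), order="F")
    Xp = np.zeros_like(Bp)
    I = np.eye(n1 * n2)
    # Backward substitution over the columns of A1p X' + X' A3^T = B',
    # using that A3 is upper triangular.
    for j in range(n3 - 1, -1, -1):
        rhs = Bp[:, j] - Xp[:, j + 1:] @ A3[j, j + 1:]
        Xp[:, j] = np.linalg.solve(A1p + A3[j, j] * I, rhs)
    return Xp.reshape((n1, n2, n3), order="F")
```

Each shifted matrix `A1p + A3[j, j] * I` is itself triangular, so a triangular solve could replace `np.linalg.solve`; the general solver is used here only to keep the sketch dependent on NumPy alone.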

To analyze the complexity of Algorithm 3 for $n_{1}=\cdots=n_{d}=n>2n_{\min}$ , we observe that all sizes are first reduced to $2n_{\min}$ or below before the condition in line 1 is met. Hence, up to constant factors the recursive estimate (2.5) holds and it remains to discuss the complexity for $n_{1}=\cdots=n_{d}=n_{\min}$ , which will be denoted by $\overline{\mathsf{comp}}_{d}(n_{\min})$ . The merge in line 2 reduces the order to $d-1$ but increases the first mode size to $n_{\min}^{2}$ . Approximately $\log_{2}(n_{\min}^{2}/n_{\min})=\log_{2}n_{\min}$ recursions are needed to reduce it back to $n_{\min}$ . Similarly as in Section 2.1 we calculate

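$$\overline{\mathsf{comp}}_{d}(n_{\min})=O\big(n_{\min}^{d+2}\big)+n_{\min}\overline{\mathsf{comp}}_{d-1}(n_{\min})=O\big(n_{\min}^{d+2}\big)+n^{d-3}_{\min}\,\overline{\mathsf{comp}}_{3}(n_{\min}).$$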

For $d=3$ , the solution of the triangular Sylvester equation in line 5 requires $O\big{(}n_{\min}^{5}\big{)}$ flops. In turn, $\overline{\mathsf{comp}}_{d}(n_{\min})=O\big{(}n_{\min}^{d+2}\big{)}$ . Inserted into (2.5), we arrive at

$$O\big(n^{d+1}+n_{\min}^{2}n^{d}\big)$$

flops for Algorithm 3. For $d\geq 3$ , this compares favorably with the complexity estimate (2.6) for Algorithm 2; the dependence on $n_{\min}$ has been reduced significantly. Equally importantly, Algorithm 3 allows us to leverage efficient solvers for triangular Sylvester equations, such as the ones described in [12].

2.3 Merging dimensions: quasi-triangular case

The use of complex arithmetic, which increases the cost (by a constant factor) in terms of operations and memory, can be avoided when using the real Schur form and working with quasi-triangular coefficients. However, a few modifications are needed because the matrix $A^{\prime}_{1}$ formed in (2.7) does not inherit the quasi-triangular structure from $A_{1}$ and $A_{2}$. To illustrate what happens, let us consider the following example for $n_{1}=4,n_{2}=3$:

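$$A_{1}={\scriptsize\left[\begin{array}{cccc}\times&\times&\times&\times\\ 0&\times&\times&\times\\ 0&\times&\times&\times\\ 0&0&0&\times\end{array}\right]},\quad A_{2}={\scriptsize\left[\begin{array}{ccc}\times&\times&\times\\ 0&\times&\times\\ 0&\times&\times\end{array}\right]},\quad A_{1}^{\prime}={\scriptsize\left[\begin{array}{cccc|cccc|cccc}\times&\times&\times&\times&\times&0&0&0&\times&0&0&0\\ 0&\times&\times&\times&0&\times&0&0&0&\times&0&0\\ 0&\times&\times&\times&0&0&\times&0&0&0&\times&0\\ 0&0&0&\times&0&0&0&\times&0&0&0&\times\\ \hline 0&0&0&0&\times&\times&\times&\times&\times&0&0&0\\ 0&0&0&0&0&\times&\times&\times&0&\times&0&0\\ 0&0&0&0&0&\times&\times&\times&0&0&\times&0\\ 0&0&0&0&0&0&0&\times&0&0&0&\times\\ \hline 0&0&0&0&\times&0&0&0&\times&\times&\times&\times\\ 0&0&0&0&0&\times&0&0&0&\times&\times&\times\\ 0&0&0&0&0&0&\times&0&0&\times&\times&\times\\ 0&0&0&0&0&0&0&\times&0&0&0&\times\end{array}\right]}.\tag{2.9}$$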

The diagonal matrix at the (3,2) block disturbs the quasi-triangular structure of $A_{1}^{\prime}$ . More generally, assuming $n_{1}=n_{2}=n_{\min}$ the matrix $A_{1}^{\prime}$ is an $n_{\min}^{2}\times n_{\min}^{2}$ block upper triangular matrix with diagonal blocks of size at most $n_{\min}$ . This matrix can be returned to quasi-triangular form by computing a real Schur decomposition of $A_{1}^{\prime}$ . The impact of this operation on the overall cost of Algorithm 3 can be made negligible by exploiting the structure of $A_{1}^{\prime}$ :

• When the structure of $A_{1}^{\prime}$ is completely ignored, its real Schur decomposition takes $O(n_{\min}^{6})$ flops and, in turn, the complexity of Algorithm 3 increases to $O\big(n^{d+1}+n_{\min}^{3}n^{d}\big)$.

• When the block triangular structure of $A_{1}^{\prime}$ is taken into account, the cost of computing its real Schur decomposition reduces to $O(n_{\min}^{5})$ flops. When used within Algorithm 3, the additional flops spent on performing these decompositions and applying the resulting orthogonal transformations amount to $O\big(n_{\min}^{2}n^{d}\big)$ in total. In turn, this operation does not increase the complexity of Algorithm 3, but its dependence on $n_{\min}^{2}$ is not negligible either.

• The diagonal structure of the off-diagonal blocks of $A_{1}^{\prime}$ can be exploited to reduce the cost further, using a permutation trick similar to the one discussed in [19]. To illustrate this, consider the $12\times 12$ matrix $A_{1}^{\prime}$ from (2.9). By applying a perfect shuffle permutation [28] to the last $8$ rows and columns, we obtain the permuted matrix

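$$P^{T}A_{1}^{\prime}P={\scriptsize\left[\begin{array}{cccc|cc|cccc|cc}\times&\times&\times&\times&\times&\times&0&0&0&0&0&0\\ 0&\times&\times&\times&0&0&\times&\times&0&0&0&0\\ 0&\times&\times&\times&0&0&0&0&\times&\times&0&0\\ 0&0&0&\times&0&0&0&0&0&0&\times&\times\\ \hline 0&0&0&0&\times&\times&\times&0&\times&0&\times&0\\ 0&0&0&0&\times&\times&0&\times&0&\times&0&\times\\ \hline 0&0&0&0&0&0&\times&\times&\times&0&\times&0\\ 0&0&0&0&0&0&\times&\times&0&\times&0&\times\\ 0&0&0&0&0&0&\times&0&\times&\times&\times&0\\ 0&0&0&0&0&0&0&\times&\times&\times&0&\times\\ \hline 0&0&0&0&0&0&0&0&0&0&\times&\times\\ 0&0&0&0&0&0&0&0&0&0&\times&\times\end{array}\right]}.$$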

In the general case, applying such a permutation to each $n_{\min}\times n_{\min}$ diagonal block transforms $A_{1}^{\prime}$ into a block upper triangular matrix with diagonal blocks of size at most $4$ . This reduces the cost of computing its real Schur decomposition to $O(n_{\min}^{4})$ flops and the overall impact of this operation on the cost of Algorithm 3 becomes negligible.
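The effect of the perfect shuffle on the trailing $8\times 8$ block of the example can be reproduced numerically. The sketch below works on nonzero patterns only (entries are set to one), the helper name `perfect_shuffle` is hypothetical, and it covers just this one block rather than the general procedure.

```python
import numpy as np

def perfect_shuffle(s, t):
    """Index vector p with kron(S, T)[p][:, p] == kron(T, S) for S (s x s) and T (t x t)."""
    return np.arange(s * t).reshape(s, t).T.ravel()

# Nonzero pattern of the 4 x 4 quasi-triangular A_1 from the example.
A1 = np.array([[1.0, 1.0, 1.0, 1.0],
               [0.0, 1.0, 1.0, 1.0],
               [0.0, 1.0, 1.0, 1.0],
               [0.0, 0.0, 0.0, 1.0]])
# The 2 x 2 diagonal block of A_2 is full.
A22 = np.ones((2, 2))

# Trailing 8 x 8 block of A_1': diagonal blocks carry A_1, off-diagonal blocks are diagonal.
M = np.kron(np.eye(2), A1) + np.kron(A22, np.eye(4))

p = perfect_shuffle(2, 4)
Mp = M[np.ix_(p, p)]  # symmetric permutation of rows and columns

# After the shuffle the block equals kron(A1, I_2) + kron(I_4, A22), which is
# block upper triangular with diagonal blocks of size 2, 4 and 2.
print((Mp != 0).astype(int))
```

The shuffle swaps the roles of the two Kronecker factors, so the $2\times 2$ block of $A_{2}$ is repeated along the diagonal while the quasi-triangular pattern of $A_{1}$ determines the coarse block structure.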

2.4 Numerical experiments

All algorithms proposed in this work have been implemented in Matlab R2019a and executed on a Lenovo ThinkPad T460, which comes with an Intel Core i5-6300U processor and 8 Gbytes of DDR3L-RAM. The implementation of the algorithms together with scripts for reproducing each of the experiments reported in this work are available from https://anchp.epfl.ch/misc/.

Care has been taken to avoid unnecessary overhead in our Matlab implementation. For example, the tensor object from the Tensor Toolbox [1] is very convenient for realizing tensor operations but our preliminary experiments indicated that its use in Algorithms 2 and 3 would lead to significant performance loss, possibly due to excessive memory transfer. Instead, we directly use Matlab arrays, combined with the permute and reshape functions for implementing $\mu$ -mode matrix multiplications. For solving triangular Sylvester equations, as needed, e.g., in Algorithm 3, we utilize the internal Matlab function sylvester_tri. This function seems to be based on the algorithms presented in [12, 13] and avoids performing any additional Schur decomposition.

The techniques from Section 2.3, which allow for the use of real arithmetic in Algorithm 3, have been implemented and verified. However, we observed that none of the three described variants leads to competitive performance: any benefit from structure exploitation is offset by the overhead it incurs in Matlab, due to the relatively small values of $n_{\min}$ needed for reaching good performance. In the following, we therefore consistently use complex Schur decompositions for reducing all coefficients to triangular form. All reported times include the time needed by Algorithm 1 for performing these decompositions and applying the corresponding transformations. The coefficients used in our experiments have been generated with randn.

Choice of $n_{\min}$ .

Figure 1 shows the execution times obtained for fixed $n$ and varying $n_{\min}$. All numbers have been averaged over five consecutive runs. As is to be expected from the complexity estimates, the performance of Algorithm 2 is very sensitive to the choice of $n_{\min}$, especially for $d=4$. The smallest execution times are attained by $n_{\min}=7$ for $d=3$ and $n_{\min}=3$ for $d=4$. The performance of Algorithm 3 is not very sensitive to the choice of $n_{\min}$, provided that its value is not chosen too small. The smallest execution times are attained by $n_{\min}=26$ for $d=3$, $n_{\min}=18$ for $d=4$, and $n_{\min}=14$ for $d=5$. These values of $n_{\min}$ are used in the following.

Comparison.

We have compared our newly proposed algorithms with the following procedure termed “Sylvester solver”: After reducing the coefficients $A_{1},\ldots,A_{d}$ of the Laplace-like equation (2.2) to triangular form and reshaping B suitably into a matrix $B$ , one of the Sylvester equations

[TABLE]

is solved for $d=3,4,5$ by calling sylvester_tri. The results reported in Figure 2 confirm that Algorithms 2 and 3 have the same asymptotic cost. However, Algorithm 3 is always faster, by an order of magnitude for sufficiently large $n$. For $d=3$, the Sylvester solver is nearly always slower than Algorithm 3. For $d=4$, the picture is less clear: only for $n\geq 50$ does Algorithm 3, which has complexity $O(n^{5})$, become consistently faster than the Sylvester solver, which has complexity $O(n^{6})$. For $d=5$, the difference in complexity is more pronounced and, in turn, Algorithm 3 is nearly always faster.
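To illustrate the reshaping behind such a Sylvester solver, the following NumPy sketch treats a Laplace-like equation with $d=3$ and groups the first two modes into one. Which modes are grouped is a choice made here for illustration and need not match the grouping used in the experiments. The diagonal shifts and all variable names are likewise assumptions, and the reference solution comes from a dense solve rather than from sylvester_tri.

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2, n3 = 3, 4, 2
# Diagonal shifts keep the operator safely nonsingular (illustrative choice).
A1 = rng.standard_normal((n1, n1)) + 8 * np.eye(n1)
A2 = rng.standard_normal((n2, n2)) + 8 * np.eye(n2)
A3 = rng.standard_normal((n3, n3)) + 8 * np.eye(n3)
B = rng.standard_normal((n1, n2, n3))
I1, I2, I3 = np.eye(n1), np.eye(n2), np.eye(n3)

# Reference solution from the full Kronecker matrix (column-major vec).
K = (np.kron(A3, np.kron(I2, I1)) + np.kron(I3, np.kron(A2, I1))
     + np.kron(I3, np.kron(I2, A1)))
X = np.linalg.solve(K, B.reshape(-1, order="F")).reshape(B.shape, order="F")

# Group modes 1 and 2: M Xm + Xm A3^T = Bm is an ordinary Sylvester equation.
M = np.kron(A2, I1) + np.kron(I2, A1)
Xm = X.reshape(n1 * n2, n3, order="F")
Bm = B.reshape(n1 * n2, n3, order="F")
residual = np.linalg.norm(M @ Xm + Xm @ A3.T - Bm)
```

The left coefficient `M` has size $n_{1}n_{2}$, which is why the cost and the memory footprint of this approach grow quickly with $d$.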

For all experiments performed, the norm of the residual was checked and no significant differences in terms of numerical stability were observed between the different algorithms tested.

3 A recursive blocked algorithm for generalized Sylvester equations with Kronecker structure

In this section, we extend the developments from Section 2 to the second class of operators ${\mathcal{A}}$ considered in this work, which have the matrix representation (1.5). The corresponding linear system reads in tensor notation as

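${\textbf{X}}\times_{1}A_{1}+{\textbf{X}}\times_{1}C\times_{2}A_{2}\times_{3}\cdots\times_{d}A_{d}={\textbf{B}}.$ (3.1)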

Because of its connection to generalized Sylvester equations [4] explained in the introduction, this equation has a unique solution if and only if the matrix pencil $A_{1}+\lambda C$ is regular and none of its eigenvalues is an eigenvalue of $-A_{d}\otimes A_{d-1}\otimes\cdots\otimes A_{2}$ . In the following, we assume that this condition is satisfied.

Algorithm 4 is the equivalent of Algorithm 1 for reducing (3.1) to quasi-triangular form. The most notable difference is that now a generalized Schur decomposition [8, Sec. 7.7.2] of $A_{1}+\lambda C$ needs to be computed, using the QZ algorithm.

3.1 Recursion

The rest of this section is concerned with line 4 of Algorithm 4, solving (3.1) with upper quasi-triangular coefficients $A_{1}\in{\mathbb{R}}^{n_{1}\times n_{1}},\ldots,A_{d}\in{\mathbb{R}}^{n_{d}\times n_{d}}$ and upper triangular $C\in{\mathbb{R}}^{n_{1}\times n_{1}}$ . Again we proceed recursively and choose $\mu$ such that $n_{\mu}=\max_{\nu}n_{\nu}$ and $k$ such that $k\approx n_{\mu}/2$ and $A_{\mu}(k+1,k)=0$ . We partition $A_{\mu}=\begin{pmatrix}A_{\mu,11}&A_{\mu,12}\\ 0&A_{\mu,22}\end{pmatrix}$ , $A_{\mu,11}\in{\mathbb{R}}^{k\times k}$ and split the tensors X and B along their $\mu$ th mode into ${\textbf{X}}_{1},{\textbf{X}}_{2}$ and ${\textbf{B}}_{1},{\textbf{B}}_{2}$ , respectively, in accordance with (2.4).
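The choice of the split index $k$ can be sketched as follows. The helper name `split_index` is hypothetical, indices are 0-based (so the condition $A_{\mu}(k+1,k)=0$ becomes `A[k, k-1] == 0`), and the sketch assumes an upper quasi-triangular matrix of size at least 3.

```python
import numpy as np

def split_index(A):
    """Return k close to n/2 such that A[k, k-1] == 0 (0-based indices).

    Assumes A is upper quasi-triangular with n >= 3. Since 2x2 diagonal
    blocks never overlap, moving the cut by one position is always enough.
    """
    n = A.shape[0]
    k = n // 2
    if A[k, k - 1] != 0:          # the cut would go through a 2x2 block
        k += 1
    return k

A = np.triu(np.arange(1.0, 26.0).reshape(5, 5))
A[2, 1] = -1.0                    # 2x2 diagonal block in rows/columns 1 and 2
k = split_index(A)
A11, A12, A22 = A[:k, :k], A[:k, k:], A[k:, k:]
```

In this example the midpoint would cut through the 2x2 block, so the split moves from 2 to 3 and the lower-left block `A[k:, :k]` is zero.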

Case 1: $\mu=1$ . We additionally partition $C=\begin{pmatrix}C_{11}&C_{12}\\ 0&C_{22}\end{pmatrix}$ and decouple (3.1) along the first mode:

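${\textbf{X}}_{1}\times_{1}A_{1,11}+{\textbf{X}}_{1}\times_{1}C_{11}\times_{2}A_{2}\times_{3}\cdots\times_{d}A_{d}=\hat{\textbf{B}}_{1},$

${\textbf{X}}_{2}\times_{1}A_{1,22}+{\textbf{X}}_{2}\times_{1}C_{22}\times_{2}A_{2}\times_{3}\cdots\times_{d}A_{d}={\textbf{B}}_{2},$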

with $\hat{\textbf{B}}_{1}:={\textbf{B}}_{1}-{\textbf{X}}_{2}\times_{1}A_{1,12}-{\textbf{X}}_{2}\times_{1}C_{12}\times_{2}A_{2}\times_{3}\cdots\times_{d}A_{d}$ . Both equations take the form of the tensor equation (3.1) with (quasi-)triangular coefficients. We recursively solve for ${\textbf{X}}_{2}$ and then solve for ${\textbf{X}}_{1}$ , after computing $\hat{\textbf{B}}_{1}$ .

Case 2: $\mu\neq 1$ . Decoupling (3.1) along the $\mu$ th mode gives the two tensor equations

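${\textbf{X}}_{1}\times_{1}A_{1}+{\textbf{X}}_{1}\times_{1}C\times_{2}\cdots\times_{\mu}A_{\mu,11}\times_{\mu+1}\cdots\times_{d}A_{d}=\hat{\textbf{B}}_{1},$

${\textbf{X}}_{2}\times_{1}A_{1}+{\textbf{X}}_{2}\times_{1}C\times_{2}\cdots\times_{\mu}A_{\mu,22}\times_{\mu+1}\cdots\times_{d}A_{d}={\textbf{B}}_{2},$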

with $\hat{\textbf{B}}_{1}:={\textbf{B}}_{1}-{\textbf{X}}_{2}\times_{1}C\times_{2}\cdots\times_{\mu}A_{\mu,12}\times_{\mu+1}\cdots\times_{d}A_{d}$ . Again, we first solve for ${\textbf{X}}_{2}$ and then for ${\textbf{X}}_{1}$ .

Algorithm 5 summarizes the described procedure. Compared to Algorithm 2, the largest difference is that the right-hand side updates in lines 7 and 11 require up to $d$ matrix multiplications instead of only one. While potentially having an impact on computational time, this has no impact on the asymptotic complexity, which remains $O\big{(}n^{d+1}+n_{\min}^{d}n^{d}\big{)}$ .
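The two cases of the recursion can be sketched in NumPy as follows. The sketch assumes that (3.1) has the form ${\textbf{X}}\times_{1}A_{1}+{\textbf{X}}\times_{1}C\times_{2}A_{2}\cdots\times_{d}A_{d}={\textbf{B}}$ , as suggested by the update formulas for $\hat{\textbf{B}}_{1}$ , and that all coefficients are upper triangular, so any split index is admissible. The small problems are solved with a dense Kronecker matrix, which is a stand-in chosen for brevity. All function names are hypothetical, and this is not the authors' implementation.

```python
import numpy as np

def mode_mult(X, M, mu):
    """X x_mu M for a 0-based mode index mu."""
    return np.moveaxis(np.tensordot(M, X, axes=(1, mu)), 0, mu)

def coupled_term(X, A, C):
    """X x_1 C x_2 A[1] ... x_d A[d-1] (A[0] is not used)."""
    Y = mode_mult(X, C, 0)
    for mu in range(1, len(A)):
        Y = mode_mult(Y, A[mu], mu)
    return Y

def solve_dense(A, C, B):
    """Small case: solve via the explicit Kronecker matrix (column-major vec)."""
    K2 = C
    for mu in range(1, len(A)):
        K2 = np.kron(A[mu], K2)                 # A_d kron ... kron A_2 kron C
    K1 = np.kron(np.eye(B.size // B.shape[0]), A[0])
    x = np.linalg.solve(K1 + K2, B.reshape(-1, order="F"))
    return x.reshape(B.shape, order="F")

def solve_recursive(A, C, B, n_min=2):
    n = B.shape
    mu = int(np.argmax(n))                      # split the largest mode
    if n[mu] <= n_min:
        return solve_dense(A, C, B)
    k = n[mu] // 2                              # any k works for triangular A[mu]
    B1, B2 = np.split(B, [k], axis=mu)
    A11, A12, A22 = list(A), list(A), list(A)
    A11[mu], A12[mu], A22[mu] = A[mu][:k, :k], A[mu][:k, k:], A[mu][k:, k:]
    if mu == 0:                                 # Case 1: C is partitioned as well
        X2 = solve_recursive(A22, C[k:, k:], B2, n_min)
        B1_hat = B1 - mode_mult(X2, A12[0], 0) - coupled_term(X2, A, C[:k, k:])
        X1 = solve_recursive(A11, C[:k, :k], B1_hat, n_min)
    else:                                       # Case 2: only A[mu] is partitioned
        X2 = solve_recursive(A22, C, B2, n_min)
        B1_hat = B1 - coupled_term(X2, A12, C)
        X1 = solve_recursive(A11, C, B1_hat, n_min)
    return np.concatenate([X1, X2], axis=mu)

rng = np.random.default_rng(2)
shape = (5, 4, 3)

def random_triangular(m):
    T = np.triu(rng.standard_normal((m, m)), 1)
    return T + np.diag(rng.uniform(1.0, 2.0, m))   # positive diagonal: nonsingular system

A = [random_triangular(m) for m in shape]
C = random_triangular(shape[0])
B = rng.standard_normal(shape)
X = solve_recursive(A, C, B)
residual = np.linalg.norm(mode_mult(X, A[0], 0) + coupled_term(X, A, C) - B)
```

Note that the right-hand side update in Case 2 calls `coupled_term`, which performs one mode multiplication per dimension, in line with the remark on up to $d$ matrix multiplications.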

3.2 Merging dimensions: triangular case

In analogy to the discussion in Section 2.2, we now discuss the combination of Algorithm 5 with a merging procedure that helps to alleviate the critical dependence of its performance on $n_{\min}$. Again, we first suppose that all coefficients are triangular. This can always be achieved by a variant of Algorithm 4 that uses complex (generalized) Schur decompositions.

Line 2 of Algorithm 5 is replaced with the following procedure. When $n_{d-1}n_{d}\leq n_{\min}^{2}$ , the matrix

$A^{\prime}_{d-1}:=A_{d}\otimes A_{d-1}$

is formed explicitly. In turn, the $d$ -dimensional tensor equation (3.1) can equivalently be viewed as the $(d-1)$ -dimensional equation

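${\textbf{X}}^{\prime}\times_{1}A_{1}+{\textbf{X}}^{\prime}\times_{1}C\times_{2}A_{2}\times_{3}\cdots\times_{d-2}A_{d-2}\times_{d-1}A^{\prime}_{d-1}={\textbf{B}}^{\prime}$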

with reshaped ${\textbf{X}}^{\prime},{\textbf{B}}^{\prime}\in{\mathbb{R}}^{n_{1}\times\cdots\times n_{d-2}\times n_{d-1}n_{d}}$ . For $d=2$ , this corresponds to the triangular generalized Sylvester equation $A_{1}X+CXA_{2}^{T}=B$ , for which a recursive blocked algorithm has been described in [13].
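The effect of merging the last two modes can be checked numerically. The NumPy sketch below uses $d=3$ and the merged matrix $A_{d}\otimes A_{d-1}$ , with a column-major reshape as performed by Matlab's reshape. The variable names are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n1, n2, n3 = 3, 2, 4
X = rng.standard_normal((n1, n2, n3))
A2 = rng.standard_normal((n2, n2))
A3 = rng.standard_normal((n3, n3))

# X x_2 A2 x_3 A3, written as an explicit index contraction.
Y = np.einsum("jb,kc,abc->ajk", A2, A3, X)

# Merge the last two modes (column-major, as Matlab's reshape does).
A_merged = np.kron(A3, A2)
X_merged = X.reshape(n1, n2 * n3, order="F")
Y_merged = X_merged @ A_merged.T          # X' x_2 A'
Y_expected = Y.reshape(n1, n2 * n3, order="F")
```

Both routes give the same array, so the two mode multiplications with $A_{d-1}$ and $A_{d}$ collapse into a single multiplication with the merged matrix.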

A straightforward extension of the complexity analysis of Algorithm 3 shows that Algorithm 6 requires $O\big(n^{d+1}+n_{\min}^{2}n^{d}\big)$ flops.

3.3 Merging dimensions: quasi-triangular case

When using real (generalized) Schur decompositions and, in turn, dealing with upper quasi-triangular coefficients $A_{1},\ldots,A_{d}$, we are facing a situation similar to the one discussed in Section 2.3: The merged coefficient matrix $A^{\prime}_{d-1}=A_{d}\otimes A_{d-1}$ is, in general, not quasi-triangular. The structure of $A^{\prime}_{d-1}$ is very similar, but not identical, to that in the Laplace-like case. For example, comparing (2.9) with

[TABLE]

we see that the off-diagonal blocks now have quasi-triangular instead of diagonal structure. Nevertheless, the properties and techniques discussed in Section 2.3 carry over verbatim to $A_{d-1}^{\prime}$. In particular, $A_{d-1}^{\prime}$ is a block upper triangular matrix with diagonal blocks of size at most $2n_{d-1}$. Moreover, a perfect shuffle permutation of the diagonal blocks can again be used to further reduce the size of diagonal blocks. For example, applying this permutation to the second diagonal block of the matrix in (3.2) yields:

[TABLE]

This modification allows Algorithm 6 to be applied to quasi-triangular matrices without increased complexity.
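The effect of a perfect shuffle on one diagonal block can be sketched as follows. The example assumes a diagonal block of the form $G\otimes T$ , where $G$ stands for a $2\times 2$ diagonal block of $A_{d}$ and $T$ is a triangular stand-in for $A_{d-1}$ . The helper `perfect_shuffle` and the chosen sizes are illustrative assumptions.

```python
import numpy as np

def perfect_shuffle(p, q):
    """Indices s with np.kron(G, T)[np.ix_(s, s)] == np.kron(T, G), G p-by-p, T q-by-q."""
    # Position j*p + i of the shuffled ordering takes the entry at i*q + j.
    return np.arange(p * q).reshape(p, q).T.reshape(-1)

rng = np.random.default_rng(4)
G = rng.standard_normal((2, 2))             # a 2x2 diagonal block of A_d
T = np.triu(rng.standard_normal((3, 3)))    # triangular stand-in for A_{d-1}
block = np.kron(G, T)                       # full 6x6 diagonal block
s = perfect_shuffle(2, 3)
shuffled = block[np.ix_(s, s)]              # block upper triangular, 2x2 diagonal blocks
```

After the symmetric permutation the full $6\times 6$ block becomes $T\otimes G$ , whose diagonal blocks are only $2\times 2$ . With quasi-triangular $T$ , the diagonal blocks would have size at most 4.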

3.4 Numerical experiments

To give some insight into the performance of Algorithms 5 and 6, we have implemented them in Matlab and conducted numerical experiments in the setting described in Section 2.4. In particular, we again make use of complex (generalized) Schur decompositions, so that the overhead incurred by the techniques described in Section 3.3 does not distort the picture. To solve the triangular generalized Sylvester equation in Line 5 of Algorithm 6, we apply sylvester_tri to $A_{1}{\textbf{X}}^{\prime}E^{T}+C{\textbf{X}}^{\prime}(A_{2}^{\prime})^{T}=B^{\prime}$ with $E=I_{n_{2}}$.
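The form of this generalized Sylvester equation can be checked on a small example through its Kronecker representation. The NumPy sketch below is only a consistency check with a dense solve and does not reflect how sylvester_tri works internally. The sizes, the helper `triangular`, and the name `A2m` for the merged coefficient are assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n1, m = 3, 4                                 # m plays the role of the merged size

def triangular(k):
    # Upper triangular with positive diagonal, so the system below is nonsingular.
    return np.triu(rng.standard_normal((k, k)), 1) + np.diag(rng.uniform(1.0, 2.0, k))

A1, C, A2m = triangular(n1), triangular(n1), triangular(m)
E = np.eye(m)
B = rng.standard_normal((n1, m))

# vec(A1 X E^T + C X A2m^T) = (E kron A1 + A2m kron C) vec(X), column-major vec.
K = np.kron(E, A1) + np.kron(A2m, C)
X = np.linalg.solve(K, B.reshape(-1, order="F")).reshape(n1, m, order="F")
residual = np.linalg.norm(A1 @ X @ E.T + C @ X @ A2m.T - B)
```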

Choice of $n_{\min}$ .

Figure 3 shows the performance of Algorithms 5 and 6 with respect to the choice of $n_{\min}$ . Compared with Algorithms 2 and 3, see Figure 1, the findings do not differ much. In the following we set $n_{\min}=8$ for $d=3$ , $n_{\min}=6$ for $d=4$ when using Algorithm 5, and $n_{\min}=15$ for $d=3$ , $n_{\min}=13$ for $d\geq 4$ when using Algorithm 6.

Comparison.

Figure 4 compares the performance of Algorithm 5 and Algorithm 6 with the following “Sylvester solver”: After reducing the coefficients $A_{1},\ldots,A_{d},C$ to triangular form and suitably reshaping B, one of the Sylvester equations

[TABLE]

is solved for $d=3,4,5$ by calling sylvester_tri. The results from Figure 4 show that Algorithm 6 is always faster than Algorithm 5. The Sylvester solver is slower for sufficiently large $n$; the difference is most pronounced for $d=3$. Moreover, the Sylvester solver encounters out-of-memory errors for $n>110$, $n>50$, $n>20$ for $d=3,4,5$, respectively.

4 Conclusions, extensions and future work

We have extended the concept of blocked recursive algorithms to higher-order tensor equations. Both the complexity estimates and the numerical results clearly show the importance of combining recursion with merging dimensions in order to arrive at efficient algorithms. For third-order tensor equations, these algorithms seem to constitute the methods of choice. For fourth-order tensor equations with coefficients of nearly equal sizes, reshaping the tensor equation into a Sylvester equation and applying an existing solver is a viable alternative, provided that sufficient memory is available.

The blocked recursive algorithms developed in this work certainly admit extensions to general linear tensor equations taking the form

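$\sum_{k=1}^{K}{\textbf{X}}\times_{1}A_{1}^{(k)}\times_{2}A_{2}^{(k)}\times_{3}\cdots\times_{d}A_{d}^{(k)}={\textbf{B}},$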

assuming that all coefficients $A_{\mu}^{(k)}\in{\mathbb{R}}^{n_{\mu}\times n_{\mu}}$ are (quasi-)triangular. Transforming general coefficients $A_{\mu}^{(k)}$ into this form requires the existence of invertible matrices $Q_{\mu},Z_{\mu}$ such that $Q_{\mu}^{T}A_{\mu}^{(k)}Z_{\mu}$ is (quasi-)triangular for every $k=1,\ldots,K$. For $K\geq 3$, this simultaneous triangularization is only possible under strong additional assumptions on the coefficients. A sufficient condition is that each matrix family $\{A_{\mu}^{(1)},\ldots,A_{\mu}^{(K)}\}$ contains at most two different matrices for $\mu=1,\ldots,d$. The two classes (1.4) and (1.5) appear to constitute the practically most important examples satisfying this condition.

This work also raises an interesting open question: Is it possible to combine block recursion with low-rank compression, for example in the tensor train format [21], such that the complexity does not grow exponentially with $d$ , assuming that the involved ranks stay constant? It would also be interesting to explore which other numerical linear algebra problems allow for the combination of Kronecker product structure with block recursion. The computation of certain matrix functions, such as the matrix square root [6], appears to be a likely candidate.

Acknowledgements.

Daniel Kressner sincerely thanks Michael Steinlechner and Christine Tobler for insightful discussions on the algorithms presented in this work and their implementation.

Bibliography

[1] B. W. Bader, T. G. Kolda, et al. Matlab tensor toolbox version 2.6. Available from http://www.sandia.gov/~tgkolda/TensorToolbox/, 2015.
[2] R. H. Bartels and G. W. Stewart. Algorithm 432: The solution of the matrix equation $AX+XB=C$. Communications of the ACM, 15(9):820–826, 1972.
[3] A. Binning. Solving second and third-order approximations to DSGE models: A recursive Sylvester equation solution. Norges Bank Working Paper 18, 2013.
[4] E. K.-W. Chu. The solution of the matrix equations $AXB-CXD=E$ and $(YA-DZ,YC-BZ)=(E,F)$. Linear Algebra Appl., 93:93–105, 1987.
[5] T. Dayar. Kronecker modeling and analysis of multidimensional Markovian systems. Springer, Cham, 2018.
[6] E. Deadman, N. J. Higham, and R. Ralha. Blocked Schur algorithms for computing the matrix square root, pages 171–182. Lecture Notes in Comput. Sci. 7782. Springer, 2013.
[7] E. Elmroth, F. Gustavson, I. Jonsson, and B. Kågström. Recursive blocked algorithms and hybrid data structures for dense matrix library software. SIAM Rev., 46(1):3–45, 2004.
[8] G. H. Golub and C. F. Van Loan. Matrix computations. Johns Hopkins University Press, Baltimore, MD, fourth edition, 2013.
