Recursive blocked algorithms for linear systems with Kronecker product structure
Minhong Chen, Daniel Kressner

TL;DR
This paper extends recursive blocked algorithms to higher-dimensional Sylvester-like equations, enabling efficient solutions for PDE discretizations and economic models, outperforming existing methods.
Contribution
It introduces a novel recursive algorithm that handles higher-dimensional Kronecker-structured equations more efficiently than previous approaches.
Findings
Algorithm outperforms existing Sylvester solvers
Efficiently solves PDE discretizations with separable coefficients
Applicable to macroeconomic model approximations
Abstract
Recursive blocked algorithms have proven to be highly efficient at the numerical solution of the Sylvester matrix equation and its generalizations. In this work, we show that these algorithms extend in a seamless fashion to higher-dimensional variants of generalized Sylvester matrix equations, as they arise from the discretization of PDEs with separable coefficients or the approximation of certain models in macroeconomics. By combining recursions with a mechanism for merging dimensions, an efficient algorithm is derived that outperforms existing approaches based on Sylvester solvers.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMatrix Theory and Algorithms · Tensor decomposition and applications · Model Reduction and Neural Networks
Recursive blocked algorithms for linear systems
with Kronecker product structure
Minhong Chen111Department of Mathematics, Zhejiang Sci-Tech University, Hangzhou, 310029, Zhejiang, P.R.China, [email protected]. The work of this author was supported by the National Natural Science Foundation of China (Grant No. 11801513).
Daniel Kressner222Institute of Mathematics, EPF Lausanne, 1015 Lausanne, Switzerland, [email protected].
Abstract
Recursive blocked algorithms have proven to be highly efficient at the numerical solution of the Sylvester matrix equation and its generalizations. In this work, we show that these algorithms extend in a seamless fashion to higher-dimensional variants of generalized Sylvester matrix equations, as they arise from the discretization of PDEs with separable coefficients or the approximation of certain models in macroeconomics. By combining recursions with a mechanism for merging dimensions, an efficient algorithm is derived that outperforms existing approaches based on Sylvester solvers.
1 Introduction
In computations with matrices, recursive blocked algorithms offer an elegant way to arrive at implementations that benefit from increased data locality and efficiently utilize highly tuned kernels. See [7] for a survey and [22] for a more recent testimony of this principle. These algorithms have proven particularly effective for solving Sylvester equations, that is, matrix equations of the form
[TABLE]
where , are given and is unknown. In the Bartels-Stewart algorithm [2], the matrices and are first reduced to block upper form by real Schur decompositions. The reduced problem is then solved by a variant of backward substitution. Both stages of the algorithms require operations, with . Entirely consisting of level 2 BLAS operations, the backward substitution step performs quite poorly. To avoid this, Jonsson and Kågström [12, 13] have proposed recursive algorithms for triangular Sylvester and related matrix equations. The recursive algorithm for solving (1.1) with upper quasi-triangular starts with partitioning the matrix of larger size. Assuming , let with such that and partition , correspondingly. Then (1.1) becomes equivalent to
[TABLE]
First the Sylvester equation (1.2b) is solved recursively, then the right-hand side (1.2a) is updated, and finally (1.2a) is solved recursively. Apart from the solution of small-sized Sylvester equations at the lowest recursion level, the entire algorithm consists of matrix-matrix multiplications and thus attains high performance by leveraging level 3 BLAS. As emphasized in [7, 22], recursive algorithm are less sensitive to parameter tuning compared to blocked algorithms.
The described algorithm extends to generalized and coupled Sylvester equations, such as ; see [13, 23]. Interestingly, the numerically stable recursive formulation of Hammarling’s method [11] for solving stable Lyapunov equations remains an open problem [16].
In this paper, we propose several new extensions that address high-dimensional variants of Sylvester equations. More specifically, we aim at computing a tensor satisfying the linear equation
[TABLE]
where is a linear operator and . For , this formulation includes the Sylvester equation (1.1) and its generalizations mentioned above as special cases. For example, for (1.1) the matrix representation of is given by .
The operator needs to be of a very particular form such that (1.3) is amenable to the techniques discussed in this work. Motivated by their relevance in applications, we focus on two classes of operators.
Linear systems with Laplace-like structure.
In Section 2, we consider discrete Laplace-like operators having the matrix representation
[TABLE]
with , . Using the vectorization of tensors, (1.3) can equivalently be written as . Discrete Laplace-like operators arise from the structured discretization of -dimensional PDEs with separable coefficients on tensorized domains. For more general PDEs, matrices of the form (1.4) can sometimes be used to construct effective preconditioners; see [24, 25] for examples. Other applications of (1.4) arise from Markov chain models [5, 26] used, e.g., for simulating interconnected systems.
Generalized Sylvester equations with Kronecker structure.
Section 3 is concerned with the second class of operators considered in this work, which have a matrix representation of the form
[TABLE]
with for and . For , the linear system (1.3) now becomes equivalent to the generalized Sylvester equation . For , we can view (1.3) equivalently as a generalized Sylvester equations with coefficients that feature Kronecker structure. If for some then
[TABLE]
Linear systems featuring such shifted Kronecker products have been discussed in [19]. The more general case (1.5) arises from approximations of discrete time DSGE models [3], which play a central role in macroeconomics.
Recent work on the solution of linear tensor equations (1.3) has focused on the development of highly efficient approximate and iterative solvers that assume and exploit low-rank tensor structure in the right-hand side and the solution; see [9, 10] for overviews. In some cases, these developments can be combined with the methods developed in this work, which do not assume any such structure. For example, if the tensor Krylov subspace method [17] is applied to (1.4) for large-scale coefficients then our method can be used to solve the smaller-sized linear systems occurring in the method. As far as we know, all existing direct non-iterative solvers for linear tensor equations combine the Bartels-Stewart method for (generalized) Sylvester equations with a recursive traversal of the dimension. Instances of this approach can be found in [18, 27] for (1.4), in [14] for (1.5), and in [19] for (1.6). For , we are not aware of any work on (recursive) blocked methods that would allow for the effective use of level 3 BLAS.
2 A recursive blocked algorithm for Laplace-like equations
Let us first recall two basic operations for tensors from [15]. The th matricization of a tensor is the matrix obtained by mapping the th index to the rows and all other indices to the columns:
[TABLE]
with the column index defined via the index map
[TABLE]
The -mode matrix multiplication of X with a matrix is the tensor satisfying . This allows us to rewrite (1.3)–(1.4) as
[TABLE]
It is well known that this equation has a unique solution if and only if for any eigenvalues of , of , etc. In the following, we will assume that this condition is satisfied.
Algorithm 1 describes our general framework for solving (2.2). Using real Schur decompositions [8, Sec. 7.4], the coefficient matrices are first transformed to reduced form. More specifically, for each an orthogonal matrix is computed such that is in upper quasi-triangular form, that is, is an upper block triangular matrix with blocks containing its real eigenvalues and blocks containing its complex eigenvalues in conjugate pairs. The right-hand side and the solution tensor need to be transformed accordingly by -mode matrix multiplications. For the rest of this section, we focus on line 3 of Algorithm 1, that is, the solution of the tensor equation with the reduced coefficients.
2.1 Recursion
By Algorithm 1, we may assume that are already in upper quasi-triangular form. Choose such that and such that and . Partitioning with , equation (2.2) becomes equivalent to
[TABLE]
where
[TABLE]
and are defined analogously. Noting that (2.3b) and (2.3a) are again equations with Laplace-like operators, they can be solved recursively. The recursion is stopped once the maximal size is below a user-specified block size . These considerations lead to Algorithm 2.
Let denote the complexity of Algorithm 2 for even . On the top level of recursion Algorithm 2 is applied to one tensor, on the second level to two tensors, on the third level to four tensors, and so on. Under the slightly simplified assumption that the multiplication of an quasi-triangular matrix with a vector requires floating point operations (flops), each level of the first recursions requires a total of flops to execute the matrix-matrix multiplications in line 6 of Algorithm 2. After recursions of Algorithm 2, has been reduced to in each mode and, therefore, Assuming that is a power of two, we obtain
[TABLE]
Once the maximal size of the tensor is or below, line 2 of Algorithm 2 assembles the matrix defined in (1.4) and solves the block triangular linear system by backward substitution. This requires O\big{(}(n_{\min})^{2d}\big{)} flops and therefore
[TABLE]
This compares favorably with the operations needed by backward substitution applied to the assembled full triangular linear system. The complexity estimate (2.6) also reflects the critical role played by the solution of the small systems in line 2. On the one hand, the operation count suggests to choose as small as possible, say, . On the other hand, it has been observed for in [12] that a small value of creates significant overhead and requires very well tuned kernels. In the following section, we describe a technique that alleviates this difficulty.
2.2 Merging dimensions: triangular case
To avoid the critical dependence on observed in (2.6) we replace line 2 of Algorithm 2 by the following procedure. Once , the matrix
[TABLE]
is formed explicitly. For the moment, let us suppose that and are upper triangular. This can be achieved by computing complex instead of real Schur decompositions in Algorithm 1, leading to a triangular tensor equation with complex coefficients. Because of roundoff error, the computed solution to the original equation will now have a (small) imaginary part. This can be safely set to zero [20].
The matrix inherits the triangular structure from and the -dimensional tensor equation (2.2) is equivalent to the -dimensional equation
[TABLE]
with reshaped . This equation is solved recursively. A major advantage, this approach allows us to reduce . For , the system (2.8) becomes the triangular Sylvester equation
[TABLE]
to which the efficient solvers described in Section 1 can be applied. Note that now refers to the complex transpose of . Algorithm 3 summarizes the proposed procedure.
To analyze the complexity of Algorithm 3 for , we observe that all sizes are first reduced to or below before the condition in line 1 is met. Hence, up to constant factors the recursive estimate (2.5) holds and it remains to discuss the complexity for , which will be denoted by . The merge in line 2 reduces the order to but increases the first mode size to . Approximately recursions are needed to reduce it back to . Similarly as in Section 2.1 we calculate
[TABLE]
For , the solution of the triangular Sylvester equation in line 5 requires O\big{(}n_{\min}^{5}\big{)} flops. In turn, \overline{\mathsf{comp}}_{d}(n_{\min})=O\big{(}n_{\min}^{d+2}\big{)}. Inserted into (2.5), we arrive at
[TABLE]
flops for Algorithm 3. For , this compares favorably with the complexity estimate (2.6) for Algorithm 2; the dependence on has been reduced significantly. Equally importantly, Algorithm 3 allows us to leverage efficient solvers for triangular Sylvester equations, such as the ones described in [12].
2.3 Merging dimensions: quasi-triangular case
The use of complex arithmetic, which increases the cost (by a constant factor) in terms of operations and memory, can be avoided when using the real Schur form and working with quasi-triangular coefficients. However, a few modifications are needed because the matrix formed in (2.7) does not inherit the quasi-triangular structure from and . To illustrate what happens, let us consider the following example for :
[TABLE]
The diagonal matrix at the (3,2) block disturbs the quasi-triangular structure of . More generally, assuming the matrix is an block upper triangular matrix with diagonal blocks of size at most . This matrix can be returned to quasi-triangular form by computing a real Schur decomposition of . The impact of this operation on the overall cost of Algorithm 3 can be made negligible by exploiting the structure of :
- •
When the structure of is completely ignored, its real Schur decomposition takes flops and, in turn, the complexity of Algorithm 3 increases to O\big{(}n^{d+1}+n_{\min}^{3}n^{d}\big{)}.
- •
When the block triangular structure of is taken into account, the cost of computing its real Schur decomposition reduces to flops. When used within Algorithm 3, the additional flops spent on performing these decompositions and applying the resulting orthogonal transformations amounts to O\big{(}n_{\min}^{2}n^{d}\big{)} in total. In turn, this operation does not increase the complexity of Algorithm 3 but its dependence on is not negligible either.
- •
The diagonal structure of the off-diagonal blocks of can be exploited to reduce the cost further, using a permutation trick similar to the one discussed in [19]. To illustrate this, consider the matrix from (2.9). By applying a perfect shuffle permutation [28] to the last rows and columns, we obtain the permuted matrix
[TABLE]
In the general case, applying such a permutation to each diagonal block transforms into a block upper triangular matrix with diagonal blocks of size at most . This reduces the cost of computing its real Schur decomposition to flops and the overall impact of this operation on the cost of Algorithm 3 becomes negligible.
2.4 Numerical experiments
All algorithms proposed in this work have been implemented in Matlab R2019a and executed on a Lenovo ThinkPad T460, which comes with an Intel Core i5-6300U processor and 8 Gbytes of DDR3L-RAM. The implementation of the algorithms together with scripts for reproducing each of the experiments reported in this work are available from https://anchp.epfl.ch/misc/.
Care has been taken to avoid unnecessary overhead in our Matlab implementation. For example, the tensor object from the Tensor Toolbox [1] is very convenient for realizing tensor operations but our preliminary experiments indicated that its use in Algorithms 2 and 3 would lead to significant performance loss, possibly due to excessive memory transfer. Instead, we directly use Matlab arrays, combined with the permute and reshape functions for implementing -mode matrix multiplications. For solving triangular Sylvester equations, as needed, e.g., in Algorithm 3, we utilize the internal Matlab function sylvester_tri. This function seems to be based on the algorithms presented in [12, 13] and avoids performing any additional Schur decomposition.
The techniques from Section 2.3, which allow for the use of real arithmetic in Algorithm 3, have been implemented and verified. However, we observed that none of the three described variants leads to competitive performance, any benefit from structure exploitation is offset by the overhead it incurs in Matlab, due to the relatively small values of needed for reaching good performance. In the following, we therefore consistently use complex Schur decompositions for reducing all coefficients to triangular form. All reported times include the time needed by Algorithm 1 for performing these decompositions and applying the corresponding transformations. The coefficients used in our experiments have been generated with randn.
Choice of .
Figure 1 shows the execution times obtained for fixed and varying . All numbers have been averaged over five consecutive runs. As to be expected from the complexity estimates, the performance of Algorithm 2 is very sensitive to the choice of , especially for . The smallest execution times are attained by for and for . The performance of Algorithm 3 is not very sensitive to the choice of , provided that its value is not chosen too small. The smallest execution times are attained by for , for , and for . These values of are used in the following.
Comparison.
We have compared our newly proposed algorithms with the following procedure termed “Sylvester solver”: After reducing the coefficients of the Laplace-like equation (2.2) to triangular form and reshaping B suitably into a matrix , one of the Sylvester equations
[TABLE]
is solved for by calling sylvester_tri. The results reported in Figure 2 confirm that Algorithms 2 and 3 have the same asymptotic cost. However, Algorithm 3 is always faster, by an order of magnitude for sufficiently large . For , the Sylvester solver is nearly always slower than Algorithm 3. For , the picture is less clear; only for becomes Algorithm 3, which has complexity , consistently faster than the Sylvester solver, which has complexity . For , the difference in complexity is more pronounced and, in turn, Algorithm 3 is nearly always faster.
For all experiments performed, the norm of the residual was checked and no significant differences in terms of numerical stability were observed between the different algorithms tested.
3 A recursive blocked algorithm for generalized Sylvester equations with Kronecker structure
In this section, we extend the developments from Section 2 to the second class of operators considered in this work, which have the matrix representation (1.5). The corresponding linear system reads in tensor notation as
[TABLE]
Because of its connection to generalized Sylvester equations [4] explained in the introduction, this equation has a unique solution if and only if the matrix pencil is regular and none of its eigenvalues is an eigenvalue of . In the following, we assume that this condition is satisfied.
Algorithm 4 is the equivalent of Algorithm 1 for reducing (3.1) to quasi-triangular form. The most notable difference is that now a generalized Schur decomposition [8, Sec. 7.7.2] of needs to be computed, using the QZ algorithm.
3.1 Recursion
The rest of this section is concerned with line 4 of Algorithm 4, solving (3.1) with upper quasi-triangular coefficients and upper triangular . Again we proceed recursively and choose such that and such that and . We partition , and split the tensors X and B along their th mode into and , respectively, in accordance with (2.4).
Case 1: . We additionally partition and decouple (3.1) along the first mode:
[TABLE]
with . Both equations take the form of the tensor equation (3.1) with (quasi-)triangular coefficients. We recursively solve for and then solve for , after computing .
Case 2: . Decoupling (3.1) along the th mode gives the two tensor equations
[TABLE]
with . Again, we first solve for and then for .
Algorithm 5 summarizes the described procedure. Compared to Algorithm 2, the largest difference is that the right-hand side updates in lines 7 and 11 require up to matrix multiplications instead of only one. While potentially having an impact on computational time, this has no impact on the asymptotic complexity, which remains O\big{(}n^{d+1}+n_{\min}^{d}n^{d}\big{)}.
3.2 Merging dimensions: triangular case
In analogy to the discussion in Section 2.2, we now discuss the combination of Algorithm 5 with a merging procedure that helps to alleviate the critical dependence of its performance on . Again, we first suppose that all coefficients triangular. This can always be achieved by a variant of Algorithm 4 that uses complex (generalized) Schur decompositions.
Line 2 of Algorithm 5 is replaced with the following procedure. When , the matrix
[TABLE]
is formed explicitly. In turn, the -dimensional tensor equation (3.1) can equivalently be viewed as the -dimensional equation
[TABLE]
with reshaped . For , this corresponds to the triangular generalized Sylvester equation , for which a recursive blocked algorithm has been described in [13].
A straightforward extension of the complexity analysis of Algorithm 3 shows that Algorithm 6 requires O\big{(}n^{d+1}+n_{\min}^{2}n^{d}) flops.
3.3 Merging dimensions: quasi-triangular case
When using real (generalized) Schur decompositions and, in turn, dealing with upper quasi-triangular coefficients , we are facing a situation similar to the one discussed in Section 2.3: The merged coefficient matrix is, in general, not quasi-triangular. The structure of is very similar but not identical with the Laplace-like case. For example, comparing (2.9) with
[TABLE]
we see that the off-diagonal blocks now have quasi-triangular instead of diagonal structure. Nevertheless, the properties and techniques discussed in Section 2.3 carry over verbatim to . In particular, is a block diagonal matrix with diagonal blocks of size at most . Moreover, a perfect shuffle permutation of the diagonal blocks can again be used to further reduce the size of diagonal blocks. For example, applying this permutation to the second diagonal block of the matrix in (3.2) yields:
[TABLE]
This modification allows to apply Algorithm 6 to quasi-triangular matrices without increased complexity.
3.4 Numerical experiments
To give some insight into the performance of Algorithms 5 and 6, we have implemented them in Matlab and conducted numerical experiments in the setting described in Section 2.4. In particular, we again make use of complex (generalized) Schur decompositions, to avoid that the overhead incurred by the techniques described in Section 3.3 distorts the picture. To solve the triangular generalized Sylvester equation in Line 5 of Algorithm 6, we apply sylvester_tri to with .
Choice of .
Figure 3 shows the performance of Algorithms 5 and 6 with respect to the choice of . Compared with Algorithms 2 and 3, see Figure 1, the findings do not differ much. In the following we set for , for when using Algorithm 5, and for , for when using Algorithm 6.
Comparison.
Figure 4 compares the performance of Algorithm 5 and Algorithm 6 with the following “Sylvester solver”: After reducing the coefficients to triangular form and suitably reshaping B, one of the Sylvester equations
[TABLE]
is solved for by calling sylvester_tri. The results from Figure 4 show that Algorithm 6 is always faster than Algorithm 5. The Sylvester solver is slower for sufficiently large ; the difference is most pronounced for . Moreover, the Sylvester solver encounters out of memory errors for , , for , respectively.
4 Conclusions, extensions and future work
We have extended the concept of blocked recursive algorithms to higher-order tensor equations. Both, the complexity estimates and the numerical results, clearly show the importance of combining recursion with merging dimensions in order to arrive at efficient algorithms. For third-order tensor equations, these algorithms seem to constitute the methods of choice. For fourth-order tensor equations with coefficients of nearly equal sizes, reshaping the tensor equation into a Sylvester equation and applying an existing solver is a viable alternative, provided that sufficient memory is available.
The blocked recursive algorithms developed in this work certainly admit extensions to general linear tensor equations taking the form
[TABLE]
assuming that all coefficients are (quasi-)triangular. To transform general coefficients into this form requires the existence of invertible matrices such that is (quasi-)triangular for every . For , this simultaneous triangularization is only possible under strong additional assumptions on the coefficients. A sufficient condition is that each matrix family contains at most two different matrices for . The two classes (1.4) and (1.5) appear to constitute the practically most important examples satisfying this condition.
This work also raises an interesting open question: Is it possible to combine block recursion with low-rank compression, for example in the tensor train format [21], such that the complexity does not grow exponentially with , assuming that the involved ranks stay constant? It would also be interesting to explore which other numerical linear algebra problems allow for the combination of Kronecker product structure with block recursion. The computation of certain matrix functions, such as the matrix square root [6], appears to be a likely candidate.
Acknowledgements.
Daniel Kressner sincerely thanks Michael Steinlechner and Christine Tobler for insightful discussions on the algorithms presented in this work and their implementation.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] B. W. Bader, T. G. Kolda, et al. Matlab tensor toolbox version 2.6. Available from http://www.sandia.gov/~tgkolda/Tensor Toolbox/ , 2015.
- 2[2] R. H. Bartels and G. W. Stewart. Algorithm 432: The solution of the matrix equation A X + X B = C 𝐴 𝑋 𝑋 𝐵 𝐶 AX+XB=C . Communications of the ACM , 15(9):820–826, 1972.
- 3[3] A. Binning. Solving second and third-order approximations to DSGE models: A recursive Sylvester equation solution. Norges Bank Working Paper 18, 2013.
- 4[4] E. K.-W. Chu. The solution of the matrix equations A X B − C X D = E 𝐴 𝑋 𝐵 𝐶 𝑋 𝐷 𝐸 AXB-CXD=E and ( Y A − D Z , Y C − B Z ) = ( E , F ) 𝑌 𝐴 𝐷 𝑍 𝑌 𝐶 𝐵 𝑍 𝐸 𝐹 (YA-DZ,YC-BZ)=(E,F) . Linear Algebra Appl. , 93:93–105, 1987.
- 5[5] T. Dayar. Kronecker modeling and analysis of multidimensional Markovian systems . Springer, Cham, 2018.
- 6[6] E. Deadman, N. J. Higham, and R. Ralha. Blocked Schur algorithms for computing the matrix square root , pages 171–182. Lecture Notes in Comput. Sci. 7782. Springer, 2013.
- 7[7] E. Elmroth, F. Gustavson, I. Jonsson, and B. Kågström. Recursive blocked algorithms and hybrid data structures for dense matrix library software. SIAM Rev. , 46(1):3–45, 2004.
- 8[8] G. H. Golub and C. F. Van Loan. Matrix computations . Johns Hopkins University Press, Baltimore, MD, fourth edition, 2013.
