Robust Task-Parallel Solution of the Triangular Sylvester Equation
Angelika Schwarz, Carl Christian Kjelgaard Mikkelsen

TL;DR
This paper introduces a robust, task-parallel solver for the triangular Sylvester equation that prevents overflow through dynamic scaling, matching the performance of non-robust solvers while expanding solvable problem scope.
Contribution
It develops a level-3 BLAS-based task-parallel solver with overflow protection, enhancing robustness without sacrificing performance.
Findings
Achieves robustness comparable to LAPACK's dtrsyl
Prevents overflow in backward substitution
Maintains similar performance to non-robust solvers
Abstract
The Bartels-Stewart algorithm is a standard approach to solving the dense Sylvester equation. It reduces the problem to the solution of the triangular Sylvester equation. The triangular Sylvester equation is solved with a variant of backward substitution. Backward substitution is prone to overflow. Overflow can be avoided by dynamic scaling of the solution matrix. An algorithm which prevents overflow is said to be robust. The standard library LAPACK contains the robust scalar sequential solver dtrsyl. This paper derives a robust, level-3 BLAS-based task-parallel solver. By adding overflow protection, our robust solver closes the gap between problems solvable by LAPACK and problems solvable by existing non-robust task-parallel solvers. We demonstrate that our robust solver achieves a similar performance as non-robust solvers.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
11institutetext: Department of Computing Science, Umeå University, Sweden
11email: {angies,spock}@cs.umu.se
Robust Task-Parallel Solution of the Triangular Sylvester Equation
Angelika Schwarz
Carl Christian Kjelgaard Mikkelsen
Abstract
The Bartels-Stewart algorithm is a standard approach to solving the dense Sylvester equation. It reduces the problem to the solution of the triangular Sylvester equation. The triangular Sylvester equation is solved with a variant of backward substitution. Backward substitution is prone to overflow. Overflow can be avoided by dynamic scaling of the solution matrix. An algorithm which prevents overflow is said to be robust. The standard library LAPACK contains the robust scalar sequential solver dtrsyl. This paper derives a robust, level-3 BLAS-based task-parallel solver. By adding overflow protection, our robust solver closes the gap between problems solvable by LAPACK and problems solvable by existing non-robust task-parallel solvers. We demonstrate that our robust solver achieves a similar performance as non-robust solvers.
Keywords:
Overflow protection, task parallelism, triangular Sylvester equation, real Schur form
1 Introduction
The Bartels-Stewart algorithm is a standard approach to solving the general Sylvester equation
[TABLE]
where , and are dense. It is well-known that (1) has a unique solution if and only if the eigenvalues of and of satisfy for all and .
Sylvester equations occur in numerous applications including control and systems theory, signal processing and condition number estimation, see [8] for a summary. The case of corresponds to the continuous-time Lyapunov matrix equation, which is central in the analysis of linear time-invariant dynamical systems.
The Bartels-Stewart algorithm solves (1) by reducing and to upper quasi-triangular form and . This reduces the problem to the solution of the triangular Sylvester equation
[TABLE]
During the solution of (2) through a variant of backward substitution, the entries of can exhibit growth, possibly exceeding the representable floating-point range. To avoid such an overflow, the LAPACK 3.7.0 routine dtrsyl uses a scaling factor to dynamically downscale the solution. We say that an algorithm is robust if it cannot exceed the overflow threshold . With the scaling factor , the triangular Sylvester equation reads This paper focuses on the robust solution of the triangular Sylvester equation and improves existing non-robust task-parallel implementations for (2) by adding protection against overflow. This closes the gap between the class of problems solvable by existing task-parallel solvers and the class of problems solvable by LAPACK. Consequently, more problems can be solved efficiently in parallel through the Bartels-Stewart method.
We now describe the Bartels-Stewart algorithm for solving (1), see [1, 2, 8]. The algorithm computes the real Schur decompositions of and
[TABLE]
using orthogonal transformations and . Using (3), the general Sylvester equation (1) is transformed into the triangular Sylvester equation
[TABLE]
The solution of the original system (1) is given by . The real Schur forms and attain the shapes
[TABLE]
where the diagonal blocks and are either 1-by-1 or 2-by-2. The right-hand side is partitioned conformally
[TABLE]
Adopting a block perspective, (2) reads
[TABLE]
for all blocks and . A straight-forward implementation of (6) is Algorithm 1. The algorithm starts at the bottom left corner () and processes the block columns bottom-up from left to right. The flop count of the algorithm approximately corresponds to two backward substitutions and amounts to flops.
The stability of Algorithm 1 has been summarized in Higham [3]. In essence, the algorithm inherits the stability from backward substitution provided that the small Sylvester equations (line 4) can be solved stably.
The rest of this paper is structured as follows. In Section 2 we formalize the definition of a robust algorithm and derive a new robust algorithm for solving the triangular Sylvester equation. Section 3 describes the execution environment used in the numerical experiments in Section 4. Section 5 summarizes the results.
2 Robust Algorithms for Triangular Sylvester Equations
In this section, we address the robust solution of the triangular Sylvester equation. The goal is to dynamically compute a scaling factor such that the solution of the scaled triangular Sylvester equation
[TABLE]
can be obtained without ever exceeding the overflow threshold . We derive two robust algorithms for solving (7). The first scalar algorithm can be viewed as an enhancement of LAPACK’s dtrsyl by adding overflow protection to the linear updates. The second tiled algorithm redesigns the first algorithm such that most of the computation is executed as matrix-matrix multiplications.
2.1 Scalar Robust Algorithm
The central building block for adding robustness is ProtectUpdate, introduced by Kjelgaard Mikkelsen and Karlsson in [5]. ProtectUpdate computes a scaling factor such that the matrix update cannot overflow. ProtectUpdate uses the upper bounds , and to evaluate the maximum growth possible in the update. Provided that , and , ProtectUpdate computes a scaling factor such that .
We use ProtectUpdate to protect the left and the right updates in the triangular Sylvester equation. We protect right updates by applying the scaling factor as . We protect left updates by appling the scaling factor as .
A solver for the triangular Sylvester equation requires the solution of small Sylvester equations , . Since and are at most 2-by-2, these small Sylvester equations can be converted into linear systems of size at most 4-by-4, see [1]. We solve these linear systems robustly through Gaussian elimination with complete pivoting. This process requires linear updates and divisions to be executed robustly. We use ProtectUpdate for the small linear updates and guard divisions with ProtectDivision from [5].
Algorithm 2 RobustSyl adds overflow protection to Algorithm 1. RobustSyl is dominated by level-2 BLAS-like thin linear updates. The next section develops a solver, which relies on efficient level-3 BLAS operations and uses RobustSyl as a basic building block.
2.2 Tiled Robust Algorithm
Solvers for the triangular Sylvester equation can be expressed as tiled algorithms [7] such that most of the computation corresponds to matrix-matrix multiplications. For this purpose, the matrices are partitioned into conforming, contiguous submatrices, so-called tiles. In order to decouple the tiles, the global scaling factor is replaced with local scaling factors, one per tile of . The association of a tile with a scaling factor leads to augmented tiles.
Definition 1
An augmented tile consists of a scalar and a matrix and represents the scaled matrix . We say that two augmented tiles and are equivalent and we write if and only if .
Definition 2
We say that two augmented tiles and are consistently scaled if .
The idea of associating a scaling factor with a vector was introduced by Kjelgaard Mikkelsen and Karlsson [5, 6] who use augmented vectors to represent scaled vectors. We generalize their definition and associate a scaling factor with a tile. Their definition of consistently scaled vectors generalizes likewise to consistently scaled tiles.
Let and be as in (4). A partitioning of into -by- tiles and into -by- tiles such that 2-by-2 blocks on the diagonals are not split and the diagonal tiles are square induces a partitioning of and . We then solve the tiled equations
[TABLE]
without explicitly forming any of the products , and . Note that (8) is structurally identical to (6). The solution of (8) requires augmented tiles to be updated. Algorithm 3 RobustUpdate executes such an update robustly. Combining RobustUpdate and RobustSyl leads to Algorithm 4, which solves (7) in a tiled fashion. The global scaling factor corresponds to the smallest of the local scaling factors . The solution is obtained by scaling the tiles consistently.
Algorithm 4 can be parallelized with tasks. Each function call corresponds to a task. RobustSyl on has outgoing dependences to RobustUpdate modifying , and , . The incoming dependences of a RobustSyl task on are satisfied when all updates modifying have been completed. The updates to require exclusive write access, which we achieve with a lock.
RobustUpdate tasks rely on upper bounds and . Since the computation of norms is expensive, the matrix norms are precomputed and recorded. This is realized through perfectly parallel Bound tasks. To limit the amount of dependences to be handled by the runtime system, a synchronization point separates this preprocessing step and RobustSyl/RobustUpdate tasks.
Another synchronization point precedes the consistency scaling. The local scaling factors are reduced sequentially to the global scaling factor . The consistency scaling of the tiles is executed with independent Scale tasks.
3 Experiment Setup
This section describes the setup of the numerical experiments. We specify the hardware, the solvers and their configuration, and the matrices of the triangular Sylvester equation.
Execution Environment
The experiments were executed on an Intel Xeon E5-2690v4 (“Broadwell”) node with 28 cores arranged in two NUMA islands with 14 cores each. The theoretical peak performance in double-precision arithmetic is 41.6 GFLOPS/s for one core and 1164.8 GFLOPS/s for a full node. In the STREAM triad benchmark the single core memory bandwidth was measured at 19 MB/s; the full node reached 123 MB/s.
We use the GNU compiler 6.4.0 and link against single-threaded OpenBLAS 0.2.20 and LAPACK 3.7.0. The compiler optimization flags are -O2 -xHost. We forbid migration of threads and fill one NUMA island with threads before assigning threads to the second NUMA island by setting OMP_PROC_BIND to close.
Software
This section describes the routines used in the numerical experiments. The first two routines are non-robust, i.e., the routines solve .
- •
FLA_Sylv. The libflame version 5.1.0-58 [9, 7] solver partitions the problem into tiles and executes the linear updates as matrix-matrix multiplications.
- •
FLASH_Sylv. This libflame routine is the supermatrix version of FLA_Sylv and introduces task parallelism.
The following three routines are robust and solve .
- •
dtrsyl. The LAPACK 3.7.0 routine realizes overflow protection with a global scaling factor. Any scaling events triggers scaling of the entire matrix .
- •
recsy. The Fortran library release 2009-12-28 [4] offers recursive blocked solvers for a variety of Sylvester and Lyapunov equations. If overflow protection is triggered at some recursion level, the required scaling is propagated.
- •
drsylv. Our solver can be viewed as a robust version of FLASH_Sylv. Due to the usage of local scaling factors, our solver exhibits the same degree of parallelism as FLASH_Sylv.
Test Matrices
We design the system matrices such that the growth during the solve is controlled. This allows us to (a) design systems that the non-robust solvers must be able to solve and (b) examine the cost of robustness by increasing the growth and, in turn, the amount of scaling necessary.
The matrix and the upper triangular part of and are filled with ones. The diagonal blocks are set to and , where is either the 1-by-1 block given by or the 2-by-2 block given by
[TABLE]
The magnitude of the diagonal entries of and is controlled by and . This also holds for 2-by-2 blocks, which encode a complex conjugate pair of eigenvalues. The 2-by-2 blocks cannot be reduced to triangular form using a real similarity transformation. A unitary transformation, however, can transform the matrix into triangular shape. As an example, consider the transformation into triangular shape for the 5-by-5 matrix
[TABLE]
It is clear that a small value of introduces growth during the backward substitution. Hence, the choice of and control the growth during the solve.
Tuning
The tile sizes for FLASH_Sylv and drsylv were tuned using and . For each core count, a sweep over [100, 612] with a step size of 32 was evaluated three times and the tile size with the best median runtime was chosen.
Reliability
We extend the relative residual defined by Higham [3] with the scaling factor and evaluate
[TABLE]
where . We report the median runtime of three runs.
4 Performance Results
This section presents three sets of performance results. First, the five solvers are executed in sequential mode. Second, the parallel scalability is analyzed. Third, the cost of robustness is investigated.
Sequential Comparison without Numerical Scaling
Figure 1 compares the solvers using and . With this choice, dynamic downscaling of is not necessary. Our solver drsylv is slightly slower than FLA_Sylv and FLASH_Sylv, possibly because the overhead from robustness cannot be amortized. Since the gap between recsy and dtrsyl on the one hand and FLA_Sylv, FLASH_Sylv and drsylv on the other hand grows with increasing matrix sizes, we restrict the parallel experiments to a comparison between drsylv and FLASH_Sylv.
Strong Scalability without Numerical Scaling
We examine the strong scalability of FLASH_Syl and drsylv on one shared memory node. Perfect strong scalability manifests itself in a constant efficiency when the number of cores is increased for a fixed problem. Figure 2 shows the results for and . Robustness as implemented in drsylv does not hamper the scalability on systems that do not require scaling.
Cost of Robustness
Figure 3 (right) analyzes the cost of robustness. The amount of scaling necessary is controlled by fixing and varying . The experiments with do not require scaling. The choice triggers scaling for a few tiles. Frequent scaling is required for . The cost of RobustUpdate increases when some of the three input tiles require scaling. In the worst case all three tiles have to be rescaled in every update. Figure 3 (left) shows that despite of robustness a decent fraction of the peak performance is reached.
5 Conclusion
The solution of the triangular Sylvester equation is a central step in the Bartels-Stewart algorithm. During the backward substitution, the components of the solution can exhibit growth, possibly exceeding the representable floating-point range. This paper introduced a task-parallel solver with overflow protection. By adding overflow protection, task-parallel solvers can now solve the same set of problems that is solvable with LAPACK.
The numerical experiments revealed that the overhead of overflow protection is negligible when scaling is not needed. Hence, the non-robust solver offers no real advantage over our robust solver. When scaling is necessary, our robust algorithm automatically applies scaling to prevent overflow. While scaling increases the runtime, it guarantees a representable result. The computed solution can be evaluated in the context of the user’s application. This certainty cannot be achieved with a non-robust algorithm.
Acknowledgements
The authors thank the research group for their support. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 671633. Support was received by eSSENCE, a collaborative e-Science programme funded by the Swedish Government via the Swedish Research Council (VR).
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] Bartels, R.H., Stewart, G.W.: Solution of the Matrix Equation A X + X B = C 𝐴 𝑋 𝑋 𝐵 𝐶 AX+XB=C . Communications of the ACM 15 (9), 820–826 (1972)
- 2[2] Golub, G.H., Van Loan, C.F.: Matrix Computations. John Hopkins University Press, 3rd edn. (1996)
- 3[3] Higham, N.J.: Accuracy and Stability of Numerical Algorithms. SIAM, 2nd edn. (2002)
- 4[4] Jonsson, I., Kågström, B.: Recursive blocked algorithms for solving triangular systems – Part I: One-sided and coupled Sylvester-type matrix equations. ACM TOMS 28 (4), 392–415 (2002)
- 5[5] Kjelgaard Mikkelsen, C.C., Karlsson, L.: Robust Solution of Triangular Linear Systems (2017), NLAFET Working Note 9
- 6[6] Mikkelsen, C.C.K., Karlsson, L.: Blocked Algorithms for Robust Solution of Triangular Linear Systems. In: International Conference on Parallel Processing and Applied Mathematics. pp. 68–78. Springer (2017)
- 7[7] Quintana-Ortí, E.S., Van De Geijn, R.A.: Formal Derivation of Algorithms: The Triangular Sylvester Equation. ACM TOMS 29 (2), 218–243 (2003)
- 8[8] Simoncini, V.: Computational Methods for Linear Matrix Equations. SIAM Review 58 (3), 377–441 (2016)
