Program Generation for Linear Algebra Using Multiple Layers of DSLs

Daniele G. Spampinato (1); Diego Fabregat-Traver (2); Markus P\"uschel; (1); Paolo Bientinesi (2) ((1) ETH Zurich; (2) RWTH Aachen University)

arXiv:1906.08613·cs.MS·June 21, 2019

Program Generation for Linear Algebra Using Multiple Layers of DSLs

Daniele G. Spampinato (1), Diego Fabregat-Traver (2), Markus P\"uschel, (1), Paolo Bientinesi (2) ((1) ETH Zurich, (2) RWTH Aachen University)

PDF

Open Access

TL;DR

This paper proposes a multi-layered domain-specific language (DSL) approach to generate optimized linear algebra programs tailored to specific application needs, enhancing flexibility beyond traditional libraries.

Contribution

It introduces a novel multi-layer DSL framework for program generation that customizes linear algebra routines for particular sizes, interfaces, and architectures.

Findings

01

Enables flexible and optimized linear algebra program generation.

02

Improves upon traditional library limitations in flexibility.

03

Demonstrates tailored routines for specific application requirements.

Abstract

Numerical software in computational science and engineering often relies on highly-optimized building blocks from libraries such as BLAS and LAPACK, and while such libraries provide portable performance for a wide range of computing architectures, they still present limitations in terms of flexibility. We advocate a domain-specific program generator capable of producing library routines tailored to the specific needs of the application in terms of sizes, interface, and target architecture.

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsParallel Computing and Optimization Techniques · Logic, programming, and type systems · Numerical Methods and Algorithms

Full text

Program Generation for Linear Algebra Using Multiple Layers of DSLs

Daniele G. Spampinato1, Diego Fabregat-Traver2, Markus Püschel1 and Paolo Bientinesi2

1Department of Computer Science, ETH Zurich, Switzerland

2Aachen Institute for Advanced Study in Computational Engineering Science, RWTH Aachen, Germany

Email: 1{danieles, pueschel}@inf.ethz.ch, 2{fabregat, pauldj}@aices.rwth-aachen.de

Motivation. Numerical software in computational science and engineering often relies on highly-optimized building blocks from libraries such as BLAS [1] and LAPACK [2]. Examples of such blocks include, but are not limited to, matrix multiplications, matrix factorizations, and solvers for Sylvester-like equations. While the BLAS and LAPACK libraries have been very successful in providing portable performance for a wide range of computing architectures, they still present severe limitations in terms of flexibility. First, these libraries are optimized for large matrices (of sizes at least in the hundreds). Second, the interface in terms of operations and matrix structures they provide specifically targets computational science. These limitations can render those libraries suboptimal in performance or code size for applications in communications, graphics, and control, which may require smaller scale computations and a more flexible interface.

Contributions. To overcome these limitations, we advocate a domain-specific program generator capable of producing library routines tailored to the specific needs of the application in terms of sizes, interface, and target architecture. In this work, we introduce such a generator that translates a desired linear algebra computation, annotated with matrix properties, into optimized C code, optionally vectorized with intrinsics. The generator unites prior work on two independent frameworks: The FLAME-based Cl1ck [3, 4, 5] and LGen [7, 8], which was designed after Spiral [6].

For a given linear algebra problem such as a matrix factorization, matrix inversion, or equation to be solved, Cl1ck synthesizes families of blocked algorithms that rely on basic computations provided by BLAS. These, in turn, are compiled into efficient, vectorized C code by (an extension of) LGen.

Both tools use internally DSLs to perform the necessary reasonings and transformations at different levels of abstraction including partitioning and traversal of operands and the preferred tiling shapes for locality and vectorization. The integration of Cl1ck and LGen yields an end-to-end multi-DSL program generator that produces tailored library routines.

As case studies, we consider the Cholesky decomposition, and solvers for the continuous-time Lyapunov and Sylvester equations. We compare the performance of our generated code with the commercial Intel MKL showing competitive results.

Generator structure. Fig. 1 illustrates the structure of our linear algebra generator. The input is an annotated linear algebra expression, e.g., $X^{T}X=A$ for a Cholesky decomposition, where $A$ is the input matrix, symmetric, positive definite, and $X$ is the output matrix and upper triangular. The output of the generator is an optimized C function, optionally vectorized with intrinsics.

The generator integrates the processing stages from Cl1ck (left column in Fig. 1) and LGen (right column) into a single framework. In particular, the stages on the left take the input expression and synthesize for it several algorithms. To do so, first the input equation is transformed into one or more Partitioned Matrix Expressions (PMEs), i.e., a recursive definition of the equation; then, the PMEs are decomposed to identify loop invariants; finally, a family of loop-based algorithms are built around these loop invariants using formal methods techniques. The synthesized algorithms are mathematically equivalent, but may yield different performance.

Given a particular algorithm, the stages on the right in Fig. 1 proceed with its implementation. In particular, matrices are tiled for locality and summation-based definitions of each statement in the algorithm is determined taking matrix structures into account. Decisions at this stage are either search-based (e.g., tile sizes) or model-based (e.g., matrices traversal). Vectorization requires an additional level of tiling to expose operations whose sizes match the vector length of the CPU (see $\nu$ -BLACs in [7]). These are later mapped to a small set of pre-implemented basic blocks efficiently vectorized with intrinsics.

Finally, autotuning (dashed arrows in Fig. 1) is used to select the fastest code searching among alternative algorithms and implementation parameters.

A multi-DSL approach. All processing stages in Fig. 1 but the last one (code-level optimization) operate on mathematical DSLs at different levels of abstraction. More specifically, the LA language is a subset of standard linear algebra notation. The p-LA (partitioned LA) language is a refinement of LA which represents equations in terms of partitioned operands (submatrices and subvectors) together with knowledge inferred during the partitioning of the operands. The lp-LA (loop-based p-LA) notation introduces programming constructs such as explicit loops, and encodes algorithms at a high-level of abstraction, resembling the FLAME notation [3].

The LL (Linear algebra Language) and $\Sigma$ -LL DSLs represent basic linear algebra operations but at a lower level of abstraction. They include explicit gather and scatter operators on matrices and capture matrix structures mathematically using techniques from polyhedral compilation [9]. The LL backend stage (bottom-left in Fig. 1) is used to bridge lp-LA and LL notations.

Preliminary performance results. In Fig. 2 we show performance results for three problems: 2 the Cholesky decomposition, and the solution of triangular, continuous-time 2 Lyapunov and 2 Sylvester equations. Computations are performed in double precision, and all matrices are square with dimensions divisible by four (the vector length of AVX). We compare the AVX-vectorized routines generated by our system specific for each input size with the corresponding routines from the LAPACK implementation in Intel MKL (which is generic in size).

For each case, our generator explores the implementation of different algorithms. For example, in Fig. 2 the final code is chosen among four different variants. The specialized code produced by our generator can be as fast as MKL (Fig. 2) or outperforms it (between $4\times$ and $4.8\times$ in Fig. 2, and between $2\times$ and $2.5\times$ in Fig. 2).

Limitations and future work. We presented ongoing work and are currently addressing a few limitations of our generator. In particular: (1) the number of temporary submatrices in the algorithm construction phase can be reduced; (2) more complete and efficient vectorization including for sizes not divisible by the vector length.

Finally, we are considering the replacement of the current autotuning loops with a model-based approach.

Bibliography9

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Dongarra et al. [1990] J. J. Dongarra et al . A set of level 3 basic linear algebra subprograms. ACM Trans. on Mathematical Software (TOMS) , 16(1):1–17, 1990.
2Anderson et al. [1999] E. Anderson et al . LAPACK Users’ Guide . Society for Industrial and Applied Mathematics, third edition, 1999.
3P. Bientinesi et al. [2005] P. Bientinesi, J. A. Gunnels, M. E. Myers, E. S. Quintana-Ortí, and R. A. van de Geijn. The science of deriving dense linear algebra algorithms. ACM Trans. on Mathematical Software (TOMS) , 31(1):1–26, 2005.
4Fabregat and Bientinesi [2011 a] D. Fabregat-Traver and P. Bientinesi. Automatic Generation of Loop-Invariants for Matrix Operations. In Computational Science and Its Applications (ICCSA) , pp. 82–92, 2011.
5Fabregat and Bientinesi [2011 b] D. Fabregat-Traver and P. Bientinesi. Knowledge-Based Automatic Generation of Partitioned Matrix Expressions. In Computer Algebra in Scientific Computing (CASC) , vol. 6885 of Lecture Notes in Computer Science (LNCS) , pp. 144–157. Springer, 2011.
6Püschel et al. [2011] M. Püschel, F. Franchetti, and Y. Voronenko. Encyclopedia of Parallel Computing , chap. Spiral. Springer, 2011.
7Spampinato and Püschel [2014] D. G. Spampinato and M. Püschel. A basic linear algebra compiler. In Code Generation and Optimization (CGO) , pp. 23–32, 2014.
8Spampinato and Püschel [2016] D. G. Spampinato and M. Püschel. A basic linear algebra compiler for structured matrices. In Code Generation and Optimization (CGO) , pp. 117–127, 2016.