Cache-oblivious Matrix Multiplication for Exact Factorisation

Fatima K. Abu Salem; Mira Al Arab

arXiv:1705.04807·cs.NA·May 16, 2017


TL;DR

This paper introduces a cache-oblivious matrix multiplication method using Morton-hybrid space-filling curves, significantly improving runtime for exact matrix factorization over finite fields.

Contribution

It develops a novel cache-oblivious approach for matrix multiplication tailored for parallel TU decomposition with Morton-hybrid layout, enhancing efficiency.

Findings

1. Orders of magnitude faster sequential evaluation

2. Low span in recursive matrix multiplication

3. Effective incorporation into parallel decomposition

Abstract

We present a cache-oblivious adaptation of matrix multiplication to be incorporated in the parallel TU decomposition for rectangular matrices over finite fields, based on the Morton-hybrid space-filling curve representation. To realise this, we introduce the concepts of alignment and containment of sub-matrices under the Morton-hybrid layout. We redesign the decompositions within the recursive matrix multiplication to force the base case to avoid all jumps in address space, at the expense of extra recursive matrix multiplication (MM) calls. We show that the resulting cache-oblivious adaptation has low span, and our experiments show that its sequential evaluation order yields orders of magnitude improvement in run-time, despite the recursion overhead.


Table 1: Percentage Improvement in Runtime of the cache-oblivious algorithm

% Inc. in Calls	% of Exp.	Avg. Imp.	Min. Imp.	Max. Imp.
0	6.8	96.15	95.40	97.73
3.13	1.7	95.94	95.40	96.59
6.25	1.7	96.01	95.40	96.59
24.14	1.3	95.69	95.19	96.19
34.38	9.4	95.93	94.83	96.61
37.5	20.3	95.82	94.83	96.61
38.57	0.9	95.94	95.73	96.15
41.8	0.9	95.75	95.74	95.76
42.77	0.9	95.27	94.83	95.73
46.09	0.9	95.73	94.83	96.61
80.57	2.6	95.88	95.48	96.15
84.77	5.1	95.64	94.87	96.20
100	18.6	96.03	95.35	97.73
106.25	1.7	95.67	95.40	96.01
112.5	1.7	95.93	95.35	96.59
168.75	10.3	95.60	94.83	96.61
175	5.1	95.93	94.92	96.61
300	10.3	95.69	95.35	97.73

Equations

$e_{NE_{M^{k}}}=\alpha_{NE_{M^{k}}}+\lambda^{\prime}-1=\alpha_{M^{k}}+2\cdot\lambda^{\prime}-1$

$e_{SW_{M^{k}}}=\alpha_{SW_{M^{k}}}+\lambda^{\prime}-1=\alpha_{M^{k}}+3\cdot\lambda^{\prime}-1.$

$c_{\mathcal{W}}=extract\_j(e_{SW_{M^{k}}})-extract\_j(\sigma_{S_{M^{k}}})+1,$

$\begin{array}{ccc}
NW_{B^{k}}\text{ and }NW_{C^{k}}&&SW_{B^{k}}\text{ and }NW_{C^{k}}\\
NW_{B^{k}}\text{ and }NE_{C^{k}}&&SW_{B^{k}}\text{ and }NE_{C^{k}}\\
NW_{B^{k}}\text{ and }SW_{C^{k}}&&SW_{B^{k}}\text{ and }SW_{C^{k}}\\
NW_{B^{k}}\text{ and }SE_{C^{k}}&&SW_{B^{k}}\text{ and }SE_{C^{k}}\\
&&\\
NE_{B^{k}}\text{ and }NW_{C^{k}}&&SE_{B^{k}}\text{ and }NW_{C^{k}}\\
NE_{B^{k}}\text{ and }NE_{C^{k}}&&SE_{B^{k}}\text{ and }NE_{C^{k}}\\
NE_{B^{k}}\text{ and }SW_{C^{k}}&&SE_{B^{k}}\text{ and }SW_{C^{k}}\\
NE_{B^{k}}\text{ and }SE_{C^{k}}&&SE_{B^{k}}\text{ and }SE_{C^{k}}\\
\end{array}$


Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Cryptography and Residue Arithmetic

Full text

Cache-oblivious Matrix Multiplication for Exact Factorisation

Fatima K. Abu Salem (corresponding author; e-mail: [email protected])

Computer Science Department, American University of Beirut,

P. O. Box 11-0236, Riad El Solh, Beirut 1107 2020, Lebanon

Mira Al Arab (e-mail: [email protected])

Computer Science Department, American University of Beirut,

P. O. Box 11-0236, Riad El Solh, Beirut 1107 2020, Lebanon


Keywords: Locality of reference, Cache-oblivious Algorithms, Space-filling Curves, Morton-hybrid Layout, TU Decomposition, Finite Fields

1 Introduction

We present a cache-oblivious adaptation of matrix multiplication to be incorporated in the parallel TU decomposition for rectangular matrices over finite fields, based on the Morton-hybrid space-filling curve representation. Exact triangulisation of matrices is crucial for a large range of problems in Computer Algebra and Algorithmic Number Theory, where a basis of the solution set of the associated linear system is required. Our focal algorithm of reference is the TURBO algorithm of Dumas et al. [7] for exact LU decomposition. This algorithm recurses on rectangular and potentially singular matrices. TURBO significantly reduces the volume of communication on distributed systems, and retains optimal work and linear span. TURBO can also compute the rank in an exact manner. As benchmarked against some of the most efficient current exact elimination algorithms in the literature, TURBO incurs low synchronisation costs and reduces the communication cost featured in [9, 10] by a factor of one third when used with only one level of recursion on 4 processors. A significant part of TURBO consists of matrix factorisation, and so adapting this kernel in a cache-oblivious fashion will ultimately contribute to a cache-oblivious factorisation algorithm. That TURBO has low depth makes adapting its sequential version to the cache-oblivious model more telling. In particular, nested parallel algorithms for which the natural sequential execution has low cache complexity will also attain good cache complexity on parallel machines with private or shared caches [4].

At the base case of TURBO the sub-matrices reach a given threshold, and so one can take advantage of cache effects. To the best of our knowledge, no cache-oblivious (or cache-aware) algorithms for exact linear algebra exist in the literature. We pursue a cache-oblivious adaptation using space-filling curves. TURBO requires index conversion routines between the chosen space-filling curve and the Cartesian order, due to the row and column permutations. In [1], using a detailed analysis of the number of bit operations required for index conversion, and filtering the cost of lookup tables that represent the recursive decomposition of the Hilbert curve, we have shown that the Morton-hybrid order incurs the least cost for index conversion routines as compared to the Hilbert, Peano, or Morton orders. The Morton order is the recursive $Z$-shaped space-filling curve (see the Morton refinement figure). The Morton-hybrid order stops decomposing when the sub-matrices attain a threshold dimension $T^{\prime}\times T^{\prime}$ [2], and each such sub-matrix is stored in row-major order. At such a level, say when the sub-matrix fits in cache, the overhead for maintaining the curve representation outweighs the reduction in cache complexity. In reference to the literature cited in this manuscript around the Morton order and its hybrid, this curve representation improves significantly on the temporal locality of various matrix algorithms such as naive multiplication, LU decomposition, and QR factorisation.

In this work, we introduce the concepts of alignment and containment of sub-matrices under the Morton-hybrid layout, and develop the full details of an MM algorithm that preserves the alignment and containment of sub-matrices across the recursive steps of the matrix factorisation. We do this by redesigning the decompositions within the recursive MM to force the base case to avoid all jumps in address space, at the expense of extra recursive MM calls. We show that the resulting cache-oblivious adaptation retains the optimal work and critical path length of the default MM and is thus highly parallel. Our experiments confirm that the recursion overhead in the Morton-hybrid MM is negligible and that the algorithm achieves a significant reduction in run-time thanks to its improved temporal locality.

Before proceeding, we begin with a brief description of the TU algorithm. Consider a rectangular matrix $A$ over a field $\mathbb{F}$, where $A$ may be singular. $A$ is triangulated into the product of two matrices $T$ and $U$, such that $A=T\cdot U$, where $U$ is an upper triangular matrix, and $T$ has some “$T$” patterns. This is done in a series of recursive steps on rectangular and potentially singular matrices, relaxing the condition for generating a strictly lower triangular matrix: (1) recursive TU decomposition in SE, SW, NE, and NW, and (2) virtual row and column permutations needed to re-order the blocks to yield the matrix $U$. For brevity, we omit the full details of the algorithm and refer the reader to [7] for a full account of TURBO.

2 Non-Aligned Rectangular Sub-matrix Multiplication Within The Recursion

Consider Morton-hybrid matrices $A$ , $B$ , and $C$ and let $S_{A}$ , $S_{B}$ , and $S_{C}$ be random sub-matrices of $A$ , $B$ , and $C$ respectively, for which one has to compute $S_{A}=S_{B}\cdot S_{C}$ . This is a typical scenario encountered during the TU decomposition. To illustrate further, consider Fig. 7. Each integer appearing in the matrices in that figure represents the corresponding Morton-hybrid index of the element occupying it. The sub-matrices on which the multiplication is performed do not begin at the first entry of a Morton-hybrid sub-matrix, hence the concept of an aligned versus non-aligned Morton-hybrid sub-matrix.

An aligned sub-matrix is a $2^{a}\cdot T\times 2^{b}\cdot T$ sub-matrix of a Morton-hybrid matrix that begins at the first entry of a row-major sub-matrix, where $T$ denotes the threshold dimension $T^{\prime}$ and $a$ and $b$ are non-negative integers. A non-aligned sub-matrix is a sub-matrix of a Morton-hybrid matrix that does not satisfy this condition.

Corollary 2.1

The Cartesian index of the first entry of an aligned sub-matrix is of the form $(k_{1}\cdot T,k_{2}\cdot T)$, for some non-negative integers $k_{1}$ and $k_{2}$.

Proof: By its definition, an aligned sub-matrix $A_{M}$ of a Morton-hybrid matrix $M$ starts at the first entry of some row-major sub-matrix $S_{M}$ of $M$. Since the row-major sub-matrix $S_{M}$ is of dimensions $T^{\prime}\times T^{\prime}$, the Cartesian index of the first entry of $S_{M}$ is given by $(k_{1}\cdot T,k_{2}\cdot T)$, for some non-negative integers $k_{1}$ and $k_{2}$. Hence the aligned sub-matrix $A_{M}$ begins at an element of Cartesian index $(k_{1}\cdot T,k_{2}\cdot T)$.

Corollary 2.2

If an aligned sub-matrix is $T^{\prime}\times T^{\prime}$ , then it is row-major.

Proof: Let $A_{M}$ be a $T^{\prime}\times T^{\prime}$ aligned sub-matrix of a Morton-hybrid matrix $M$ . From the definition of an aligned matrix, we know that $A_{M}$ begins at the first entry of a row-major sub-matrix $S_{M}$ of $M$ . According to the Morton-hybrid layout, all row-major sub-matrices of $M$ , including $S_{M}$ are $T^{\prime}\times T^{\prime}$ . Since $A_{M}$ is also $T^{\prime}\times T^{\prime}$ , then $A_{M}$ must be $S_{M}$ and hence is row-major.

An example of a non-aligned sub-matrix of a Morton-hybrid matrix with $T^{\prime}=4$ is shown in red in Fig. 7. An aligned sub-matrix is shown in green.
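The definition of alignment can be checked mechanically from Cartesian data. The following is a minimal sketch, not taken from the paper: the function names and the 0-based `(i0, j0, rows, cols)` description of a sub-matrix are assumptions made for illustration, and `T` stands for the threshold dimension $T^{\prime}$.

```python
def is_power_of_two(n: int) -> bool:
    """True when n = 2^a for some integer a >= 0."""
    return n > 0 and (n & (n - 1)) == 0


def is_aligned(i0: int, j0: int, rows: int, cols: int, T: int) -> bool:
    """Check the two conditions of the definition (hypothetical helper).

    (i0, j0) is the 0-based Cartesian index of the first entry;
    rows x cols are the dimensions of the sub-matrix.
    """
    # Start condition: the first entry is the first entry of a T x T tile.
    starts_on_tile = (i0 % T == 0) and (j0 % T == 0)
    # Size condition: dimensions of the form 2^a * T and 2^b * T.
    rows_ok = rows % T == 0 and is_power_of_two(rows // T)
    cols_ok = cols % T == 0 and is_power_of_two(cols // T)
    return starts_on_tile and rows_ok and cols_ok
```

For instance, with `T = 4`, an $8\times 4$ block starting at $(4,8)$ passes the check, whereas a $3\times 3$ block starting at $(1,1)$ fails both conditions.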

Next, we relate the lack of alignment of sub-matrices to the recursive accessing of these sub-matrices and discuss the implicated problems.

2.1 Non-Aligned Sub-Matrices and loss of locality

A sub-matrix $S_{M}$ of a Morton-hybrid matrix $M$ is said to be contained if $S_{M}$ lies completely within a sub-matrix of $M$ ordered in a row-major fashion. Otherwise, we say that $S_{M}$ is scattered.
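Containment can be tested without computing any Morton-hybrid index. The sketch below assumes, in line with Cor. 2.1, that the row-major sub-matrices are the $T\times T$ tiles whose corners sit at multiples of $T$; the function name and the 0-based Cartesian description of the sub-matrix are illustrative assumptions.

```python
def is_contained(i0: int, j0: int, rows: int, cols: int, T: int) -> bool:
    """True when the rows x cols block starting at (i0, j0) lies in one T x T tile."""
    # First and last row must fall in the same band of T rows ...
    same_tile_row = (i0 // T) == ((i0 + rows - 1) // T)
    # ... and first and last column in the same band of T columns.
    same_tile_col = (j0 // T) == ((j0 + cols - 1) // T)
    return same_tile_row and same_tile_col
```

A block for which this check fails is scattered in the sense of the definition above.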

Proposition 2.3

Let $A_{M}$ be an aligned sub-matrix of a Morton-hybrid matrix $M$ . The sub-matrix at the base case of the recursive division, down until $T^{\prime}\times T^{\prime}$ sub-matrices, of $A_{M}$ is a $T^{\prime}\times T^{\prime}$ row-major sub-matrix of $M$ .

Proof: First, we claim the recursive division of $A_{M}$ gives 4 aligned sub-matrices. From the definition of aligned sub-matrices, $A_{M}$ has size $2^{a}\cdot T\times 2^{b}\cdot T$ . So, the division of each of the dimensions of $A_{M}$ by 2 results in four quadrants $NW$ , $NE$ , $SW$ , and $SE$ of $A_{M}$ of size $2^{(a-1)}\cdot T\times 2^{(b-1)}\cdot T$ each. Thus these quadrants satisfy the size condition from the definition of aligned matrices. Note that once any of the dimensions reaches size $T^{\prime}$ it is no longer divided, and the recursive division proceeds on the other dimension until that too becomes $T^{\prime}$ . It is the size condition of this same definition that leads to $T^{\prime}\times T^{\prime}$ sub-matrices at the base case of recursive division of the aligned sub-matrices decomposed from $A_{M}$ .

Now, recall, from Cor. 2.1, that the start index of $A_{M}$ is of the form $(k_{1}\cdot T,k_{2}\cdot T)$. Then, the start indices of the sub-matrices resulting from the sub-division of $A_{M}$ are $(k_{1}\cdot T,k_{2}\cdot T)$, $(k_{1}\cdot T,(k_{2}+2^{(b-1)})\cdot T)$, $((k_{1}+2^{(a-1)})\cdot T,k_{2}\cdot T)$, and $((k_{1}+2^{(a-1)})\cdot T,(k_{2}+2^{(b-1)})\cdot T)$ for the $NW$, $NE$, $SW$, and $SE$ quadrants of $A_{M}$ respectively. Thus the start indices of these quadrants satisfy the start index condition from the definition of aligned matrices. Combining these two claims, the four quadrants resulting from the sub-division of any aligned matrix $A_{M}$ are aligned: they satisfy both conditions from the definition of aligned matrices.

Second, we show that the aligned sub-matrices at the base case are row-major sub-matrices of $M$ . If the recursive division continues till $T^{\prime}\times T^{\prime}$ sub-matrices, we get $T^{\prime}\times T^{\prime}$ aligned sub-matrices. From Cor. 2.2, we know that these sub-matrices are row-major sub-matrices of $M$ . This concludes the proof.

Corollary 2.4

Any sub-matrix of the $T^{\prime}\times T^{\prime}$ sub-matrix reached at the base case of the recursive division of an aligned sub-matrix is contained.

Proof: According to Prop. 2.3, the $T^{\prime}\times T^{\prime}$ sub-matrix at the base case of the recursive division of an aligned sub-matrix $A_{M}$ of a Morton-hybrid matrix $M$ is in row-major layout. Hence, any sub-matrix of this $T^{\prime}\times T^{\prime}$ base case sub-matrix lies entirely within a row-major sub-matrix of $M$ and is therefore contained.

In Fig. 7, $C_{M}$ is one of the sub-matrices at the base case of the recursive division of the aligned sub-matrix $A_{M}$ and is a row-major sub-matrix. Any sub-matrix of $C_{M}$ is contained. When non-aligned sub-matrices are recursively divided, the sub-matrix at the base case may not consist entirely of a row-major sub-matrix of the Morton-hybrid matrix. It may be scattered across more than one row-major sub-matrix. For example, in Fig. 7, the sub-matrix $S_{M}$ is a sub-matrix at the base case of the recursion for the non-aligned sub-matrix $N_{M}$ in red. $S_{M}$ spans four row-major ordered sub-matrices - hence, it is scattered. We know that the elements of the sub-matrices at the base case are to be traversed in a row-major or column-major order, as required for the base case of MM. With such traversal imposed, a scattered sub-matrix suffers from two issues:

$P_{1}$: Elements of a scattered sub-matrix are not sufficiently close in memory to maintain good spatial locality when traversed in a row/column-major fashion. This results in worse memory performance than for contained sub-matrices.

$P_{2}$: Morton-hybrid encoding is required for accessing each element within a scattered sub-matrix (thus incurring extra computation overhead compared to row-major offset calculation).

Proposition 2.5

The loss in locality described by $P_{1}$ and $P_{2}$ applies to scattered sub-matrices but not to contained sub-matrices.

Proof: We first consider $P_{1}$ . Recall that the traversal of entries at the base case of the recursion is done in two orders: row-major and column-major. For contained sub-matrices, when consecutively accessing any two entries in any of these two orders, the minimum jump in address space is 1 and the maximum is $T^{\prime}$ as all entries lie within one row-major sub-matrix of the Morton-hybrid matrix. A scattered sub-matrix spans more than one row-major sub-matrix of the matrix. These row-major sub-matrices are not necessarily consecutive in memory and traversing, in a row-major or column-major fashion, the scattered sub-matrix that spans these row-major sub-matrices results in jumps in address space. When consecutively accessing any two entries of a scattered sub-matrix, the minimum jump in address space is 1 if the two entries being accessed consecutively belong to the same row-major sub-matrix and the maximum is $k\cdot T^{2}+T-1$ for some positive integer $k$ , if the two entries belong to different row-major sub-matrices.

We now consider $P_{2}$ . Because the base case sub-matrix of an aligned sub-matrix is part of a row-major ordered sub-matrix, offset calculation for the elements at the base case is fast: traditional row-major offset calculation is used. Index $z$ of an element at offset $(i,j)$ from the start index $\sigma$ of the sub-matrix at the base case is given by $z=\sigma+i*T+j$ , since the sub-matrix satisfies a row-major ordering with row length = $T^{\prime}$ . This can be seen for the contained sub-matrix $C_{M}$ shown in Fig. 7 where $\sigma=112$ . As for a non-aligned sub-matrix, accessing any element ( $i,j$ ) in any of the base case sub-matrices requires that the corresponding Morton-hybrid index be calculated. This incurs extra calculation overhead as the encoding of the Morton-hybrid index is more costly than calculating an offset within a row-major ordered sub-matrix.
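The address jumps behind $P_{1}$ can be observed on a small example. The sketch below builds the address table of a Morton-hybrid matrix by recursion (quadrants in the order NW, NE, SW, SE, with row-major tiles at the threshold) and measures the largest jump between consecutively visited entries in a row-major traversal of a sub-matrix. The function names, the 0-based indexing, and the $16\times 16$ matrix with `T = 4` are illustrative assumptions, not the authors' implementation.

```python
def morton_hybrid_addresses(n, T):
    """Address of every entry (i, j) of an n x n matrix, with n = 2^m * T."""
    addr = [[0] * n for _ in range(n)]

    def fill(i0, j0, size, base):
        if size == T:
            # Threshold reached: the tile is stored in row-major order.
            for i in range(T):
                for j in range(T):
                    addr[i0 + i][j0 + j] = base + i * T + j
            return
        h = size // 2
        q = h * h  # number of elements per quadrant
        fill(i0, j0, h, base)                  # NW
        fill(i0, j0 + h, h, base + q)          # NE
        fill(i0 + h, j0, h, base + 2 * q)      # SW
        fill(i0 + h, j0 + h, h, base + 3 * q)  # SE

    fill(0, 0, n, 0)
    return addr


def max_jump(addr, i0, j0, rows, cols):
    """Largest address difference between consecutive entries (row-major traversal)."""
    visited = [addr[i0 + i][j0 + j] for i in range(rows) for j in range(cols)]
    return max(abs(b - a) for a, b in zip(visited, visited[1:]))


addr = morton_hybrid_addresses(16, 4)
contained = max_jump(addr, 5, 13, 2, 2)  # lies inside a single 4 x 4 tile
scattered = max_jump(addr, 2, 1, 3, 4)   # straddles four tiles
```

In this setting `contained` evaluates to 3, which stays below `T`, while `scattered` evaluates to 13, since the traversal repeatedly crosses from one tile into another.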

2.2 Modified Non-Aligned Sub-Matrix Multiplication

We aim to improve the sub-matrix multiplication procedure by addressing issues $P_{1}$ and $P_{2}$. In this section, we describe a recursive sub-matrix multiplication algorithm which ensures that the sub-matrices at the base case of the recursion are contained in a row-major ordered sub-matrix of the original matrix. By doing this, we reduce the range of addresses of the elements within the sub-matrices at the base case as well as the number of jumps in address space done at the base case, and we eliminate the need for Morton-hybrid encoding at the base case. To ensure that the sub-matrix at the base case of MM is contained, by Prop. 2.3, the recursive division within the algorithm must start on aligned matrices. Recall the random matrices $A$, $B$, and $C$ in Morton-hybrid order and of dimensions $2^{m}\times 2^{m}$, and $S_{A}$, $S_{B}$, and $S_{C}$ the random sub-matrices of $A$, $B$, and $C$ respectively (Fig. 7). We wish to perform the multiplication $S_{A}=S_{B}\cdot S_{C}$ efficiently. We could recursively divide $S_{A}$, $S_{B}$, and $S_{C}$, as in the default MM algorithm, but this may result in scattered sub-matrices at the base case since $S_{A}$, $S_{B}$, and $S_{C}$ may not be aligned. Instead, we will recursively divide $A$, $B$, and $C$ and address only the relevant sub-matrix multiplications that ought to be done to produce $S_{A}=S_{B}\cdot S_{C}$. As $A$, $B$, and $C$ are aligned, recursively dividing them will enforce row-major sub-matrices at the base case from which we extract the relevant parts to produce $S_{A}$.
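Once all three operands at the base case are contained, the multiplication needs only row-major offsets of the form $\sigma+i\cdot T+j$. The following is a minimal sketch under stated assumptions: the matrices are flat Python lists in Morton-hybrid order, the field is taken to be a prime field $\mathbb{Z}_{p}$, and the function name and argument list are hypothetical rather than the authors' interface.

```python
def base_case_mm(A, B, C, sigma_a, sigma_b, sigma_c, r, n, c, T, p):
    """Accumulate S_A += S_B * S_C (mod p) for contained sub-matrices.

    sigma_* : flat index of the first entry of each contained sub-matrix
    r x n   : dimensions of S_B;  n x c : dimensions of S_C
    T       : row length of the row-major tiles (threshold dimension)
    """
    for i in range(r):
        for k in range(n):
            b = B[sigma_b + i * T + k]
            for j in range(c):
                # Plain row-major offsets: no Morton-hybrid encoding needed.
                za = sigma_a + i * T + j
                A[za] = (A[za] + b * C[sigma_c + k * T + j]) % p


# Usage on single 2 x 2 tiles over Z_7.
A, B, C = [0, 0, 0, 0], [1, 2, 3, 4], [5, 6, 0, 1]
base_case_mm(A, B, C, 0, 0, 0, 2, 2, 2, 2, 7)
```

The loop order keeps the innermost index `j` running along a tile row, so consecutive accesses to `A` and `C` are adjacent in memory.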

Let $k$ be a superscript denoting a recursive step of the proposed MM algorithm. Also, let $t$, $u$, and $v$ denote subscripts in $\{0,1,2,3\}$ (of sub-matrices of $A$, $B$, and $C$ respectively), indicating a specific quadrant following the Morton order (Z-order): $NW=0$, $NE=1$, $SW=2$, and $SE=3$. For $k=0$, $A^{0}_{0}=A$, $B^{0}_{0}=B$, and $C^{0}_{0}=C$. Denote by $S_{A^{k}_{t}}$, $S_{B^{k}_{u}}$, and $S_{C^{k}_{v}}$ the respective sub-matrices of $A^{k}_{t}$, $B^{k}_{u}$, and $C^{k}_{v}$ being multiplied as part of the overall multiplication $S_{A}=S_{B}\cdot S_{C}$. As such the initial problem is to produce $S_{A^{0}_{0}}=S_{B^{0}_{0}}\cdot S_{C^{0}_{0}}$. For this, we first produce the quadrants $A^{1}_{t^{\prime}}$ of $A^{0}_{0}$, such that $A^{1}_{t^{\prime}}\in\{NW_{A^{0}_{0}},NE_{A^{0}_{0}},SW_{A^{0}_{0}},SE_{A^{0}_{0}}\}$ for $t^{\prime}\in\{0,1,2,3\}$. We do the same for $B^{0}_{0}$ and $C^{0}_{0}$, producing $B^{1}_{u^{\prime}}$ and $C^{1}_{v^{\prime}}$ respectively for $u^{\prime},v^{\prime}\in\{0,1,2,3\}$. For each $A^{1}_{t^{\prime}}$, we produce the sub-matrix $S^{\prime}_{A^{1}_{t^{\prime}}}$, defined as the part of $S_{A^{0}_{0}}$ that lies in $A^{1}_{t^{\prime}}$. Similarly, we produce $S^{\prime}_{B^{1}_{u^{\prime}}}$ and $S^{\prime}_{C^{1}_{v^{\prime}}}$. Note that $S_{A^{0}_{0}}$ is the two-dimensional concatenation of $\{S^{\prime}_{A^{1}_{t^{\prime}}}\}$ for $t^{\prime}\in\{0,1,2,3\}$ and hence to calculate $S_{A^{0}_{0}}$ we need to calculate $S^{\prime}_{A^{1}_{t^{\prime}}}$ for $t^{\prime}\in\{0,1,2,3\}$. To do this, we need to consider all combinations $\Gamma$ of the form $\Gamma_{t^{\prime},u^{\prime},v^{\prime}}=\{A^{1}_{t^{\prime}},B^{1}_{u^{\prime}},C^{1}_{v^{\prime}}\}$ necessary to produce $S_{A^{0}_{0}}$, as will be justified below.
Now, when considering a combination $\Gamma_{t^{\prime},u^{\prime},v^{\prime}}=\{A^{1}_{t^{\prime}},B^{1}_{u^{\prime}},C^{1}_{v^{\prime}}\}$ , if the sub-matrices $S^{\prime}_{A^{1}_{t^{\prime}}}$ , $S^{\prime}_{B^{1}_{u^{\prime}}}$ , and $S^{\prime}_{C^{1}_{v^{\prime}}}$ are compatible for multiplication, i.e. the multiplication $S^{\prime}_{A^{1}_{t^{\prime}}}+=S^{\prime}_{B^{1}_{u^{\prime}}}\cdot S^{\prime}_{C^{1}_{v^{\prime}}}$ is part of the overall multiplication $S_{A^{0}_{0}}+=S_{B^{0}_{0}}\cdot S_{C^{0}_{0}}$ , then a recursive call is made on $S^{\prime}_{A^{1}_{t^{\prime}}}$ , $S^{\prime}_{B^{1}_{u^{\prime}}}$ , and $S^{\prime}_{C^{1}_{v^{\prime}}}$ . Else, if $S^{\prime}_{A^{1}_{t^{\prime}}}$ , $S^{\prime}_{B^{1}_{u^{\prime}}}$ , and $S^{\prime}_{C^{1}_{v^{\prime}}}$ are not compatible, we extract compatible parts of these sub-matrices and we label them as $S_{A^{1}_{t^{\prime}}}$ , $S_{B^{1}_{u^{\prime}}}$ , and $S_{C^{1}_{v^{\prime}}}$ on which the multiplication proceeds recursively. After doing this for all combinations $\Gamma_{t^{\prime},u^{\prime},v^{\prime}}$ for $t^{\prime},u^{\prime},v^{\prime}\in\{0,1,2,3\}$ , we would have calculated $S_{A^{0}_{0}}$ .

We now describe the general $k$’th recursive step of Morton-hybrid MM, which consists of a round of four substeps. For simplicity, we drop the subscripts $t$, $u$ and $v$ of $A^{k}_{t}$, $B^{k}_{u}$, and $C^{k}_{v}$, and we use $M$ to denote any of the matrices $A$, $B$, or $C$. Each aligned $M^{k}$ is identified by two values:

$\alpha_{M^{k}}$: the Morton-hybrid index of the first element in the aligned matrix $M^{k}$

$\lambda_{M^{k}}$: the number of elements in the aligned sub-matrix $M^{k}$

We are also given the sub-matrices $S_{A^{k}}$ of $A^{k}$ , $S_{B^{k}}$ of $B^{k}$ , and $S_{C^{k}}$ of $C^{k}$ on which we wish to perform the multiplication. Each of the sub-matrices $S_{M^{k}}$ is identified by the following:

$\sigma_{S_{M^{k}}}$: the Morton-hybrid index of the first entry of $S_{M^{k}}$

$r_{S_{M^{k}}}$: the number of rows of $S_{M^{k}}$

$c_{S_{M^{k}}}$: the number of columns of $S_{M^{k}}$

We do not use the 4-tuple $(M,\sigma,r,c)$ to identify the aligned sub-matrices $M^{k}$ because the 3-tuple $(M,\alpha,\lambda)$ simplifies the computations for identification of the quadrants of $M^{k}$ and incorporates the information from the 4-tuple where $\alpha=\sigma$ and $\lambda=r\times c$ .

Step 1: In this step, we need to identify all four aligned quadrants $M^{k+1}_{t}$, $t\in\{0,1,2,3\}$, of the aligned $M^{k}$, for $k$ not reaching the base case, to proceed with the recursive multiplication algorithm. The index $t$ is dropped from $M^{k}$ for simplicity. To do this, we identify the start index $\alpha_{M^{k+1}_{t}}$ and size $\lambda_{M^{k+1}_{t}}$ of each quadrant $M^{k+1}_{t}$ of $M^{k}$. Because $M^{k}$ is divided into four quadrants of equal size, the number of elements $\lambda_{M^{k+1}_{t}}$ in any quadrant is given by $\lambda^{\prime}=\lambda_{M^{k+1}_{t}}=\lambda_{M^{k}}/4$. Recall that in the Morton-hybrid order, the quadrants not reaching the base case are stored according to the Morton layout. For the Morton order, the quadrants of $M^{k}$ are laid out in the order $NW_{M^{k}}$, $NE_{M^{k}}$, $SW_{M^{k}}$, then $SE_{M^{k}}$, and hence

• $\alpha_{NW_{M^{k}}}=\alpha_{M^{k}}$

• $\alpha_{NE_{M^{k}}}=\alpha_{M^{k}}+\lambda^{\prime}$

• $\alpha_{SW_{M^{k}}}=\alpha_{M^{k}}+2\lambda^{\prime}$

• $\alpha_{SE_{M^{k}}}=\alpha_{M^{k}}+3\lambda^{\prime}$
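The quadrant identification of Step 1 amounts to a few integer operations on the pair $(\alpha,\lambda)$. A minimal sketch, with a hypothetical function name and a dictionary return type chosen only for readability:

```python
def quadrants(alpha: int, lam: int) -> dict:
    """Start index and size (alpha, lambda') of the four quadrants of M^k.

    alpha : Morton-hybrid index of the first element of M^k
    lam   : number of elements of M^k (assumed divisible by 4)
    """
    assert lam % 4 == 0, "M^k must split into four equal quadrants"
    q = lam // 4  # lambda' in the text
    return {
        "NW": (alpha, q),
        "NE": (alpha + q, q),
        "SW": (alpha + 2 * q, q),
        "SE": (alpha + 3 * q, q),
    }
```

For an $8\times 8$ matrix starting at index 0, `quadrants(0, 64)` places the quadrants at indices 0, 16, 32 and 48, each holding 16 elements.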

The sub-matrix $S_{M^{k}}$ may not lie entirely within one quadrant of $M^{k}$ and hence all quadrants $M^{k+1}_{t}$ of $M^{k}$ which contain part of $S_{M^{k}}$ must be considered, which is the case in the example from Fig. 7 as $S_{A^{k}}$ , $S_{B^{k}}$ , and $S_{C^{k}}$ touch on all four quadrants of $A^{k}$ , $B^{k}$ , and $C^{k}$ respectively. Given $S_{M^{k}}$ , we must now identify, for each $M^{k+1}_{t}$ , the part of $S_{M^{k}}$ that lies within $M^{k+1}_{t}$ . We denote this sub-matrix by $S^{\prime}_{M^{k+1}_{t}}$ . The method to identify $S^{\prime}_{M^{k+1}_{t}}$ now follows.

Step 2: Recall that we are given $M^{k}$ and $S_{M^{k}}$ as input into the recursion. As $S_{M^{k}}$ may not lie entirely within one quadrant of $M^{k}$ , it is scattered, and we must identify the parts of $S_{M^{k}}$ which lie in $M^{k+1}$ denoted by $S^{\prime}_{M^{k+1}}$ . We have identified the quadrants $M^{k+1}_{t}$ , for $t\in\{0,1,2,3\}$ , and now we will identify the part of $S_{M^{k}}$ that lies within each $M^{k+1}_{t}$ , denoted by $S^{\prime}_{M^{k+1}_{t}}$ . Then $S_{M^{k}}$ is the two-dimensional concatenation of $\{S^{\prime}_{M^{k+1}_{t}}\}$ for $t\in\{0,1,2,3\}$ . Here we drop the index $t$ for simplicity. To identify $S^{\prime}_{M^{k+1}}$ , we need to identify its start index $\sigma_{S^{\prime}_{M^{k+1}}}$ and dimensions $r_{S^{\prime}_{M^{k+1}}}\times c_{S^{\prime}_{M^{k+1}}}$ . To do this, the following intermediate values are needed. For simplicity, the indices of the intermediate values denoting dependence on $M^{k}$ are omitted.

$r_{\mathcal{N}}$: The number of rows of $S_{M^{k}}$ in $\mathcal{N}$, the northern half of $M^{k}$.

$c_{\mathcal{W}}$: The number of columns of $S_{M^{k}}$ in $\mathcal{W}$, the western half of $M^{k}$.

$r_{\mathcal{S}}$: The number of rows of $S_{M^{k}}$ in $\mathcal{S}$, the southern half of $M^{k}$.

$c_{\mathcal{E}}$: The number of columns of $S_{M^{k}}$ in $\mathcal{E}$, the eastern half of $M^{k}$.

$e_{NW_{M^{k}}}$: The Morton-hybrid index of the last entry of $NW_{M^{k}}$. Similarly for $NE_{M^{k}}$, $SW_{M^{k}}$, and $SE_{M^{k}}$.

$encode(i,j)$: Given an entry $e$ of Cartesian index $(i,j)$, $encode(i,j)$ returns the Morton-hybrid index of $e$.

$extract\_i(z)$: Given an entry $e$ of Morton-hybrid index $z$, $extract\_i(z)$ returns the coordinate $i$ of the Cartesian index $(i,j)$ of $e$.

$extract\_j(z)$: Given an entry $e$ of Morton-hybrid index $z$, $extract\_j(z)$ returns the coordinate $j$ of the Cartesian index $(i,j)$ of $e$.
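One possible realisation of the three conversion routines is sketched below. It assumes 0-based indices, tiles of size $T\times T$ ranked in Z-order (NW, NE, SW, SE) and stored row-major internally; the extra parameter `T` and the helper names `_tile_rank` and `_tile_coords` are illustrative additions, and the bit-by-bit loops favour clarity over the optimised bit manipulations analysed in [1].

```python
def _tile_rank(bi: int, bj: int) -> int:
    """Z-order rank of tile (bi, bj): column bits go to even positions, row bits to odd."""
    z, bit = 0, 0
    while (bi >> bit) or (bj >> bit):
        z |= ((bj >> bit) & 1) << (2 * bit)
        z |= ((bi >> bit) & 1) << (2 * bit + 1)
        bit += 1
    return z


def _tile_coords(z: int):
    """Inverse of _tile_rank: recover the tile coordinates (bi, bj)."""
    bi = bj = bit = 0
    while z >> (2 * bit):
        bj |= ((z >> (2 * bit)) & 1) << bit
        bi |= ((z >> (2 * bit + 1)) & 1) << bit
        bit += 1
    return bi, bj


def encode(i: int, j: int, T: int) -> int:
    """Morton-hybrid index of the entry at Cartesian index (i, j)."""
    return _tile_rank(i // T, j // T) * T * T + (i % T) * T + (j % T)


def extract_i(z: int, T: int) -> int:
    """Row coordinate of the entry with Morton-hybrid index z."""
    tile, offset = divmod(z, T * T)
    return _tile_coords(tile)[0] * T + offset // T


def extract_j(z: int, T: int) -> int:
    """Column coordinate of the entry with Morton-hybrid index z."""
    tile, offset = divmod(z, T * T)
    return _tile_coords(tile)[1] * T + offset % T
```

With `T = 4`, `encode(1, 1, 4)` gives 5 and `encode(4, 12, 4)` gives 112, and `extract_i` and `extract_j` invert `encode` on every entry.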

The identification of $S^{\prime}_{M^{k+1}}$ is done as follows:

• For $NE_{M^{k}}$, calculate $e_{NE_{M^{k}}}$ as follows:

$e_{NE_{M^{k}}}=\alpha_{NE_{M^{k}}}+\lambda^{\prime}-1=\alpha_{M^{k}}+2\cdot\lambda^{\prime}-1$

and, for $SW_{M^{k}}$, calculate $e_{SW_{M^{k}}}$ using

$e_{SW_{M^{k}}}=\alpha_{SW_{M^{k}}}+\lambda^{\prime}-1=\alpha_{M^{k}}+3\cdot\lambda^{\prime}-1.$

• Find $r_{\mathcal{N}}$ using $r_{\mathcal{N}}=extract\_i(e_{NE_{M^{k}}})-extract\_i(\sigma_{S_{M^{k}}})+1$, i.e. $r_{\mathcal{N}}$ is one more than the difference between the row indices of the last entry of $NE_{M^{k}}$ and the first entry of $S_{M^{k}}$, and represents the number of rows of $S_{M^{k}}$ in the northern half of $M^{k}$. Similarly, we find

$c_{\mathcal{W}}=extract\_j(e_{SW_{M^{k}}})-extract\_j(\sigma_{S_{M^{k}}})+1,$

the number of columns of $S_{M^{k}}$ in the western half of $M^{k}$. Note that if $r_{\mathcal{N}}\leq 0$ then no part of $S_{M^{k}}$ lies in the north half of $M^{k}$ and if $c_{\mathcal{W}}\leq 0$ then no part of $S_{M^{k}}$ lies in the west half of $M^{k}$. After finding $r_{\mathcal{N}}$ and $c_{\mathcal{W}}$, we can find $r_{\mathcal{S}}$ and $c_{\mathcal{E}}$ using $r_{\mathcal{S}}=r_{S_{M^{k}}}-r_{\mathcal{N}}$ and $c_{\mathcal{E}}=c_{S_{M^{k}}}-c_{\mathcal{W}}$, which are the remaining rows and columns of $S_{M^{k}}$ respectively.

• So far, we have found the number of rows of $S_{M^{k}}$ in the northern and southern halves of $M^{k}$ and the number of columns of $S_{M^{k}}$ in the western and eastern halves of $M^{k}$, and we want to identify $S^{\prime}_{NW_{M^{k}}}$, $S^{\prime}_{NE_{M^{k}}}$, $S^{\prime}_{SW_{M^{k}}}$ and $S^{\prime}_{SE_{M^{k}}}$ for each $M^{k}\in\{A^{k},B^{k},C^{k}\}$. Recall that we are able to identify a sub-matrix by a 4-tuple $(M,\sigma,r,c)$, where $\sigma$ is the Morton-hybrid index of the first entry of the sub-matrix and $r$ and $c$ are its row and column dimensions respectively. Let $(i_{S_{M^{k}}},j_{S_{M^{k}}})$ denote the Cartesian index of the first entry of $S_{M^{k}}$ found using $i_{S_{M^{k}}}=extract\_i(\sigma_{S_{M^{k}}})$ and $j_{S_{M^{k}}}=extract\_j(\sigma_{S_{M^{k}}})$. We now identify $S^{\prime}_{NW_{M^{k}}}$, $S^{\prime}_{NE_{M^{k}}}$, $S^{\prime}_{SW_{M^{k}}}$ and $S^{\prime}_{SE_{M^{k}}}$ according to the following cases:

1. For $S^{\prime}_{NW_{M^{k}}}$, $\sigma_{S^{\prime}_{NW_{M^{k}}}}=\sigma_{S_{M^{k}}}$, $r_{S^{\prime}_{NW_{M^{k}}}}=r_{\mathcal{N}}$, and $c_{S^{\prime}_{NW_{M^{k}}}}=c_{\mathcal{W}}$.

2. For $S^{\prime}_{NE_{M^{k}}}$, $\sigma_{S^{\prime}_{NE_{M^{k}}}}=encode(i_{S_{M^{k}}},j_{S_{M^{k}}}+c_{\mathcal{W}})$, $r_{S^{\prime}_{NE_{M^{k}}}}=r_{\mathcal{N}}$, and $c_{S^{\prime}_{NE_{M^{k}}}}=c_{\mathcal{E}}$.

3. For $S^{\prime}_{SW_{M^{k}}}$, $\sigma_{S^{\prime}_{SW_{M^{k}}}}=encode(i_{S_{M^{k}}}+r_{\mathcal{N}},j_{S_{M^{k}}})$, $r_{S^{\prime}_{SW_{M^{k}}}}=r_{\mathcal{S}}$, and $c_{S^{\prime}_{SW_{M^{k}}}}=c_{\mathcal{W}}$.

4. For $S^{\prime}_{SE_{M^{k}}}$, $\sigma_{S^{\prime}_{SE_{M^{k}}}}=encode(i_{S_{M^{k}}}+r_{\mathcal{N}},j_{S_{M^{k}}}+c_{\mathcal{W}})$, $r_{S^{\prime}_{SE_{M^{k}}}}=r_{\mathcal{S}}$, and $c_{S^{\prime}_{SE_{M^{k}}}}=c_{\mathcal{E}}$.

To justify these cases we will explain how we arrived at case 2 for example where we identify the start index and dimensions of $S^{\prime}_{NE_{M^{k}}}$ as shown in Fig. 7. The rest follow similarly. Recall that $\sigma_{S_{M^{k}}}$ denotes the Morton-hybrid index of the first element of $S_{M^{k}}$ , and that $(i_{S_{M^{k}}},j_{S_{M^{k}}})$ is the corresponding Cartesian index. The index $(i_{S_{M^{k}}},j_{S_{M^{k}}}+c_{\mathcal{W}})$ is the Cartesian index of the first element in $S^{\prime}_{NE_{M^{k}}}$ . The corresponding Morton-hybrid index $\sigma_{S^{\prime}_{NE_{M^{k}}}}$ can be found using the function $encode(i_{S_{M^{k}}},j_{S_{M^{k}}}+c_{\mathcal{W}})$ . The dimensions of $S^{\prime}_{NE_{M^{k}}}$ are $r_{\mathcal{N}}\times c_{\mathcal{E}}$ .

Note that for each $S^{\prime}_{M^{k+1}}$, the Cartesian index of its start entry, whose Morton-hybrid index is $\sigma_{S^{\prime}_{M^{k+1}}}$, is given by $(i_{S_{M^{k}}}+\varphi_{r_{M}},j_{S_{M^{k}}}+\varphi_{c_{M}})$, for $\varphi_{r_{M}}\in\{0,r_{\mathcal{N}}\}$ and $\varphi_{c_{M}}\in\{0,c_{\mathcal{W}}\}$.
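The four cases above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the paper: the truncation size `T = 4`, the bit-interleaving convention inside `encode`, and the helper names `_spread`, `_squeeze` and `split_quadrants` are all assumptions introduced here. Only the names `encode`, `extract_i` and `extract_j` come from the text; their exact definitions are given in [1].

```python
T = 4  # hypothetical truncation size: row-major T x T blocks, Morton order above

def _spread(x):
    """Move bit k of x to bit 2k (helper assumed for this sketch)."""
    out, k = 0, 0
    while x:
        out |= (x & 1) << (2 * k)
        x >>= 1
        k += 1
    return out

def _squeeze(z):
    """Inverse of _spread: collect the bits found at even positions of z."""
    out, k = 0, 0
    while z:
        out |= (z & 1) << k
        z >>= 2
        k += 1
    return out

def encode(i, j):
    """Cartesian (i, j) -> Morton-hybrid index, under the assumed convention
    that row bits are interleaved above column bits at the block level."""
    block = (_spread(i // T) << 1) | _spread(j // T)
    return block * T * T + (i % T) * T + (j % T)

def extract_i(sigma):
    block, offset = divmod(sigma, T * T)
    return _squeeze(block >> 1) * T + offset // T

def extract_j(sigma):
    block, offset = divmod(sigma, T * T)
    return _squeeze(block) * T + offset % T

def split_quadrants(sigma, r_n, r_s, c_w, c_e):
    """Return (sigma, rows, cols) for the NW, NE, SW and SE parts of a
    sub-matrix starting at Morton-hybrid index sigma (cases 1 to 4)."""
    i, j = extract_i(sigma), extract_j(sigma)
    return {
        "NW": (sigma, r_n, c_w),
        "NE": (encode(i, j + c_w), r_n, c_e),
        "SW": (encode(i + r_n, j), r_s, c_w),
        "SE": (encode(i + r_n, j + c_w), r_s, c_e),
    }

# Example: a 5 x 6 sub-matrix starting at Cartesian (2, 3), split after
# 2 rows and 1 column.
parts = split_quadrants(encode(2, 3), r_n=2, r_s=3, c_w=1, c_e=5)
```

Only the start index changes between the four parts; the row and column counts are copied directly from the split sizes found in the previous step.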

Step 3: By now we have decomposed each $M^{k}$ into quadrants, and we have identified, for each quadrant $M^{k+1}_{t}$, the part of $S_{M^{k}}$ within that quadrant, denoted by $S^{\prime}_{M^{k+1}_{t}}$. The matrix $S_{M^{k}}$ is the two-dimensional concatenation of $\{S^{\prime}_{M^{k+1}_{t}}\}$ for $t\in\{0,1,2,3\}$. Next, we identify which quadrants $A^{k+1}_{t}$, $B^{k+1}_{u}$, and $C^{k+1}_{v}$ to consider for recursive multiplication. For each quadrant $A^{k+1}_{t}\in\{NW_{A^{k}},NE_{A^{k}},SW_{A^{k}},SE_{A^{k}}\}$ of $A^{k}$, we have identified $S^{\prime}_{A^{k+1}_{t}}$ (and likewise $S^{\prime}_{B^{k+1}_{u}}$ and $S^{\prime}_{C^{k+1}_{v}}$, for all $u,v\in\{0,1,2,3\}$). We need to perform the multiplications within $S_{B^{k}}$ and $S_{C^{k}}$ required to calculate $S^{\prime}_{A^{k+1}_{t}}$. As an example, examine $NW_{A^{k}}$ from Fig. 7. The sub-matrix $S^{\prime}_{NW_{A^{k}}}$ is given by the 4-tuple $(A,5,3,3)$, which we identify using Step 2 above. To calculate $S^{\prime}_{NW_{A^{k}}}=(A,5,3,3)$ of $S_{A^{k}}$, the sub-matrix from $S_{B^{k}}$ given by $(B,9,3,4)$ is to be multiplied by the sub-matrix from $S_{C^{k}}$ given by $(C,11,4,3)$. According to our approach, this is done so as to ensure that the sub-matrices being multiplied at the base case are contained, which improves locality and reduces conversion overhead as described earlier. Because the sub-matrices $(B,9,3,4)$ and $(C,11,4,3)$ of $S_{B^{k}}$ and $S_{C^{k}}$ touch on all four quadrants of $B^{k}$ and $C^{k}$, and we want to calculate $S^{\prime}_{NW_{A^{k}}}$, all quadrants of $B^{k}$ are to be considered for multiplication with all quadrants of $C^{k}$. These are given by the following sixteen combinations of quadrants from $B^{k}$ and $C^{k}$:

[TABLE]

All of these are needed to calculate $S^{\prime}_{NW_{A^{k}}}$. However, to calculate the sub-matrix $S_{A^{k}}$, we need to find $S^{\prime}_{NE_{A^{k}}}$, $S^{\prime}_{SW_{A^{k}}}$, and $S^{\prime}_{SE_{A^{k}}}$ in addition to $S^{\prime}_{NW_{A^{k}}}$, because $S_{A^{k}}$ is the two-dimensional concatenation of $\{S^{\prime}_{A^{k+1}_{t}}\}$ for $t\in\{0,1,2,3\}$. As above, determining each of these parts requires sixteen combinations of quadrants from $B^{k}$ and $C^{k}$. In total, to find $S_{A^{k}}$, we would need up to sixty-four combinations of quadrants from $A^{k}$, $B^{k}$, and $C^{k}$.
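The counting argument can be made concrete with a short enumeration. The sketch below only lists candidate quadrant triples; the names `QUADRANTS` and `candidate_combinations` are hypothetical, and the count is an upper bound, since combinations whose parts are empty or incompatible would contribute nothing.

```python
from itertools import product

QUADRANTS = ("NW", "NE", "SW", "SE")

def candidate_combinations():
    """Map each quadrant of A^k to the sixteen (B^k quadrant, C^k quadrant)
    pairs that may contribute to the part of S_A lying in that quadrant."""
    return {t: list(product(QUADRANTS, QUADRANTS)) for t in QUADRANTS}

combos = candidate_combinations()
total = sum(len(pairs) for pairs in combos.values())  # 4 * 16 candidates
```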

Step 4: For each combination, if $S^{\prime}_{A^{1}_{t^{\prime}}}$, $S^{\prime}_{B^{1}_{u^{\prime}}}$, and $S^{\prime}_{C^{1}_{v^{\prime}}}$ are not compatible, we extract compatible parts of these sub-matrices and label them as $S_{A^{1}_{t^{\prime}}}$, $S_{B^{1}_{u^{\prime}}}$, and $S_{C^{1}_{v^{\prime}}}$, on which the multiplication proceeds recursively. How to extract compatible parts is beyond the scope of the present manuscript and is left for future work. For now, we note that omitting it does not detract from the main rationale or the general understanding of the overall algorithm, and that the work requirements for this step can be embedded in that required to perform Steps 1–3 above.

Proposition 2.6

If auxiliary space is used to perform the matrix additions, and assuming the matrix is of dimensions at most $2^{\alpha}\times 2^{\alpha}$, where $\alpha$ is the machine word-size, the cache-oblivious MM using Morton-hybrid order requires asymptotically the same work and critical path length as default MM.

Proof: On work: The cache-oblivious algorithm is a divide-and-conquer algorithm. The divide phase introduces two new functions over the default MM algorithm, consisting of Steps 1 and 2 above. Each of these steps requires a constant number of arithmetic operations and calls to encoding and extraction procedures. From Sec. 3.5 of [1], we know that each encoding or extraction procedure incurs a constant number of operations assuming the matrix is of dimensions at most $2^{\alpha}\times 2^{\alpha}$, where $\alpha$ is the machine word-size. For the typical value $\alpha=64$, such matrix sizes are sufficiently large for many applications. It follows that the work of the cache-oblivious algorithm is asymptotically the work of the default algorithm, given by $\Theta(n^{3})$. The conquer part creates non-overlapping sub-problems in Steps 3 and 4 above whose union yields the original matrix to be multiplied.

On parallelism: All of the extra 64 recursive calls are independent and thus can be cast in parallel. If auxiliary space is available to perform the matrix additions required for each MM, one can also perform addition in parallel using the standard algorithm (Ch. 27 of [6]). Hence, the critical path length of the cache-oblivious algorithm remains that of the default multithreaded algorithm and is known to be $\Theta(\lg^{2}n)$ .
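For reference, the bounds quoted in the proof are the solutions of the standard recurrences for the default multithreaded MM with parallel addition, as given in Ch. 27 of [6]. The symbols $M_{1}$ (work) and $M_{\infty}$ (critical path length) follow that textbook and are not used elsewhere in this manuscript:

$$M_{1}(n)=8\,M_{1}(n/2)+\Theta(n^{2})=\Theta(n^{3}),\qquad M_{\infty}(n)=M_{\infty}(n/2)+\Theta(\lg n)=\Theta(\lg^{2}n).$$

Under the assumptions of Proposition 2.6, the encoding and extraction calls of Steps 1 and 2 add only a $\Theta(1)$ term to each divide phase, which leaves both solutions unchanged.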

Remarks on implications for Parallel Performance: The sub-matrices at the base case of the recursion are contained within a row-major sub-matrix, thanks to enforcing aligned sub-matrices for the recursive division. The Morton-hybrid, cache-oblivious version demonstrates superior performance over the default algorithm, and eliminates the need for Morton-hybrid index conversion when accessing each element in the sub-matrix at the base case, as it can proceed instead with row-major addressing. The implications for parallel performance can be captured using the results from [4], which reveal that nested parallel algorithms for which the natural sequential execution has low cache complexity will also attain good cache complexity on parallel machines with private or shared caches. In this framework, our adaptation combines improved temporal locality using the Morton-hybrid order for the serial algorithm as well as optimal work and critical path length for the multithreaded version.
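To illustrate why containment removes index conversion from the base case, the following sketch multiplies two contained sub-matrices over the binary field using plain row-major arithmetic. It is an assumption-laden illustration, not the kernel used in the paper: the function name `base_case_mm`, its argument order, and the use of flat Python lists of 0/1 values are all hypothetical, and each operand is assumed to lie entirely inside one $T\times T$ row-major block.

```python
def base_case_mm(A, B, C, sa, sb, sc, r, m, c, T):
    """Accumulate (r x m) * (m x c) into the r x c sub-matrix of A, over GF(2).

    sa, sb, sc are the flat positions of the first entries. Because every
    operand is assumed contained in a single T x T row-major block, entry
    (x, y) of an operand sits at start + x*T + y, with no encode/extract call.
    """
    for x in range(r):
        for y in range(c):
            acc = A[sa + x * T + y]
            for z in range(m):
                # Addition in GF(2) is XOR, multiplication is AND.
                acc ^= B[sb + x * T + z] & C[sc + z * T + y]
            A[sa + x * T + y] = acc

# Example with T = 4: multiply the 2 x 2 sub-matrix of B starting at
# in-block position (1, 1) by the 2 x 2 sub-matrix of C starting at (0, 0).
T = 4
A, B, C = [0] * 16, [0] * 16, [0] * 16
B[5], B[6], B[9], B[10] = 1, 1, 0, 1   # [[1, 1], [0, 1]]
C[0], C[1], C[4], C[5] = 1, 0, 1, 1    # [[1, 0], [1, 1]]
base_case_mm(A, B, C, sa=0, sb=5, sc=0, r=2, m=2, c=2, T=T)
```

The inner loops advance through memory with a fixed stride of `T`, which is the access pattern the aligned recursive division is designed to guarantee.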

Performance Analysis: We now verify that the cost of the increased recursive MM calls for the cache-oblivious sub-matrix multiplication is more than compensated for by the improvement in temporal locality thanks to the Morton-hybrid order. We use a 2.8 GHz Pentium IV with an 8 KB L1 cache and a 512 KB L2 cache, running Linux version 2.6.11 and gcc version 4.0.0. We generate random Morton-hybrid matrices and multiply random sub-matrices of these matrices using both the default and cache-oblivious algorithms. To neutralise the effect of modular arithmetic over finite fields and to account exclusively for the gains induced by the Morton-hybrid order, the random matrices we generate are taken over the binary field. According to [3], $T^{\prime}=32$ is the typical value for the truncation size for block recursive matrix algorithms of floating point entries that shows improvements in cache misses and cycles for Morton-hybrid, default MM. Recall the multiplication of rectangular sub-matrices $S_{A}=S_{B}\cdot S_{C}$, where $A$, $B$ and $C$ are square and in Morton-hybrid order. The dimensions of the square matrices are of no significance, since the multiplication kernel operates on the rectangular sub-matrices. We thus partition Morton-hybrid matrices of dimensions $N=2048$ and multiply sub-matrices of these Morton-hybrid matrices of varying sizes. Each experiment is distinguished by the indices $\sigma_{S_{M}}$ of the starting entries of each $S_{M}$ and by the dimensions $r_{S_{M}}$ and $c_{S_{M}}$. Because sizes vary across experiments, we do not report individual run-times but rather the percentage of increase, or decrease, in the number of base case calls made by the cache-oblivious over the default algorithm and the associated percentage of improvement in run-time.
We record the number of recursive MM calls made to the base case of each of the two algorithms and the total time taken by the overall multiplication to finish. The results are presented in Table 1. We interpret it using the fifth row as an arbitrary example. Of all 468 experiments run in total, about 9% exhibited an increase of about 34% in recursive calls made by the cache-oblivious over the default algorithm. The average, minimum, and maximum percentages of improvement in run-time across this batch of experiments (95.93%, 94.83%, and 96.61%, respectively) are shown thereafter, and all exceed 94%. Examining all rows, one can see that no matter what the increase in MM recursive calls has been, it hardly affects the high percentages of improvement. The reductions in cache misses as a result of the cache-oblivious algorithm overwhelm the cost of handling extra recursive calls.
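The quantities in Table 1 can be related to raw measurements as sketched below. The manuscript does not spell out the formulas, so the definitions here are assumptions: the increase in calls is taken relative to the default algorithm, and the improvement is taken as the relative reduction in run-time. The function names and the sample call counts are hypothetical.

```python
def pct_increase_in_calls(default_calls, co_calls):
    """Percentage increase in base case calls, relative to the default MM."""
    return 100.0 * (co_calls - default_calls) / default_calls

def pct_improvement(t_default, t_co):
    """Percentage reduction in run-time (assumed definition of improvement)."""
    return 100.0 * (t_default - t_co) / t_default

def speedup_factor(improvement_pct):
    """Run-time ratio t_default / t_co implied by a percentage improvement."""
    return 100.0 / (100.0 - improvement_pct)

# Hypothetical counts: 32 default calls against 43 cache-oblivious calls.
increase = pct_increase_in_calls(32, 43)
# A 96% reduction in run-time corresponds to a 25-fold speedup.
factor = speedup_factor(pct_improvement(100.0, 4.0))
```

Under these assumed definitions, an improvement in the mid-90% range corresponds to a run-time ratio of roughly 20 to 45.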

Bibliography


[1] F. K. Abu Salem and M. Al Arab. “Comparative study of space filling curves for cache oblivious TU Decomposition”, extended report, http://arxiv.org/abs/1612.06069
[2] M. D. Adams and D. S. Wise. “Fast additions on masked integers”, in SIGPLAN Not., 41(5):39–45, 2006.
[3] M. D. Adams and D. S. Wise. “Seven at one stroke: results from a cache-oblivious paradigm for scalable matrix algorithms”, in MSPC ’06, pp. 41–50, ACM Press, 2006.
[4] G. Blelloch, P. B. Gibbons, and H.-V. Simhadri. “Low depth cache-oblivious algorithms”, in SPAA 2010, pp. 189–199, ACM Press, 2010.
[5] N. Chen, N. Wang, and B. Shi. “A new algorithm for encoding and decoding the Hilbert order”, in Softw. Pract. Exper., 37(8):897–908, 2007.
[6] T. H. Cormen, C. E. Leiserson, R. L. Rivest and C. Stein. Introduction to Algorithms, 3rd edition, MIT Press.
[7] J.G. Dumas and J.L. Roche. “A parallel block algorithm for the exact triangulization of rectangular matrices”, in SPAA 2001, pp. 324–325, ACM Press, 2001.
[8] Jeremy D. Frens and David S. Wise. “QR factorization with Morton-ordered quadtree matrices for memory re-use and parallelism”, in PPoPP ’03, pp. 144–154, ACM Press, 2003.