Strassen's Algorithm for Tensor Contraction

Jianyu Huang; Devin A. Matthews; Robert A. van de Geijn

arXiv:1704.03092·cs.MS·April 12, 2017

Strassen's Algorithm for Tensor Contraction

Jianyu Huang, Devin A. Matthews, Robert A. van de Geijn

PDF

Open Access

TL;DR

This paper extends Strassen's algorithm to tensor contraction, a multi-dimensional generalization of matrix multiplication, demonstrating practical speedups by optimizing tensor layout and avoiding transpositions.

Contribution

First practical demonstration of accelerating tensor contraction with Strassen's algorithm using a novel tensor layout and optimized implementation techniques.

Findings

01

Achieved up to 1.3x speedup on modern architectures

02

Introduced a Block-Scatter-Matrix tensor layout for efficient computation

03

Avoided explicit transpositions and extra workspace in implementation

Abstract

Tensor contraction (TC) is an important computational kernel widely used in numerous applications. It is a multi-dimensional generalization of matrix multiplication (GEMM). While Strassen's algorithm for GEMM is well studied in theory and practice, extending it to accelerate TC has not been previously pursued. Thus, we believe this to be the first paper to demonstrate how one can in practice speed up tensor contraction with Strassen's algorithm. By adopting a Block-Scatter-Matrix format, a novel matrix-centric tensor layout, we can conceptually view TC as GEMM for a general stride storage, with an implicit tensor-to-matrix transformation. This insight enables us to tailor a recent state-of-the-art implementation of Strassen's algorithm to TC, avoiding explicit transpositions (permutations) and extra workspace, and reducing the overhead of memory movement that is incurred. Performance…

Equations16

X=\left(\begin{array}[]{c | c}X_{0}&X_{1}\\ \hline\cr X_{2}&X_{3}\end{array}\right)\quad\mbox{for $X\in\{A,B,C\}$}

X=\left(\begin{array}[]{c | c}X_{0}&X_{1}\\ \hline\cr X_{2}&X_{3}\end{array}\right)\quad\mbox{for $X\in\{A,B,C\}$}

{\begin{array}[]{l @{\hspace{1pt}} c @{\hspace{1pt}} l l r}M_{0}\hfil\hskip 1.0&=\hfil\hskip 1.0&(A_{0}+A_{3})(B_{0}+B_{3});&C_{0}+\!\!=M_{0};C_{3}+\!\!=M_{0};\\ M_{1}\hfil\hskip 1.0&=\hfil\hskip 1.0&(A_{2}+A_{3})B_{0};&C_{2}+\!\!=M_{1};C_{3}-\!\!=M_{1};\\ M_{2}\hfil\hskip 1.0&=\hfil\hskip 1.0&A_{0}(B_{1}-B_{3});&C_{1}+\!\!=M_{2};C_{3}+\!\!=M_{2};\\ M_{3}\hfil\hskip 1.0&=\hfil\hskip 1.0&A_{3}(B_{2}-B_{0});&C_{0}+\!\!=M_{3};C_{2}+\!\!=M_{3};\\ M_{4}\hfil\hskip 1.0&=\hfil\hskip 1.0&(A_{0}+A_{1})B_{3};&C_{1}+\!\!=M_{4};C_{0}-\!\!=M_{4};\\ M_{5}\hfil\hskip 1.0&=\hfil\hskip 1.0&(A_{2}-A_{0})(B_{0}+B_{1});&C_{3}+\!\!=M_{5};\\ M_{6}\hfil\hskip 1.0&=\hfil\hskip 1.0&(A_{1}-A_{3})(B_{2}+B_{3});&C_{0}+\!\!=M_{6};\end{array}}

{\begin{array}[]{l @{\hspace{1pt}} c @{\hspace{1pt}} l l r}M_{0}\hfil\hskip 1.0&=\hfil\hskip 1.0&(A_{0}+A_{3})(B_{0}+B_{3});&C_{0}+\!\!=M_{0};C_{3}+\!\!=M_{0};\\ M_{1}\hfil\hskip 1.0&=\hfil\hskip 1.0&(A_{2}+A_{3})B_{0};&C_{2}+\!\!=M_{1};C_{3}-\!\!=M_{1};\\ M_{2}\hfil\hskip 1.0&=\hfil\hskip 1.0&A_{0}(B_{1}-B_{3});&C_{1}+\!\!=M_{2};C_{3}+\!\!=M_{2};\\ M_{3}\hfil\hskip 1.0&=\hfil\hskip 1.0&A_{3}(B_{2}-B_{0});&C_{0}+\!\!=M_{3};C_{2}+\!\!=M_{3};\\ M_{4}\hfil\hskip 1.0&=\hfil\hskip 1.0&(A_{0}+A_{1})B_{3};&C_{1}+\!\!=M_{4};C_{0}-\!\!=M_{4};\\ M_{5}\hfil\hskip 1.0&=\hfil\hskip 1.0&(A_{2}-A_{0})(B_{0}+B_{1});&C_{3}+\!\!=M_{5};\\ M_{6}\hfil\hskip 1.0&=\hfil\hskip 1.0&(A_{1}-A_{3})(B_{2}+B_{3});&C_{0}+\!\!=M_{6};\end{array}}

C_{π_{C} (I_{m} J_{n})} : =

C_{π_{C} (I_{m} J_{n})} : =

L O C_{g s t e n} (T_{T_{d}}, S_{T}) = k = 0 \sum d - 1 t_{k} \cdot s_{A; k} .

L O C_{g s t e n} (T_{T_{d}}, S_{T}) = k = 0 \sum d - 1 t_{k} \cdot s_{A; k} .

L O C_{sc a t ma t} (M_{i, j}, r sc a t_{T}, csc a t_{T}) = r sc a t_{T; i} + csc a t_{T; j},

L O C_{sc a t ma t} (M_{i, j}, r sc a t_{T}, csc a t_{T}) = r sc a t_{T; i} + csc a t_{T; j},

r sc a t_{T; i} =

r sc a t_{T; i} =

M = α (X + δ Y) (V + ϵ W); C + = γ_{0} M; D + = γ_{1} M;

M = α (X + δ Y) (V + ϵ W); C + = γ_{0} M; D + = γ_{1} M;

Z

Z

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsTensor decomposition and applications · Parallel Computing and Optimization Techniques · Algorithms and Data Compression

Full text

\TPMargin

5pt

Strassen’s Algorithm for Tensor Contraction

FLAME Working Note #84

Jianyu Huang*†

Devin A. Matthews†

Robert A. van de Geijn*†

*Department of Computer Science

†Institute for Computational Engineering and Sciences

The University of Texas at Austin, Austin, TX 78712

{jianyu@cs., dmatthews@, rvdg@cs.}utexas.edu

( April 3, 2017 )

Abstract

Tensor contraction (TC) is an important computational kernel widely used in numerous applications. It is a multi-dimensional generalization of matrix multiplication (GEMM). While Strassen’s algorithm for GEMM is well studied in theory and practice, extending it to accelerate TC has not been previously pursued. Thus, we believe this to be the first paper to demonstrate how one can in practice speed up tensor contraction with Strassen’s algorithm. By adopting a Block-Scatter-Matrix format, a novel matrix-centric tensor layout, we can conceptually view TC as GEMM for a general stride storage, with an implicit tensor-to-matrix transformation. This insight enables us to tailor a recent state-of-the-art implementation of Strassen’s algorithm to TC, avoiding explicit transpositions (permutations) and extra workspace, and reducing the overhead of memory movement that is incurred. Performance benefits are demonstrated with a performance model as well as in practice on modern single core, multicore, and distributed memory parallel architectures, achieving up to $1.3\times$ speedup. The resulting implementations can serve as a drop-in replacement for various applications with significant speedup.

1 Introduction

Standing on the shoulders of giants. This paper builds upon a number of recent developments: The GotoBLAS algorithm for matrix multiplication (GEMM) [1] that underlies the currently fastest implementations of GEMM for CPUs; The refactoring of the GotoBLAS algorithm as part of the BLAS-like Library Instantiation Software (BLIS) [2, 3], which exposes primitives for implementing BLAS-like operations; The systematic parallelization of the loops that BLIS exposes so that high-performance can be flexibly attained on multicore and many-core architectures [4]; The casting of tensor contraction (TC) in terms of the BLIS primitives [5, 6] without requiring the transposition (permutation) used by traditional implementations; The practical high-performance implementation of the classical Strassen’s algorithm (Strassen) [7] in terms of variants of the BLIS primitives; and the extension of this implementation [8] to a family of Strassen-like algorithms (Fast Matrix Multiplication algorithms) [9]. Together, these results facilitate what we believe to be the first extension of Strassen’s algorithm to TC.

Contributions. This paper describes how to extend Strassen’s algorithm to TC without the explicit transposition of data that inherently incurs significant memory movement and workspace overhead; It provides a performance model for the cost of the resulting family of algorithms; It details the practical implementation of these algorithms, including how to exploit variants of the primitives that underlie BLIS and a data layout to memory for the tensors; It demonstrates practical speedup on modern single core and multicore CPUs; It illustrates how the local Strassen’s TC algorithm improves performance of a simple distributed memory tensor contraction. Together, these results unlock a new frontier for the research and application of Strassen’s algorithm.

Related work. To the best of our knowledge, this work represents the first implementation of Strassen’s algorithm for tensor contraction. In the context of Strassen for matrices, there have been a variety of practical implementations [10, 11, 12, 9], including the closely related implementation of Strassen using the BLIS framework [7] which this paper is based on.

For tensor contraction, recent work on high-performance tensor contraction [5, 6] serves as the motivation and basis for our present work, while other research has focused on algorithms using tensor slicing [13, 14, 15, 16] or on improving the efficiency of the so-called ttdt algorithm for tensor contraction [17, 18, 19, 20], where input tensors $\bm{\mathcal{{A}}}$ and $\bm{\mathcal{{B}}}$ are Transposed (permuted) and then used in a standard dgemm algorithm, with the output then being Transposed and accumulated onto the tensor $\bm{\mathcal{{C}}}$ . ttdt could be used to construct a Strassen algorithm for TC by transposing subtensors into submatrices and vice versa and using a matrix implementation of Strassen instead of dgemm. However, we will show that this algorithm is essentially the same as our Naive Strassen algorithm, which is often less efficient than the other algorithms that we have implemented.

The gett algorithm [6] is a high-performance tensor contraction implementation similar in many ways to the BLIS-based implementation in [5]. As in [5], which our present work is based on, formation of linear combinations of input subtensors of $\bm{\mathcal{{A}}}$ and $\bm{\mathcal{{B}}}$ and output to multiple subtensors of $\bm{\mathcal{{C}}}$ could be fused with the internal tensor transposition and micro-kernel steps of gett. However, the implementation would be restricted to regular subtensors rather than more general submatrices, which could have possible negative performance implications.

2 Background

We briefly review how high-performance GEMM is implemented, before discussing the practical implementations of high-performance Strassen for GEMM.

2.1 High-performance GEMM

Let $A$ , $B$ , and $C$ be matrices of sizes $N_{i}\times N_{p}$ , $N_{p}\times N_{j}$ , and $N_{i}\times N_{j}$ , respectively. A general matrix-matrix multiplication (gemm) in the BLAS interface [21] is expressed as $C:=\alpha AB+\beta C$ . Written element-wise, $C_{i,j}=\alpha\sum_{p=0}^{N_{p}-1}A_{i,p}\cdot B_{p,j}+\beta C_{i,j}$ , where $\cdot$ denotes scalar multiplication, and $\alpha$ and $\beta$ are scalars. We focus on the special case $\alpha=1$ and $\beta=1$ henceforth for brevity.

A key insight underlying modern high-performance implementations of gemm is to organize the computations by partitioning the operands into blocks for temporal locality, and to pack (copy) such blocks into contiguous buffers that fit into various levels of memory for spatial locality. Figure 1(left) illustrates the GotoBLAS algorithm as implemented in BLIS. Cache blocking parameters $\{m_{C},n_{C},k_{C}\}$ determine the submatrix sizes of $B_{p}$ ( $k_{C}\times n_{C}$ ) and $A_{i}$ ( $m_{C}\times k_{C}$ ), such that they fit in various caches (we use the standard gemm dimensions $\{m,n,k\}$ in defining blocking parameters for brevity and consistency with [2], but note that the meaning of $\{m,n,k\}$ alone is changed in $\S$ 2.3). During the computation, row panels $B_{p}$ are contiguously packed into buffer $\widetilde{B}_{p}$ to fit in the L3 cache. Blocks $A_{i}$ are similarly packed into buffer $\widetilde{A}_{i}$ to fit in the L2 cache. Register block sizes $\{m_{R},n_{R}\}$ relate to submatrices in registers that contribute to $C$ . In the micro-kernel (the inner most loop), a small $m_{R}\times n_{R}$ micro-tile of $C$ is updated by pair of $m_{R}\times k_{C}$ and $k_{C}\times n_{R}$ slivers of $\widetilde{A}_{i}$ and $\widetilde{B}_{p}$ . The above parameters can be analytically chosen [22].

2.2 High-performance Strassen

If the three operands are partitioned into quadrants,

[TABLE]

then it can be checked that the operations

[TABLE]

compute $C:=AB+C$ , with seven instead of eight (sub)matrix multiplications, reducing the cost by a factor of $7/8$ (ignoring a lower order number of extra additions). If all matrices are square and of size $N\times N$ , theoretically this single step of Strassen can be applied recursively, resulting in the classical Strassen with a cost of $O(N^{2.801})$ .

In practice, only a few levels of the recursion are leveraged because the reduction in computations are quickly overwhelmed by the cost of extra additions and extra memory movements. Additionally, Strassen is known to experience degradation in numerical stability especially when more than two levels of recursion are incorporated [23, 24, 25].

Figure 1(right) illustrates the modifications done in [7] to make Strassen practical. During the packing process, the additions of the submatrices $A$ and $B$ can be incorporated into the packing buffers $\widetilde{A}_{i}$ and $\widetilde{B}_{p}$ , avoiding extra memory movement and reducing workspace requirements. In the micro-kernel, once a submatrix that contributes to $C$ is computed in machine registers, it can be directly added to the appropriate parts of multiple submatrices of $C$ , thus avoiding the need for temporary intermediate matrices $M_{i}$ , again avoiding extra memory movement. As demonstrated in [7], this approach makes Strassen practical for smaller matrices and matrices of special shape (importantly, for rank-k updates, where $N_{p}$ is relatively small comparing to $N_{i}$ and $N_{j}$ ). This research is pushed further [8] by revealing that Strassen performs relatively better than most other Strassen-like FMM algorithms with one or two levels of recursions, when modeled as well as in practice. For this reason, we do not extend those FMM algorithms to TC in this paper, although it may be worthwhile in future work to pursue certain of these algorithms for highly non-square tensor contraction shapes.

2.3 High-performance Tensor Contraction

The definition and notation of tensors and tensor contraction are briefly reviewed before describing the tensor layouts that enable high-performance tensor contraction.

Tensor. The concept of matrices is extended to multiple dimensions by defining a general $d$ -D tensor $\bm{\mathcal{{T}}}\in\mathbb{R}^{N_{\bm{\mathcal{{T}}};0}\times\dots\times N_{\bm{\mathcal{{T}}};d-1}}$ as a multidimensional array of scalar elements, where the length of the $k$ -th dimension is given by $N_{\bm{\mathcal{{T}}};k}\in\mathbb{N}$ . Individual elements are referenced by indexing $\bm{\mathcal{{T}}}$ by an ordered index bundle $T_{d}=\{t_{0},\ldots,t_{d-1}\}$ , such that $\bm{\mathcal{{T}}}_{T_{d}}\in\mathbb{R}$ for all $T_{d}\in N_{\bm{\mathcal{{T}}};0}\times\ldots\times N_{\bm{\mathcal{{T}}};d-1}=N_{t_{0}}\times\ldots\times N_{t_{d-1}}$ . In general we will denote the dimension of a tensor $\bm{\mathcal{{T}}}$ as $d_{\bm{\mathcal{{T}}}}$ , the index length $N_{x}\in\mathbb{N}$ as the length of the dimension that is indexed by some symbol $x$ , and the bundle length $N_{T_{d}}\in\mathbb{N}$ as the total length of a index bundle $T_{d}$ , i.e. $N_{T_{d}}=\prod_{t\in T_{d}}N_{t}=N_{t_{0}}\cdot\ldots\cdot N_{t_{d-1}}$ .

Tensor Contraction. Let $\bm{\mathcal{{A}}}$ , $\bm{\mathcal{{B}}}$ , and $\bm{\mathcal{{C}}}$ be general tensors of any dimensionality satisfying $d_{\bm{\mathcal{{A}}}}+d_{\bm{\mathcal{{B}}}}-d_{\bm{\mathcal{{C}}}}=2k,\;k\in\mathbb{N}$ . Then, let $I_{m}$ , $J_{n}$ , and $P_{k}$ be index bundles with $m=d_{\bm{\mathcal{{A}}}}-k$ and $n=d_{\bm{\mathcal{{B}}}}-k$ . Lastly, let the index reordering $\pi_{\bm{\mathcal{{A}}}}(a_{0},\ldots,a_{d_{\bm{\mathcal{{A}}}}-1})=\{a_{\pi_{\bm{\mathcal{{A}}}}(0)},\ldots,a_{\pi_{\bm{\mathcal{{A}}}}(d_{\bm{\mathcal{{A}}}}-1)}\}$ be defined by the bijective map $\pi_{\bm{\mathcal{{A}}}}\colon\{0,\ldots,d_{\bm{\mathcal{{A}}}}-1\}\to\{0,\ldots,d_{\bm{\mathcal{{A}}}}-1\}$ , and similarly for $\pi_{\bm{\mathcal{{B}}}}$ and $\pi_{\bm{\mathcal{{C}}}}$ . The general definition of tensor contraction is then given by,

[TABLE]

for scalars $\alpha,\,\beta\in\mathbb{R}$ . The indices in the bundles $I_{n}$ and $J_{m}$ are generally called free, external, or uncontracted indices, while the indices in the $P_{k}$ bundle are called bound, internal, or contracted indices. In the following we will assume that $\alpha=1$ and $\beta=1$ , and suppress the explicit summation over $P_{k}$ . The number of leading-order floating point operations required for tensor contraction is $2{{N_{I_{m}}}}{}\cdot{{N_{J_{n}}}}{}\cdot{{N_{P_{k}}}}{}$ = $2\prod_{i\in I_{m}}N_{i}\cdot\prod_{j\in J_{n}}N_{j}\cdot\prod_{p\in P_{k}}N_{p}$ . If the length of each dimension is $O(N)$ , the tensor contraction operation requires $O(N^{m+n+k})$ flops.

In Figure 2(a), the tensor contraction $\bm{\mathcal{{C}}}_{a,b,c}+\!\!=\bm{\mathcal{{A}}}_{d,c,a}\cdot\bm{\mathcal{{B}}}_{d,b}$ is illustrated. In the general notation this gives $I_{m}=\{a,c\}$ , $J_{m}=\{b\}$ , $P_{k}=\{d\}$ , $\pi_{\bm{\mathcal{{A}}}}(0,1,2)=\{2,1,0\}$ , $\pi_{\bm{\mathcal{{B}}}}(0,1)=\{0,1\}$ , and $\pi_{\bm{\mathcal{{C}}}}(0,1,2)=\{0,2,1\}$ . The number of floating point operations and memory accesses for this contraction is identical to that for a matrix multiplication of $(N_{a}\cdot N_{c})\times N_{d}$ , $N_{d}\times N_{b}$ , and $(N_{a}\cdot N_{c})\times N_{b}$ matrices.

General stride layouts. The well-known column-major and row-major matrix layouts may be extended to tensors as the generalized column- and row-major tensor layouts, where elements are stored contiguously along the first dimension or last dimension, respectively. However, in general we may assume only a general tensor layout, which extends the general matrix layout [2] by replacing matrix row and column strides ( $rs_{M}$ and $cs_{M}$ ) with a stride associated to each tensor dimension. For a $d$ -dimensional tensor $\bm{\mathcal{{T}}}$ indexed by $T_{d}$ , the strides $s_{\bm{\mathcal{{T}}};k}\in\mathbb{N}$ for all $0\leq k<d$ form the set $S_{\bm{\mathcal{{T}}}}=\{s_{\bm{\mathcal{{T}}};0},\ldots,s_{\bm{\mathcal{{T}}},d-1}\}$ , which gives element LOCations relative to $\bm{\mathcal{{T}}}_{0,\ldots,0}$ ,

[TABLE]

For convenience, we may also refer to the stride of the dimension indexed in $\bm{\mathcal{{T}}}$ by a particular symbol $x$ as $s_{\bm{\mathcal{{T}}};x}$ . The generalized column-major and row-major layouts can also be represented using a general stride layout, in which case $s_{\bm{\mathcal{{T}}};k}=\prod_{l=0}^{k-1}N_{\bm{\mathcal{{T}}};l}$ and $s_{\bm{\mathcal{{T}}};k}=\prod_{l=k+1}^{d-1}N_{\bm{\mathcal{{T}}};l}$ , respectively.

In Figure 2(a), $\bm{\mathcal{{C}}}$ is stored in the generalized column-major layout. The entries represents the location of the element $\bm{\mathcal{{C}}}_{{a,b,c}}$ relative to the element $\bm{\mathcal{{C}}}_{{0,0,0}}$ in the tensor storage layout. $s_{\bm{\mathcal{{C}}};a}=1$ , $s_{\bm{\mathcal{{C}}};b}=N_{\bm{\mathcal{{C}}};a}=4$ , and $s_{\bm{\mathcal{{C}}};c}=N_{\bm{\mathcal{{C}}};a}\cdot N_{\bm{\mathcal{{C}}};b}=32$ . The element location of $\bm{\mathcal{{C}}}_{{a,b,c}}$ is $a\cdot s_{\bm{\mathcal{{C}}};a}+b\cdot s_{\bm{\mathcal{{C}}};b}+c\cdot s_{\bm{\mathcal{{C}}};c}=a+4b+32c$ .

Block Scatter Matrix View. In [5] it is shown that tensors can be represented in a matrix-centric layout that allows for a simple but efficient implementation of tensor contraction using the BLIS framework. The main idea of that work is that the locations of tensor elements of $\bm{\mathcal{{T}}}$ can be described in a matrix format, the scatter matrix layout, for some matrix $M$ very similarly to the general stride matrix layout,

[TABLE]

where $rscat_{\bm{\mathcal{{T}}}}\in\mathbb{N}^{N_{i}}$ and $cscat_{\bm{\mathcal{{T}}}}\in\mathbb{N}^{N_{j}}$ . If we define the index bundle $I_{p}$ of size $p$ as the set of indices of $\bm{\mathcal{{T}}}$ that map to columns of $M$ , and the index bundle $J_{q}$ of size $q$ (such that $p+q=d_{\bm{\mathcal{{T}}}}$ ) as the set of indices that map to rows of $M$ , then by inspection of the general stride layout we can see that the scatter vector $rscat_{\bm{\mathcal{{T}}}}$ with respect to $I_{p}$ is given by,

[TABLE]

and similarly for $cscat_{\bm{\mathcal{{T}}}}$ with respect to $J_{q}$ .

The relative location of $\bm{\mathcal{{C}}}_{{a,b,c}}$ in Figure 2(a), or $C_{{i,j}}$ in the matrix view of $\bm{\mathcal{{C}}}$ in Figure 2(b) is $rscat_{\bm{\mathcal{{C}}};i}+cscat_{\bm{\mathcal{{C}}};j}$ (e.g, $LOC(\bm{\mathcal{{C}}}_{2,3,1})=LOC(C_{6,3})=rscat_{\bm{\mathcal{{C}}};6}+cscat_{\bm{\mathcal{{C}}};3}=34+12$ ). Here: (1) $rscat_{\bm{\mathcal{{C}}};i}=a\cdot s_{\bm{\mathcal{{C}}};a}+c\cdot s_{\bm{\mathcal{{C}}};c}=a+32c,i=a+c\cdot N_{a}=a+4c,\forall\,\{a,c\}\in N_{a}\times N_{c}$ ; (2) $cscat_{\bm{\mathcal{{C}}};j}=b\cdot s_{\bm{\mathcal{{C}}};b}=8b,j=b,\forall\,\{b\}\in N_{b}$ . These scatter vectors are shown on top and left of the matrix view of $\bm{\mathcal{{C}}}$ in Figure 2(b).

The general definition of tensor contractions gives a natural mapping from tensors to matrices through the index bundles $I_{m}$ , $J_{n}$ , and $P_{k}$ . Thus, the bundle $I_{m}$ defines $rscat_{\bm{\mathcal{{A}}}}$ and $rscat_{\bm{\mathcal{{C}}}}$ , $J_{n}$ defines $cscat_{\bm{\mathcal{{B}}}}$ and $cscat_{\bm{\mathcal{{C}}}}$ , and $P_{k}$ defines $cscat_{\bm{\mathcal{{A}}}}$ and $rscat_{\bm{\mathcal{{B}}}}$ . If we define matrices $A_{i,k}$ , $B_{k,j}$ , and $C_{i,j}$ and imbue them with scatter matrix layouts using the scatter vectors from the corresponding tensors, we can perform tensor contraction using the high-performance matrix multiplication algorithm introduced in $\S$ 2.1, without explicitly forming those matrices in extra working buffers.

Since we are using the GotoBLAS/BLIS algorithm, we can leverage the fact that these matrices will be partitioned to introduce further optimizations. In the micro-kernel (Figure 1), the matrix $C$ will be partitioned into $m_{R}\times n_{R}$ blocks and the matrices $A$ and $B$ will be partitioned into $m_{R}\times k_{C}$ and $k_{C}\times n_{R}$ slivers, respectively. If we further partition $k_{C}$ into smaller increments of a new parameter $k_{R}$ , on the order of $m_{R}$ and $n_{R}$ , then we will end up with only matrix blocks of very small size. As in [5], we can partition the scatter vectors into very small blocks of size $m_{R}$ , $n_{R}$ , and $k_{R}$ as well, and use optimized algorithms in the packing kernels and micro-kernel when the scatter values for the current block are regularly spaced (i.e. strided). The regular strides for each $m_{R}$ / $n_{R}$ / $k_{R}$ -sized block of $rscat_{\bm{\mathcal{{T}}}}$ / $cscat_{\bm{\mathcal{{T}}}}$ , or zero if no regular stride exists, are collected in a row/column block scatter vector $rbs_{\bm{\mathcal{{T}}}}$ / $cbs_{\bm{\mathcal{{T}}}}$ of length $\lceil\tfrac{N_{i}}{m_{R}}\rceil$ / $\lceil\tfrac{N_{i}}{n_{R}}\rceil$ / $\lceil\tfrac{N_{i}}{k_{R}}\rceil$ and similarly for the other row/column scatter vectors. With these block scatter vectors, we can then utilize efficient SIMD vector load/store instructions for stride-one index, or vector gather/scatter fetch instructions for stride- $n$ index, in a favorable memory access pattern.

In Figure 2(b), assuming $m_{R}=n_{R}=k_{R}=4$ , $rbs_{\bm{\mathcal{{C}}}}=\{1,1\}$ , and $cbs_{\bm{\mathcal{{C}}}}=\{4,4\}$ , since the regular strides for each 4 elements of $rscat_{\bm{\mathcal{{C}}}}$ and $cscat_{\bm{\mathcal{{C}}}}$ are 1 and 4, respectively.

3 Strassen for Tensor Contraction

The operations summarized in $\S$ 2.2 are all special cases of

[TABLE]

for appropriately chosen $\gamma_{0},\gamma_{1},\delta,\epsilon\in\{-1,0,1\}$ . Here, $X$ and $Y$ are submatrices of $A$ , $V$ and $W$ are submatrices of $B$ , and $C$ and $D$ are submatrices of the original $C$ . As in [7], this scheme can be extended to multiple levels of Strassen.

Instead of partitioning the tensor $\bm{\mathcal{{A}}}$ into subtensors $\bm{\mathcal{{X}}}$ and $\bm{\mathcal{{Y}}}$ and so on for $\bm{\mathcal{{B}}}$ and $\bm{\mathcal{{C}}}$ , we partition the matrix representations $A$ , $B$ , and $C$ as in the matrix implementation of Strassen. Figure 2 provides an example to illustrate the partition mechanism. Block scatter matrix layouts for these submatrices may be trivially obtained by partitioning the scatter- and block scatter vectors of the entire matrices along the relevant dimensions. Once imbued with the appropriate layouts, these submatrices may then be used in the BLIS-based Strassen of [7] along with modifications the the packing kernels and micro-kernel as in [5].

In fusing these two methodologies, we need to further address the consideration of multiple block scatter vectors as required when packing and executing the micro-kernel. Methods for dealing with this issue are described in $\S$ 4.1. The advantage of using matrix partitions (which is enabled by the block scatter layout) instead of tensor partitions is primarily that only the product of the lengths of each index bundle, $\{N_{I_{m}}$ , $N_{J_{n}}$ , $N_{P_{k}}\}$ , must be considered when partitioning, and not the lengths of individual tensor dimensions. For example, Strassen may be applied to any tensor contraction where at least one dimension in each bundle is even in our approach, whereas the $last$ dimension (or rather, the dimension with the longest stride) must be even when using subtensors.111A dimension other than the last could also be chosen for partitioning, but the spatial locality of the partitioning would be destroyed. Additionally, when applying methods for performing Strassen on odd-length matrices to tensors, such as dynamical peeling as in [7] or zero-padding, the overhead is larger for subtensors since a single dimension must be padded or peeled rather than the entire index bundle.

4 Implementations

The modifications to the block scatter matrix-based packing kernel and micro-kernel as described in [5] for Strassen are detailed.

4.1 Packing

When packing submatrices for Strassen using (4), multiple scatter- and block scatter vectors must be considered. In our initial implementation, the block scatter vector entries for the corresponding block in both input submatrices (or all submatrices for $L$ -level Strassen) are examined. If all entries are non-zero, then the constant stride is used in packing the current block.222Note that when non-zero, the block scatter vector entries for different submatrices will always be equal. Otherwise, the scatter vectors are used when packing the current block, even though one or more of the input submatrix blocks may in fact have a regular stride. In future work, we plan to exploit these cases for further performance improvements.

In addition to the ABC Strassen algorithm, we also implement the AB Strassen and Naive Strassen algorithms of [7] for tensor contraction. In the AB Strassen algorithm, intermediate submatrices $M$ are explicitly stored and then accumulated into submatrices of $C$ . We store the $M$ submatrices as regular, densely-stored matrices, and handle their accumulation onto block scatter matrix layout submatrices of $C$ using an adapted version of the Strassen block scatter matrix packing kernel. In the Naive Strassen algorithm, submatrices of $A$ and $B$ are also explicitly copied using a modified packing kernel and stored as regular submatrices. Thus, the Naive Strassen algorithm for tensor contraction is extremely similar to a ttdt-based Strassen algorithm (see $\S$ 1), except that the tensors are not required to be partitioned into regular subtensors.

4.2 Micro-kernel

As in [7], we use assembly-coded micro-kernels that include the update to several submatrices of $C$ from registers. In order to use this efficient update, $all$ block scatter vector entries for the relevant submatrix blocks of $C$ must be non-zero. Unlike in the packing kernel implementation, the case where only one or more of the submatrix blocks is regular stride would be more difficult to take advantage of, as the micro-kernel would have to be modified to flexibly omit or redirect individual submatrix updates.

5 Performance Model

In [7], a performance model was proposed to predict the execution time $T$ for variations of Strassen for matrices. In this section, we extend that performance model to estimate the execution time $T$ of ABC, AB and Naive variations of $L$ -level Strassen for TC and the high-performance TC routine we build on (see $\S$ 2.3; using TBLIS implementation [5, 26] introduced in $\S$ 6; denoted as tblis henceforth). Due to the high dimensionality of tensors and enormous types and combinations of permutations (transpositions) in TC, it is impractical to exhaustively search for every tensor shape and tensor problem size to find the best variation. Performance modeling helps us to better understand the memory footprint and computation of different Strassen implementations for TC, and at least reduce the search space to pick the right implementation. In our model, besides input problem size, block sizes, and the hardware parameters such as the peak GFLOPS and bandwidth, $T$ also depends on the shape of the tensors, and the extra permutations (transpositions) in the packing routines and in the micro-kernel.

Notations. We summarize our notations in Figure 3. The total execution time, $T$ , can be decomposed of arithmetic time $T_{a}$ and memory time $T_{m}$ (\raisebox{-0.9pt}{2}⃝ in Figure 4).

Arithmetic operations. As shown in \raisebox{-0.9pt}{3}⃝, $T_{a}$ includes (sub)tensor contraction ( $T_{a}^{\times}$ ) and (sub)tensor additions/permutations ( $T_{a}^{\bm{\mathcal{{A}}}_{+}}$ , $T_{a}^{\bm{\mathcal{{B}}}_{+}}$ , $T_{a}^{\bm{\mathcal{{C}}}_{+}}$ ). The corresponding coefficients $W_{a}^{X}$ for tblis TC and $L$ -level various Strassen TC are enumerated in Figure 4. Note that $T_{a}^{X}$ is calculated by multiplying the unit time $\tau_{a}$ with the arithmetic operation number in the middle table of Figure 4. We compute $\tau_{a}$ through \raisebox{-0.9pt}{5}⃝. The penalty factor $\rho_{a}\in(0,1]$ is introduced, due to the extra computations involved in $rscat_{\bm{\mathcal{{T}}}}$ / $cscat_{\bm{\mathcal{{T}}}}$ / $rbs_{\bm{\mathcal{{T}}}}$ / $cbs_{\bm{\mathcal{{T}}}}$ , and the slow micro-kernel invocation when the corresponding entries in $rbs_{\bm{\mathcal{{C}}}}$ or $cbs_{\bm{\mathcal{{C}}}}$ are [math] (see $\S$ 4.2; non-regular stride access). We penalize the performance drops caused by these factors by setting $\rho_{a}=0.95$ .

Memory operations. Similar to [7], we assume two layers of modern memory hierarchy: slow main memory and fast caches. For write operations, the lazy write-back policy is enforced such that the time for writing into fast caches can be hidden. For read operations, the latency for accessing the slow main memory is counted, while the latency for accessing caches can be ignored. With these assumptions, $T_{m}$ can be broken down into three parts (\raisebox{-0.9pt}{4}⃝ in Figure 4): updating the temporary buffer that are parts of Naive Strassen/AB Strassen ( $W_{m}^{\bm{\mathcal{{T}}}_{+}}\cdot T_{m}^{\bm{\mathcal{{T}}}_{+}}$ ); memory packing shown in Figure 1 ( $W_{m}^{\bm{\mathcal{{A}}}_{\times}}\cdot T_{m}^{\bm{\mathcal{{A}}}_{\times}}$ , $W_{m}^{\bm{\mathcal{{B}}}_{\times}}\cdot T_{m}^{\bm{\mathcal{{B}}}_{\times}}$ ) ; updating the submatrices of $C$ shown in Figure 1 ( $W_{m}^{\bm{\mathcal{{C}}}_{\times}}\cdot T_{m}^{\bm{\mathcal{{C}}}_{\times}}$ ). The coefficients $W_{m}^{X}$ are tabulated in Figure 4. $T_{m}^{X}$ is a function of block sizes $\{m_{C},k_{C},n_{C}\}$ in Figure 1, and the bundle lengths $\{{{N_{I_{m}}}}{}/2^{L},{{N_{J_{n}}}}{}/2^{L},{{N_{P_{k}}}}{}/2^{L}\}$ because the memory operation can repeat multiple times according to which loop they reside in. Figure 4(middle) characterizes each memory operation term by its read/write type and the amount of memory in units of 64-bit double precision elements. $T_{m}^{{\widetilde{A}}_{\times}}$ , $T_{m}^{{\widetilde{B}}_{\times}}$ are omitted in \raisebox{-0.9pt}{4}⃝ due to the lazy write-back policy assumption. Because of the software prefetching effects, there is an extra parameter $\lambda\in(0.5,1]$ for $T_{m}^{C_{\times}}$ , which denotes the prefetching efficiency. In order to get $T_{m}^{X}$ , the memory operation number needs to be multiplied by the bandwidth $\tau_{b}$ . We compute $\tau_{b}$ through \raisebox{-0.9pt}{6}⃝. We penalize the effect of permutations without stride-one index accesss (see $\S$ 4.1; the corresponding entries in neither $rbs_{\bm{\mathcal{{T}}}}$ or $cbs_{\bm{\mathcal{{T}}}}$ are 1, i.e. using scatter/gather operation, or indirect memory addressing with (3)) by setting $\rho_{b}=0.7$ . A similar parameter is introduced in [6] for regular TC.

Discussion We can estimate the run time performance of various implementations, based on the performance model presented in Figure 4. Here we define Effective GFLOPS (\raisebox{-0.9pt}{1}⃝ in Figure 4) for TC as the metric to compare the performance of various Strassen TC and tblis TC. The theoretical peak GFLOPS and bandwidth information is given in $\S$ 6. In Figure 5(left), we demonstrate the modeled and actual performance for a wide range of synthetic tensor sizes and shapes: ${{N_{I_{m}}}}{}\!\!\approx\!\!{{N_{J_{n}}}}{}\!\!\approx\!\!{{N_{P_{k}}}}{}$ ; ${{N_{I_{m}}}}{}\!\!\approx\!\!{{N_{J_{n}}}}{}\!\!\approx\!\!16000$ , ${{N_{P_{k}}}}{}$ varies; ${{N_{P_{k}}}}{}\!\!\approx\!\!1024$ , ${{N_{I_{m}}}}{}\!\!\approx\!\!{{N_{J_{n}}}}{}$ vary. How we generate synthetic data is detailed in $\S$ 6.

•

For ${{N_{I_{m}}}}{}\!\!\approx\!\!{{N_{J_{n}}}}{}\!\!\approx\!\!{{N_{P_{k}}}}{}$ , the ABC Strassen/AB Strassen implementations outperform tblis, when ${{N_{I_{m}}}}{}$ , ${{N_{J_{n}}}}{}$ , ${{N_{P_{k}}}}{}$ are as small as $2k_{C}$ , nearly 500; while Naive Strassen cannot beat tblis until the problem size is larger than 2000.

•

The “ ${{N_{I_{m}}}}{}\!\!\approx\!\!{{N_{J_{n}}}}{}\!\!\approx\!\!16000$ , ${{N_{P_{k}}}}{}$ varies” graph shows that when ${{N_{P_{k}}}}{}$ is small, ABC Strassen performs best; when ${{N_{P_{k}}}}{}$ is large, AB Strassen performs better. The coefficients $W_{m}^{X}$ in Figure 4(bottom) help to illustrate the reasons quantitatively.

•

According to the model, when ${{N_{P_{k}}}}{}$ is equal to appropriate multiple of $k_{C}$ ( ${{N_{P_{k}}}}{}=2^{L}\cdot k_{C}$ for $L$ -level), ABC Strassen achieves the best performance. We will leverage this observation in our distributed memory experiment.

6 Experiments

We perform our experimental evaluations for synthetic data and real-world benchmarks on a single node and on a distributed memory architecture. The implementations are written in C++, utilizing AVX assembly, based on the open source TBLIS framework [26]. We compare against TBLIS’s tensor contraction routine (marked as tblis) as well as the TTT routine from MATLAB Tensor Toolbox [27] (linked with Intel MKL [28], marked as ttt) for single node, and tensor contraction routine from the Cyclops Tensor Framework [29] (also linked with Intel MKL, marked with ctf) for distributed memory.

We measure the CPU performance results on the Maverick system at the Texas Advanced Computing Center (TACC). Each node of that system consists of a dual-socket (10 cores/socket) Intel Xeon E5-2680 v2 (Ivy Bridge) processors with 256 GB memory (peak bandwidth: 59.7 GB/s with four channels) and a three-level cache (32 KB L1 data; 256 KB L2; 25.6 MB L3). The stable CPU clockrate is 3.54 GHz when a single core is utilized (28.32 GFLOPS peak, marked in the graphs) and 3.10 GHz when all ten cores are in use (24.8 GFLOPS/core peak). We disable hyper-threading explicitly and set thread affinity with KMP_AFFINITY=compact which also ensures the computation and the memory allocation all reside on the same socket.

The cache blocking parameters, $m_{C}=96$ , $n_{C}=4096$ , $k_{C}=256$ , and the register block sizes, $m_{R}=8$ , $n_{R}=4$ , are consistent with parameters used for the standard BLIS dgemm implementation for this architecture. We use the default value of $k_{R}=4$ as defined in TBLIS. This makes the size of the packing buffer $\widetilde{A}_{i}$ 192 KB and $\widetilde{B}_{p}$ 8192 KB, which then fit the L2 cache and L3 cache, respectively. Parallelization is implemented mirroring that described in [4], but with the number of threads assigned to each of the loops in Figure 1 automatically determined by the TBLIS framework.

6.1 Single node experiments

Synthetic tensor contractions. To evaluate the overall performance of various Strassen TC comparing against tblis TC for different tensor problem sizes, shapes, and permutations, we randomly generate TC test cases with 2-D to 6-D randomly permuted tensors as operands, and test all these implementations for each synthetic test case. We choose step size $256$ to sample uniformly $\{{{N_{I_{m}}}}{},{{N_{J_{n}}}}{},{{N_{P_{k}}}}{}\}$ for various tensor bundle lengths: square: ${{N_{I_{m}}}}{}\!\!\approx\!\!{{N_{J_{n}}}}{}\!\!\approx\!\!{{N_{P_{k}}}}{}$ ; rank- ${{N_{P_{k}}}}{}$ : ${{N_{I_{m}}}}{}\!\!\approx\!\!{{N_{J_{n}}}}{}\!\!\approx\!\!16000$ , ${{N_{P_{k}}}}{}$ varies; fixed- ${{N_{P_{k}}}}{}$ : ${{N_{P_{k}}}}{}\!\!\approx\!\!1024$ , ${{N_{I_{m}}}}{}\!\!\approx\!\!{{N_{J_{n}}}}{}$ vary. For each bundle length $\{{{N_{I_{m}}}}{},{{N_{J_{n}}}}{},{{N_{P_{k}}}}{}\}$ , we randomly generate three $\{I_{m},J_{n},P_{k}\}$ 1-D, 2-D, or 3-D bundles, such that the product of each index length is close to $\{{{N_{I_{m}}}}{},{{N_{J_{n}}}}{},{{N_{P_{k}}}}{}\}$ . The order of $\{I_{m},J_{n},P_{k}\}$ is then randomly permuted.

The generated bundle lengths may not exactly match the original sampled bundle lengths. When we plot the actual performance of these synthetic test cases, we set ${\widetilde{N_{I_{m}}}}{}={\widetilde{N_{J_{n}}}}{}={\widetilde{N_{P_{k}}}}{}=({{N_{I_{m}}}}{}\cdot N_{J_{n}}\cdot N_{P_{k}})^{1/3}$ for the square bundle lengths; ${\widetilde{N_{P_{k}}}}{}={{N_{I_{m}}}}{}\cdot{{N_{J_{n}}}}{}\cdot{{N_{P_{k}}}}{}/(16000\cdot 16000)$ for rank- ${{N_{P_{k}}}}{}$ bundle lengths; ${\widetilde{N_{I_{m}}}}{}={\widetilde{N_{J_{n}}}}{}=({{N_{I_{m}}}}{}\cdot{{N_{J_{n}}}}{}\cdot{{N_{P_{k}}}}{}/1024)^{1/2}$ for fixed- ${{N_{P_{k}}}}{}$ bundle lengths.

For the square and rank- ${{N_{P_{k}}}}{}$ tensor shapes on one core, tblis is rapidly outpaced by ABC Strassen, with a crossover point of about $500\approx 2\cdot k_{C}$ . ABC Strassen is then shortly overtaken by AB Strassen and then by two-level AB Strassen. As predicted by the performance model, the AB Strassen implementation is best for very large problem sizes due to repeated updates to $C$ in the ABC Strassen algorithm. The Naive Strassen implementations are never the best in these experiments, although they may become more efficient than AB Strassen for extremely large, square problems. These trends are repeated in the ten-core experiments, although the crossover points are moved to larger tensor sizes.

For the fixed- ${{N_{P_{k}}}}{}$ shapes, total performance is lower for AB Strassen and Naive Strassen with scalability for the algorithms being especially impacted by the relatively smaller ${{N_{I_{m}}}}{}$ and ${{N_{J_{n}}}}{}$ sizes. For these shapes ABC Strassen is always the fastest method above the crossover point with standard tblis.

The actual performance data matches the predicted performance very well, with some variation due to the randomization of the tensor lengths and permutations. Using these performance models, it may be possible to analytically decide on which algorithm to apply for a given tensor contraction to achieve the highest performance, allowing an automated and seamless inclusion of Strassen into a TBLIS-like tensor framework.

Real-world benchmark. In Figure 6, we measure the performance of various implementations for a subset of tensor contractions from the Tensor Contraction Benchmark [30] on single core and one socket. We present representative use cases where ${{N_{P_{k}}}}{}$ is nearly equal to or larger than $2k_{C}$ (512), for which Strassen can show performance benefits, as illustrated in $\S$ 5. The right three test cases represent various regularly-blocked tensor contractions from coupled cluster with single and double excitations (CCSD) [31, 32, 33], a workhorse quantum chemistry computational method. The fourth case from the right illustrates the performance of tblis and Strassen TC for a pure matrix case. Comparing this case and the CCSD contractions highlights some of the performance issues that exist in the current implementation of the packing and matrix-to-block scatter matrix copy kernels (see $\S$ 4.1 for details). On one core, all Strassen implementations improve on tblis for these right four cases, and in parallel one-level Strassen implementations give a speedup as well, exceeding ttt performance especially in the case of AB Strassen. The gap between tblis and ttt for these contractions is due to ttt’s use of Intel’s MKL library, which is more highly optimized than the BLIS/TBLIS framework.

The left two benchmarks are again quantum chemistry applications using 3-D tensors that arise in density-fitting (DF) calculations [34, 35]. These contractions are also structurally equivalent to certain contractions from the coupled cluster with perturbative triples, CCSD(T), method [36], where the occupied (see $\S$ 6.2) indices have been sliced. These cases show the improvement of tblis over ttt as noted in [5], but do not show a speedup from Strassen except for one-level ABC Strassen on one core. Our Strassen implementation performs the submatrix multiplications sequentially, with only parallelization of each submatrix multiplication step. A more comprehensive parallelization scheme, for example using task-based parallelism [9], may show better performance. Additionally, since the DF/CCSD(T) contractions are highly “non-square”, an alternate fast matrix multiplication algorithm [9, 8] may perform better.

Shape-dependence experiments. The performance of the “particle-particle ladder” contraction from CCSD, $\bm{\mathcal{{Z}}}_{abij}+\!\!=\bm{\mathcal{{W}}}_{abef}\cdot\bm{\mathcal{{T}}}_{efij}$ is reported for a range of tensor shapes in Figure 7. In these experiments, the length of the virtual dimensions $\{a,b,e,f\}$ is varied with respect to the length of the occupied dimensions $\{i,j\}$ such that the total number of FLOPs is roughly similarly to a $16000\times 16000$ matrix multiplication, and the ratio $N_{a}\colon N_{i}$ is used as a proxy for tensor shape. A ratio of 1:1 would reflect an extremely poor quality of basis set for the overall calculation, but is common when the calculation employs regular blocking. The other end of the scale, with a ratio of $\thicksim 5:1$ , would then correspond to uneven blocking. This type of blocking allows for better load balancing and lower overhead when $N_{a}$ and $N_{i}$ are very unequal in the overall calculation.

The performance of tblis and all of the one-level Strassen algorithms show essentially no performance degradation across the entire range tested. The two-level Strassen algorithms show some performance degradation at larger ratios, but still show improvement over tblis. Eventually, all Strassen algorithms will cross over and perform worse than tblis, as evidenced by the left two contractions in Figure 6 (these correspond to a ratio of about 22). However, the good performance of Strassen out to reasonably large ratios shows that it could be beneficial in both regular blocking and uneven blocking scenarios.

6.2 Distributed memory experiments

We demonstrate how to use the Strassen TC implementations to accelerate a distributed memory implementation of 4-D tensor contraction that exemplifies the two-particle “ring” terms from CCSD. In our tests we set the length of virtual indices ( $abe$ ) to $10\times$ that of occupied indices ( $ijm$ ), which approximates the use of a triple- $\zeta$ guality basis set. The problem sizes tested here correspond to calculations on systems with 80, 112, 160, 192, and 224 electrons. We use $\bm{\mathcal{{Z}}}_{abij}+\!\!=\bm{\mathcal{{W}}}_{bmej}\bm{\mathcal{{T}}}_{aeim}$ as a demonstration example to show the performance benefit.

We implement a SUMMA-like[37] algorithm for 4-D tensor contraction with MPI. Initially the tensors $\bm{\mathcal{{W}}}$ , $\bm{\mathcal{{T}}}$ , and $\bm{\mathcal{{Z}}}$ are distributed to a $P\times P$ mesh of MPI processes using a 2D block distribution over the $a$ , $b$ , and $e$ dimensions, with the $i$ , $j$ , and $m$ dimensions stored locally (i.e. not distributed). After slicing $\bm{\mathcal{{W}}}$ and $\bm{\mathcal{{T}}}$ along the $e$ dimension, the contraction is broken down into a sequence of contractions of tensor slice pairs,

[TABLE]

such that the $e$ index length for each tensor slice pairs $\{\bm{\mathcal{{W}}}_{e;p},\bm{\mathcal{{T}}}_{e;p}\}$ is $N^{\prime}_{e}$ . For each tensor slice pairs, $\bm{\mathcal{{W}}}_{e;p}$ is broadcast within rows of the mesh, and $\bm{\mathcal{{T}}}_{e;p}$ is broadcast within columns of the mesh. Then a local tensor contration for received tensor slice pairs is performed to update the local block. Here tblis TC and various Strassen TC are used as a drop-in replacement for this local tensor contraction.

We perform the distributed memory experiment on the same machine as the single node experiment. The dual-socket processor has ten cores on each socket. We run one MPI process for each socket, and leverage all ten cores in a socket with thread parallelism for all implementations. Figure 8 reports the weak scalability performance result on up to 640 cores (32 nodes, 64 sockets).

In our experiments on $P\times P$ mesh of sockets (MPI processes), the lengths of virtual indices are set to equal $N_{a}=N_{b}=N_{e}\approx 400\sqrt{P}$ and the lengths of occupied indices are set to equal $N_{m}=N_{i}=N_{j}\approx 40\sqrt{P}$ , which make ${{N_{I_{m}}}}{}={{N_{J_{n}}}}{}={{N_{P_{k}}}}{}\approx 16000\cdot P$ . This guarantees the local memory buffer allocated to $\bm{\mathcal{{Z}}}$ , $\bm{\mathcal{{W}}}$ , $\bm{\mathcal{{T}}}$ is constant. Our experiments verify that the above SUMMA-like algorithm is weakly scalable on this constant local memory setup, regardless of which local TC implementation we use. The local $e$ index length $N^{\prime}_{e}$ is chosen close to $N^{\prime}_{e}=1024/N_{m}$ such that the local TC computations are performed with ${{N_{P_{k}}}}{}=N^{\prime}_{e}\cdot N_{m}\approx 4\cdot k_{C}$ . The tensor slice pairs in the local TC computations matches the shape when ABC Strassen achieves the best performance. Therefore, the one-level and two-level ABC Strassen implementations outperform all other implementations.

We also tested the Cyclops Tensor Framework (CTF) [29] which also uses a SUMMA or nested SUMMA algorithm but with possibly different block sizes and tensor distributions, as well as using the ttdt algorithm for local tensor contractions. We show it here as a reference for state-of-the-art performance.

7 Conclusions

We have presented what we believe to be the first paper to demonstrate how to leverage Strassen’s algorithm for tensor contraction, and have shown practical performance speedup on single core, multicore, and distributed memory implementations. Using a block scatter matrix layout enables us to partition the matrix view of the tensor, instead of the tensor itself, with automatic (implicit) tensor-to-matrix transformation, and the flexibility to facilitate Strassen’s 2D matrix partition to multi-dimensional tensor spaces. Fusing the matrix summation that must be performed for Strassen and the transposition that must be conducted for tensor contraction with the packing and micro-kernel operations inside high-performance implementation of GEMM avoids extra workspace requirements, and reduces the cost of additional memory movement. We provided a performance model which can accurately predict the speedup of the resulting family of algorithms for different tensor shapes, sizes, and permutations. We evaluated our families of implementations for various tensor sizes and shapes on synthetic and real-world datasets, both observing significant speedups comparing to the baseline (tblis) and naive implementations (Naive Strassen), particularly for smaller problem sizes ( ${{N_{I_{m}}}}{},{{N_{J_{n}}}}{},{{N_{P_{k}}}}{}\approx 2k_{C},4k_{C}$ ), and irregular shape ( ${{N_{P_{k}}}}{}$ is much smaller comparing to ${{N_{I_{m}}}}{}$ , ${{N_{J_{n}}}}{}$ ). Together, this work demonstrates Strassen’s algorithm can be applied for tensor contraction with practical performance benefit.

There are several avenues for future work:

•

Higher-level tensor decomposition algorithms [38], such as Tucker decomposition, involve heavy use of tensor contraction. The impact of our performance improvements with Strassen’s algorithm for those algorithms is an interesting question. It may be possible to leverage our performance model to determine the best implementation for the tensor shape these algorithms require.

•

So far, we target dense tensor contraction, which has numerous applications. However, the structure of the tensor operands may be symmetric [39] or sparse [40], which yields a number of new challenges, like more efficient storage or layout format. How to explore those structure patterns and combine with Strassen’s algorithm can be investigated.

•

More levels of Strassen’s algorithm may lose precision due to numerical instability issues. It may be possible to combine with the techniques proposed in Extended and Mixed Precision BLAS [41] to get higher speedup and maintain precision.

•

A number of recent papers explore practical implementations of Strassen-like fast matrix multiplications [9, 8]. How to extend fast matrix multiplication with different partition block sizes for tensor contraction is an open question.

Additional information

Additional information regarding BLIS and related projects can be found at

http://shpc.ices.utexas.edu

Acknowledgments

This work was sponsored in part by the National Science Foundation under grant number ACI-1550493, by Intel Corporation through an Intel Parallel Computing Center grant, and by a gift from Qualcomm. Access to the Maverick supercomputers administered by TACC is gratefully acknowledged. DAM is an Arnold O. Beckman Postdoctoral Fellow. We thank Martin Schatz for his help with distributed memory implementations, and the rest of the SHPC team (http://shpc.ices.utexas.edu) for their supports.

*Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. *

Bibliography41

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] K. Goto and R. A. van de Geijn, “Anatomy of a high-performance matrix multiplication,” ACM Trans. Math. Soft. , vol. 34, no. 3, p. 12, May 2008.
2[2] F. G. Van Zee and R. A. van de Geijn, “BLIS: A framework for rapidly instantiating BLAS functionality,” ACM Trans. Math. Soft. , vol. 41, no. 3, pp. 14:1–14:33, June 2015.
3[3] F. G. Van Zee, T. Smith, F. D. Igual, M. Smelyanskiy, X. Zhang, M. Kistler, V. Austel, J. Gunnels, T. M. Low, B. Marker, L. Killough, and R. A. van de Geijn, “The BLIS framework: Experiments in portability,” ACM Transactions on Mathematical Software , vol. 42, no. 2, pp. 12:1–12:19, June 2016.
4[4] T. M. Smith, R. A. van de Geijn, M. Smelyanskiy, J. R. Hammond, and F. G. Van Zee, “Anatomy of high-performance many-threaded matrix multiplication,” in 28th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2014) , 2014.
5[5] D. A. Matthews, “High-performance tensor contraction without transposition,” Co RR , vol. abs/1607.00291, 2016.
6[6] P. Springer and P. Bientinesi, “Design of a high-performance gemm-like tensor-tensor multiplication,” Co RR , vol. abs/1607.00145, 2016.
7[7] J. Huang, T. M. Smith, G. M. Henry, and R. A. van de Geijn, “Strassen’s algorithm reloaded,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 16) . IEEE Press, 2016, pp. 59:1–59:12.
8[8] J. Huang, L. Rice, D. A. Matthews, and R. van de Geijn, “Generating families of practical fast matrix multiplication algorithms,” in 31th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2017) , 2017.