Strassen's Algorithm for Tensor Contraction
Jianyu Huang, Devin A. Matthews, Robert A. van de Geijn

TL;DR
This paper extends Strassen's algorithm to tensor contraction, a multi-dimensional generalization of matrix multiplication, demonstrating practical speedups by optimizing tensor layout and avoiding transpositions.
Contribution
First practical demonstration of accelerating tensor contraction with Strassen's algorithm using a novel tensor layout and optimized implementation techniques.
Findings
Achieved up to 1.3x speedup on modern architectures
Introduced a Block-Scatter-Matrix tensor layout for efficient computation
Avoided explicit transpositions and extra workspace in implementation
Abstract
Tensor contraction (TC) is an important computational kernel widely used in numerous applications. It is a multi-dimensional generalization of matrix multiplication (GEMM). While Strassen's algorithm for GEMM is well studied in theory and practice, extending it to accelerate TC has not been previously pursued. Thus, we believe this to be the first paper to demonstrate how one can in practice speed up tensor contraction with Strassen's algorithm. By adopting a Block-Scatter-Matrix format, a novel matrix-centric tensor layout, we can conceptually view TC as GEMM for a general stride storage, with an implicit tensor-to-matrix transformation. This insight enables us to tailor a recent state-of-the-art implementation of Strassen's algorithm to TC, avoiding explicit transpositions (permutations) and extra workspace, and reducing the overhead of memory movement that is incurred. Performance…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTensor decomposition and applications · Parallel Computing and Optimization Techniques · Algorithms and Data Compression
\TPMargin
5pt
Strassen’s Algorithm for Tensor Contraction
FLAME Working Note #84
Jianyu Huang*†
Devin A. Matthews†
Robert A. van de Geijn*†
*Department of Computer Science
†Institute for Computational Engineering and Sciences
The University of Texas at Austin, Austin, TX 78712
{jianyu@cs., dmatthews@, rvdg@cs.}utexas.edu
( April 3, 2017 )
Abstract
Tensor contraction (TC) is an important computational kernel widely used in numerous applications. It is a multi-dimensional generalization of matrix multiplication (GEMM). While Strassen’s algorithm for GEMM is well studied in theory and practice, extending it to accelerate TC has not been previously pursued. Thus, we believe this to be the first paper to demonstrate how one can in practice speed up tensor contraction with Strassen’s algorithm. By adopting a Block-Scatter-Matrix format, a novel matrix-centric tensor layout, we can conceptually view TC as GEMM for a general stride storage, with an implicit tensor-to-matrix transformation. This insight enables us to tailor a recent state-of-the-art implementation of Strassen’s algorithm to TC, avoiding explicit transpositions (permutations) and extra workspace, and reducing the overhead of memory movement that is incurred. Performance benefits are demonstrated with a performance model as well as in practice on modern single core, multicore, and distributed memory parallel architectures, achieving up to speedup. The resulting implementations can serve as a drop-in replacement for various applications with significant speedup.
1 Introduction
Standing on the shoulders of giants. This paper builds upon a number of recent developments: The GotoBLAS algorithm for matrix multiplication (GEMM) [1] that underlies the currently fastest implementations of GEMM for CPUs; The refactoring of the GotoBLAS algorithm as part of the BLAS-like Library Instantiation Software (BLIS) [2, 3], which exposes primitives for implementing BLAS-like operations; The systematic parallelization of the loops that BLIS exposes so that high-performance can be flexibly attained on multicore and many-core architectures [4]; The casting of tensor contraction (TC) in terms of the BLIS primitives [5, 6] without requiring the transposition (permutation) used by traditional implementations; The practical high-performance implementation of the classical Strassen’s algorithm (Strassen) [7] in terms of variants of the BLIS primitives; and the extension of this implementation [8] to a family of Strassen-like algorithms (Fast Matrix Multiplication algorithms) [9]. Together, these results facilitate what we believe to be the first extension of Strassen’s algorithm to TC.
Contributions. This paper describes how to extend Strassen’s algorithm to TC without the explicit transposition of data that inherently incurs significant memory movement and workspace overhead; It provides a performance model for the cost of the resulting family of algorithms; It details the practical implementation of these algorithms, including how to exploit variants of the primitives that underlie BLIS and a data layout to memory for the tensors; It demonstrates practical speedup on modern single core and multicore CPUs; It illustrates how the local Strassen’s TC algorithm improves performance of a simple distributed memory tensor contraction. Together, these results unlock a new frontier for the research and application of Strassen’s algorithm.
Related work. To the best of our knowledge, this work represents the first implementation of Strassen’s algorithm for tensor contraction. In the context of Strassen for matrices, there have been a variety of practical implementations [10, 11, 12, 9], including the closely related implementation of Strassen using the BLIS framework [7] which this paper is based on.
For tensor contraction, recent work on high-performance tensor contraction [5, 6] serves as the motivation and basis for our present work, while other research has focused on algorithms using tensor slicing [13, 14, 15, 16] or on improving the efficiency of the so-called ttdt algorithm for tensor contraction [17, 18, 19, 20], where input tensors and are Transposed (permuted) and then used in a standard dgemm algorithm, with the output then being Transposed and accumulated onto the tensor . ttdt could be used to construct a Strassen algorithm for TC by transposing subtensors into submatrices and vice versa and using a matrix implementation of Strassen instead of dgemm. However, we will show that this algorithm is essentially the same as our Naive Strassen algorithm, which is often less efficient than the other algorithms that we have implemented.
The gett algorithm [6] is a high-performance tensor contraction implementation similar in many ways to the BLIS-based implementation in [5]. As in [5], which our present work is based on, formation of linear combinations of input subtensors of and and output to multiple subtensors of could be fused with the internal tensor transposition and micro-kernel steps of gett. However, the implementation would be restricted to regular subtensors rather than more general submatrices, which could have possible negative performance implications.
2 Background
We briefly review how high-performance GEMM is implemented, before discussing the practical implementations of high-performance Strassen for GEMM.
2.1 High-performance GEMM
Let , , and be matrices of sizes , , and , respectively. A general matrix-matrix multiplication (gemm) in the BLAS interface [21] is expressed as . Written element-wise, , where denotes scalar multiplication, and and are scalars. We focus on the special case and henceforth for brevity.
A key insight underlying modern high-performance implementations of gemm is to organize the computations by partitioning the operands into blocks for temporal locality, and to pack (copy) such blocks into contiguous buffers that fit into various levels of memory for spatial locality. Figure 1(left) illustrates the GotoBLAS algorithm as implemented in BLIS. Cache blocking parameters determine the submatrix sizes of () and (), such that they fit in various caches (we use the standard gemm dimensions in defining blocking parameters for brevity and consistency with [2], but note that the meaning of alone is changed in 2.3). During the computation, row panels are contiguously packed into buffer to fit in the L3 cache. Blocks are similarly packed into buffer to fit in the L2 cache. Register block sizes relate to submatrices in registers that contribute to . In the micro-kernel (the inner most loop), a small micro-tile of is updated by pair of and slivers of and . The above parameters can be analytically chosen [22].
2.2 High-performance Strassen
If the three operands are partitioned into quadrants,
[TABLE]
then it can be checked that the operations
[TABLE]
compute , with seven instead of eight (sub)matrix multiplications, reducing the cost by a factor of (ignoring a lower order number of extra additions). If all matrices are square and of size , theoretically this single step of Strassen can be applied recursively, resulting in the classical Strassen with a cost of .
In practice, only a few levels of the recursion are leveraged because the reduction in computations are quickly overwhelmed by the cost of extra additions and extra memory movements. Additionally, Strassen is known to experience degradation in numerical stability especially when more than two levels of recursion are incorporated [23, 24, 25].
Figure 1(right) illustrates the modifications done in [7] to make Strassen practical. During the packing process, the additions of the submatrices and can be incorporated into the packing buffers and , avoiding extra memory movement and reducing workspace requirements. In the micro-kernel, once a submatrix that contributes to is computed in machine registers, it can be directly added to the appropriate parts of multiple submatrices of , thus avoiding the need for temporary intermediate matrices , again avoiding extra memory movement. As demonstrated in [7], this approach makes Strassen practical for smaller matrices and matrices of special shape (importantly, for rank-k updates, where is relatively small comparing to and ). This research is pushed further [8] by revealing that Strassen performs relatively better than most other Strassen-like FMM algorithms with one or two levels of recursions, when modeled as well as in practice. For this reason, we do not extend those FMM algorithms to TC in this paper, although it may be worthwhile in future work to pursue certain of these algorithms for highly non-square tensor contraction shapes.
2.3 High-performance Tensor Contraction
The definition and notation of tensors and tensor contraction are briefly reviewed before describing the tensor layouts that enable high-performance tensor contraction.
Tensor. The concept of matrices is extended to multiple dimensions by defining a general -D tensor as a multidimensional array of scalar elements, where the length of the -th dimension is given by . Individual elements are referenced by indexing by an ordered index bundle , such that for all . In general we will denote the dimension of a tensor as , the index length as the length of the dimension that is indexed by some symbol , and the bundle length as the total length of a index bundle , i.e. .
Tensor Contraction. Let , , and be general tensors of any dimensionality satisfying . Then, let , , and be index bundles with and . Lastly, let the index reordering be defined by the bijective map , and similarly for and . The general definition of tensor contraction is then given by,
[TABLE]
for scalars . The indices in the bundles and are generally called free, external, or uncontracted indices, while the indices in the bundle are called bound, internal, or contracted indices. In the following we will assume that and , and suppress the explicit summation over . The number of leading-order floating point operations required for tensor contraction is = . If the length of each dimension is , the tensor contraction operation requires flops.
In Figure 2(a), the tensor contraction is illustrated. In the general notation this gives , , , , , and . The number of floating point operations and memory accesses for this contraction is identical to that for a matrix multiplication of , , and matrices.
General stride layouts. The well-known column-major and row-major matrix layouts may be extended to tensors as the generalized column- and row-major tensor layouts, where elements are stored contiguously along the first dimension or last dimension, respectively. However, in general we may assume only a general tensor layout, which extends the general matrix layout [2] by replacing matrix row and column strides ( and ) with a stride associated to each tensor dimension. For a -dimensional tensor indexed by , the strides for all form the set , which gives element LOCations relative to ,
[TABLE]
For convenience, we may also refer to the stride of the dimension indexed in by a particular symbol as . The generalized column-major and row-major layouts can also be represented using a general stride layout, in which case and , respectively.
In Figure 2(a), is stored in the generalized column-major layout. The entries represents the location of the element relative to the element in the tensor storage layout. , , and . The element location of is .
Block Scatter Matrix View. In [5] it is shown that tensors can be represented in a matrix-centric layout that allows for a simple but efficient implementation of tensor contraction using the BLIS framework. The main idea of that work is that the locations of tensor elements of can be described in a matrix format, the scatter matrix layout, for some matrix very similarly to the general stride matrix layout,
[TABLE]
where and . If we define the index bundle of size as the set of indices of that map to columns of , and the index bundle of size (such that ) as the set of indices that map to rows of , then by inspection of the general stride layout we can see that the scatter vector with respect to is given by,
[TABLE]
and similarly for with respect to .
The relative location of in Figure 2(a), or in the matrix view of in Figure 2(b) is (e.g, ). Here: (1) ; (2) . These scatter vectors are shown on top and left of the matrix view of in Figure 2(b).
The general definition of tensor contractions gives a natural mapping from tensors to matrices through the index bundles , , and . Thus, the bundle defines and , defines and , and defines and . If we define matrices , , and and imbue them with scatter matrix layouts using the scatter vectors from the corresponding tensors, we can perform tensor contraction using the high-performance matrix multiplication algorithm introduced in 2.1, without explicitly forming those matrices in extra working buffers.
Since we are using the GotoBLAS/BLIS algorithm, we can leverage the fact that these matrices will be partitioned to introduce further optimizations. In the micro-kernel (Figure 1), the matrix will be partitioned into blocks and the matrices and will be partitioned into and slivers, respectively. If we further partition into smaller increments of a new parameter , on the order of and , then we will end up with only matrix blocks of very small size. As in [5], we can partition the scatter vectors into very small blocks of size , , and as well, and use optimized algorithms in the packing kernels and micro-kernel when the scatter values for the current block are regularly spaced (i.e. strided). The regular strides for each //-sized block of /, or zero if no regular stride exists, are collected in a row/column block scatter vector / of length // and similarly for the other row/column scatter vectors. With these block scatter vectors, we can then utilize efficient SIMD vector load/store instructions for stride-one index, or vector gather/scatter fetch instructions for stride- index, in a favorable memory access pattern.
In Figure 2(b), assuming , , and , since the regular strides for each 4 elements of and are 1 and 4, respectively.
3 Strassen for Tensor Contraction
The operations summarized in 2.2 are all special cases of
[TABLE]
for appropriately chosen . Here, and are submatrices of , and are submatrices of , and and are submatrices of the original . As in [7], this scheme can be extended to multiple levels of Strassen.
Instead of partitioning the tensor into subtensors and and so on for and , we partition the matrix representations , , and as in the matrix implementation of Strassen. Figure 2 provides an example to illustrate the partition mechanism. Block scatter matrix layouts for these submatrices may be trivially obtained by partitioning the scatter- and block scatter vectors of the entire matrices along the relevant dimensions. Once imbued with the appropriate layouts, these submatrices may then be used in the BLIS-based Strassen of [7] along with modifications the the packing kernels and micro-kernel as in [5].
In fusing these two methodologies, we need to further address the consideration of multiple block scatter vectors as required when packing and executing the micro-kernel. Methods for dealing with this issue are described in 4.1. The advantage of using matrix partitions (which is enabled by the block scatter layout) instead of tensor partitions is primarily that only the product of the lengths of each index bundle, , , , must be considered when partitioning, and not the lengths of individual tensor dimensions. For example, Strassen may be applied to any tensor contraction where at least one dimension in each bundle is even in our approach, whereas the dimension (or rather, the dimension with the longest stride) must be even when using subtensors.111A dimension other than the last could also be chosen for partitioning, but the spatial locality of the partitioning would be destroyed. Additionally, when applying methods for performing Strassen on odd-length matrices to tensors, such as dynamical peeling as in [7] or zero-padding, the overhead is larger for subtensors since a single dimension must be padded or peeled rather than the entire index bundle.
4 Implementations
The modifications to the block scatter matrix-based packing kernel and micro-kernel as described in [5] for Strassen are detailed.
4.1 Packing
When packing submatrices for Strassen using (4), multiple scatter- and block scatter vectors must be considered. In our initial implementation, the block scatter vector entries for the corresponding block in both input submatrices (or all submatrices for -level Strassen) are examined. If all entries are non-zero, then the constant stride is used in packing the current block.222Note that when non-zero, the block scatter vector entries for different submatrices will always be equal. Otherwise, the scatter vectors are used when packing the current block, even though one or more of the input submatrix blocks may in fact have a regular stride. In future work, we plan to exploit these cases for further performance improvements.
In addition to the ABC Strassen algorithm, we also implement the AB Strassen and Naive Strassen algorithms of [7] for tensor contraction. In the AB Strassen algorithm, intermediate submatrices are explicitly stored and then accumulated into submatrices of . We store the submatrices as regular, densely-stored matrices, and handle their accumulation onto block scatter matrix layout submatrices of using an adapted version of the Strassen block scatter matrix packing kernel. In the Naive Strassen algorithm, submatrices of and are also explicitly copied using a modified packing kernel and stored as regular submatrices. Thus, the Naive Strassen algorithm for tensor contraction is extremely similar to a ttdt-based Strassen algorithm (see 1), except that the tensors are not required to be partitioned into regular subtensors.
4.2 Micro-kernel
As in [7], we use assembly-coded micro-kernels that include the update to several submatrices of from registers. In order to use this efficient update, block scatter vector entries for the relevant submatrix blocks of must be non-zero. Unlike in the packing kernel implementation, the case where only one or more of the submatrix blocks is regular stride would be more difficult to take advantage of, as the micro-kernel would have to be modified to flexibly omit or redirect individual submatrix updates.
5 Performance Model
In [7], a performance model was proposed to predict the execution time for variations of Strassen for matrices. In this section, we extend that performance model to estimate the execution time of ABC, AB and Naive variations of -level Strassen for TC and the high-performance TC routine we build on (see 2.3; using TBLIS implementation [5, 26] introduced in 6; denoted as tblis henceforth). Due to the high dimensionality of tensors and enormous types and combinations of permutations (transpositions) in TC, it is impractical to exhaustively search for every tensor shape and tensor problem size to find the best variation. Performance modeling helps us to better understand the memory footprint and computation of different Strassen implementations for TC, and at least reduce the search space to pick the right implementation. In our model, besides input problem size, block sizes, and the hardware parameters such as the peak GFLOPS and bandwidth, also depends on the shape of the tensors, and the extra permutations (transpositions) in the packing routines and in the micro-kernel.
Notations. We summarize our notations in Figure 3. The total execution time, , can be decomposed of arithmetic time and memory time (\raisebox{-0.9pt}{2}⃝ in Figure 4).
Arithmetic operations. As shown in \raisebox{-0.9pt}{3}⃝, includes (sub)tensor contraction () and (sub)tensor additions/permutations (, , ). The corresponding coefficients for tblis TC and -level various Strassen TC are enumerated in Figure 4. Note that is calculated by multiplying the unit time with the arithmetic operation number in the middle table of Figure 4. We compute through \raisebox{-0.9pt}{5}⃝. The penalty factor is introduced, due to the extra computations involved in ///, and the slow micro-kernel invocation when the corresponding entries in or are [math] (see 4.2; non-regular stride access). We penalize the performance drops caused by these factors by setting .
Memory operations. Similar to [7], we assume two layers of modern memory hierarchy: slow main memory and fast caches. For write operations, the lazy write-back policy is enforced such that the time for writing into fast caches can be hidden. For read operations, the latency for accessing the slow main memory is counted, while the latency for accessing caches can be ignored. With these assumptions, can be broken down into three parts (\raisebox{-0.9pt}{4}⃝ in Figure 4): updating the temporary buffer that are parts of Naive Strassen/AB Strassen (); memory packing shown in Figure 1 ( , ) ; updating the submatrices of shown in Figure 1 (). The coefficients are tabulated in Figure 4. is a function of block sizes in Figure 1, and the bundle lengths because the memory operation can repeat multiple times according to which loop they reside in. Figure 4(middle) characterizes each memory operation term by its read/write type and the amount of memory in units of 64-bit double precision elements. , are omitted in \raisebox{-0.9pt}{4}⃝ due to the lazy write-back policy assumption. Because of the software prefetching effects, there is an extra parameter for , which denotes the prefetching efficiency. In order to get , the memory operation number needs to be multiplied by the bandwidth . We compute through \raisebox{-0.9pt}{6}⃝. We penalize the effect of permutations without stride-one index accesss (see 4.1; the corresponding entries in neither or are 1, i.e. using scatter/gather operation, or indirect memory addressing with (3)) by setting . A similar parameter is introduced in [6] for regular TC.
Discussion We can estimate the run time performance of various implementations, based on the performance model presented in Figure 4. Here we define Effective GFLOPS (\raisebox{-0.9pt}{1}⃝ in Figure 4) for TC as the metric to compare the performance of various Strassen TC and tblis TC. The theoretical peak GFLOPS and bandwidth information is given in 6. In Figure 5(left), we demonstrate the modeled and actual performance for a wide range of synthetic tensor sizes and shapes: ; , varies; , vary. How we generate synthetic data is detailed in 6.
- •
For , the ABC Strassen/AB Strassen implementations outperform tblis, when , , are as small as , nearly 500; while Naive Strassen cannot beat tblis until the problem size is larger than 2000.
- •
The “, varies” graph shows that when is small, ABC Strassen performs best; when is large, AB Strassen performs better. The coefficients in Figure 4(bottom) help to illustrate the reasons quantitatively.
- •
According to the model, when is equal to appropriate multiple of ( for -level), ABC Strassen achieves the best performance. We will leverage this observation in our distributed memory experiment.
6 Experiments
We perform our experimental evaluations for synthetic data and real-world benchmarks on a single node and on a distributed memory architecture. The implementations are written in C++, utilizing AVX assembly, based on the open source TBLIS framework [26]. We compare against TBLIS’s tensor contraction routine (marked as tblis) as well as the TTT routine from MATLAB Tensor Toolbox [27] (linked with Intel MKL [28], marked as ttt) for single node, and tensor contraction routine from the Cyclops Tensor Framework [29] (also linked with Intel MKL, marked with ctf) for distributed memory.
We measure the CPU performance results on the Maverick system at the Texas Advanced Computing Center (TACC). Each node of that system consists of a dual-socket (10 cores/socket) Intel Xeon E5-2680 v2 (Ivy Bridge) processors with 256 GB memory (peak bandwidth: 59.7 GB/s with four channels) and a three-level cache (32 KB L1 data; 256 KB L2; 25.6 MB L3). The stable CPU clockrate is 3.54 GHz when a single core is utilized (28.32 GFLOPS peak, marked in the graphs) and 3.10 GHz when all ten cores are in use (24.8 GFLOPS/core peak). We disable hyper-threading explicitly and set thread affinity with KMP_AFFINITY=compact which also ensures the computation and the memory allocation all reside on the same socket.
The cache blocking parameters, , , , and the register block sizes, , , are consistent with parameters used for the standard BLIS dgemm implementation for this architecture. We use the default value of as defined in TBLIS. This makes the size of the packing buffer 192 KB and 8192 KB, which then fit the L2 cache and L3 cache, respectively. Parallelization is implemented mirroring that described in [4], but with the number of threads assigned to each of the loops in Figure 1 automatically determined by the TBLIS framework.
6.1 Single node experiments
Synthetic tensor contractions. To evaluate the overall performance of various Strassen TC comparing against tblis TC for different tensor problem sizes, shapes, and permutations, we randomly generate TC test cases with 2-D to 6-D randomly permuted tensors as operands, and test all these implementations for each synthetic test case. We choose step size to sample uniformly for various tensor bundle lengths: square: ; rank-: , varies; fixed-: , vary. For each bundle length , we randomly generate three 1-D, 2-D, or 3-D bundles, such that the product of each index length is close to . The order of is then randomly permuted.
The generated bundle lengths may not exactly match the original sampled bundle lengths. When we plot the actual performance of these synthetic test cases, we set for the square bundle lengths; for rank- bundle lengths; for fixed- bundle lengths.
For the square and rank- tensor shapes on one core, tblis is rapidly outpaced by ABC Strassen, with a crossover point of about . ABC Strassen is then shortly overtaken by AB Strassen and then by two-level AB Strassen. As predicted by the performance model, the AB Strassen implementation is best for very large problem sizes due to repeated updates to in the ABC Strassen algorithm. The Naive Strassen implementations are never the best in these experiments, although they may become more efficient than AB Strassen for extremely large, square problems. These trends are repeated in the ten-core experiments, although the crossover points are moved to larger tensor sizes.
For the fixed- shapes, total performance is lower for AB Strassen and Naive Strassen with scalability for the algorithms being especially impacted by the relatively smaller and sizes. For these shapes ABC Strassen is always the fastest method above the crossover point with standard tblis.
The actual performance data matches the predicted performance very well, with some variation due to the randomization of the tensor lengths and permutations. Using these performance models, it may be possible to analytically decide on which algorithm to apply for a given tensor contraction to achieve the highest performance, allowing an automated and seamless inclusion of Strassen into a TBLIS-like tensor framework.
Real-world benchmark. In Figure 6, we measure the performance of various implementations for a subset of tensor contractions from the Tensor Contraction Benchmark [30] on single core and one socket. We present representative use cases where is nearly equal to or larger than (512), for which Strassen can show performance benefits, as illustrated in 5. The right three test cases represent various regularly-blocked tensor contractions from coupled cluster with single and double excitations (CCSD) [31, 32, 33], a workhorse quantum chemistry computational method. The fourth case from the right illustrates the performance of tblis and Strassen TC for a pure matrix case. Comparing this case and the CCSD contractions highlights some of the performance issues that exist in the current implementation of the packing and matrix-to-block scatter matrix copy kernels (see 4.1 for details). On one core, all Strassen implementations improve on tblis for these right four cases, and in parallel one-level Strassen implementations give a speedup as well, exceeding ttt performance especially in the case of AB Strassen. The gap between tblis and ttt for these contractions is due to ttt’s use of Intel’s MKL library, which is more highly optimized than the BLIS/TBLIS framework.
The left two benchmarks are again quantum chemistry applications using 3-D tensors that arise in density-fitting (DF) calculations [34, 35]. These contractions are also structurally equivalent to certain contractions from the coupled cluster with perturbative triples, CCSD(T), method [36], where the occupied (see 6.2) indices have been sliced. These cases show the improvement of tblis over ttt as noted in [5], but do not show a speedup from Strassen except for one-level ABC Strassen on one core. Our Strassen implementation performs the submatrix multiplications sequentially, with only parallelization of each submatrix multiplication step. A more comprehensive parallelization scheme, for example using task-based parallelism [9], may show better performance. Additionally, since the DF/CCSD(T) contractions are highly “non-square”, an alternate fast matrix multiplication algorithm [9, 8] may perform better.
Shape-dependence experiments. The performance of the “particle-particle ladder” contraction from CCSD, is reported for a range of tensor shapes in Figure 7. In these experiments, the length of the virtual dimensions is varied with respect to the length of the occupied dimensions such that the total number of FLOPs is roughly similarly to a matrix multiplication, and the ratio is used as a proxy for tensor shape. A ratio of 1:1 would reflect an extremely poor quality of basis set for the overall calculation, but is common when the calculation employs regular blocking. The other end of the scale, with a ratio of , would then correspond to uneven blocking. This type of blocking allows for better load balancing and lower overhead when and are very unequal in the overall calculation.
The performance of tblis and all of the one-level Strassen algorithms show essentially no performance degradation across the entire range tested. The two-level Strassen algorithms show some performance degradation at larger ratios, but still show improvement over tblis. Eventually, all Strassen algorithms will cross over and perform worse than tblis, as evidenced by the left two contractions in Figure 6 (these correspond to a ratio of about 22). However, the good performance of Strassen out to reasonably large ratios shows that it could be beneficial in both regular blocking and uneven blocking scenarios.
6.2 Distributed memory experiments
We demonstrate how to use the Strassen TC implementations to accelerate a distributed memory implementation of 4-D tensor contraction that exemplifies the two-particle “ring” terms from CCSD. In our tests we set the length of virtual indices () to that of occupied indices (), which approximates the use of a triple- guality basis set. The problem sizes tested here correspond to calculations on systems with 80, 112, 160, 192, and 224 electrons. We use as a demonstration example to show the performance benefit.
We implement a SUMMA-like[37] algorithm for 4-D tensor contraction with MPI. Initially the tensors , , and are distributed to a mesh of MPI processes using a 2D block distribution over the , , and dimensions, with the , , and dimensions stored locally (i.e. not distributed). After slicing and along the dimension, the contraction is broken down into a sequence of contractions of tensor slice pairs,
[TABLE]
such that the index length for each tensor slice pairs is . For each tensor slice pairs, is broadcast within rows of the mesh, and is broadcast within columns of the mesh. Then a local tensor contration for received tensor slice pairs is performed to update the local block. Here tblis TC and various Strassen TC are used as a drop-in replacement for this local tensor contraction.
We perform the distributed memory experiment on the same machine as the single node experiment. The dual-socket processor has ten cores on each socket. We run one MPI process for each socket, and leverage all ten cores in a socket with thread parallelism for all implementations. Figure 8 reports the weak scalability performance result on up to 640 cores (32 nodes, 64 sockets).
In our experiments on mesh of sockets (MPI processes), the lengths of virtual indices are set to equal and the lengths of occupied indices are set to equal , which make . This guarantees the local memory buffer allocated to , , is constant. Our experiments verify that the above SUMMA-like algorithm is weakly scalable on this constant local memory setup, regardless of which local TC implementation we use. The local index length is chosen close to such that the local TC computations are performed with . The tensor slice pairs in the local TC computations matches the shape when ABC Strassen achieves the best performance. Therefore, the one-level and two-level ABC Strassen implementations outperform all other implementations.
We also tested the Cyclops Tensor Framework (CTF) [29] which also uses a SUMMA or nested SUMMA algorithm but with possibly different block sizes and tensor distributions, as well as using the ttdt algorithm for local tensor contractions. We show it here as a reference for state-of-the-art performance.
7 Conclusions
We have presented what we believe to be the first paper to demonstrate how to leverage Strassen’s algorithm for tensor contraction, and have shown practical performance speedup on single core, multicore, and distributed memory implementations. Using a block scatter matrix layout enables us to partition the matrix view of the tensor, instead of the tensor itself, with automatic (implicit) tensor-to-matrix transformation, and the flexibility to facilitate Strassen’s 2D matrix partition to multi-dimensional tensor spaces. Fusing the matrix summation that must be performed for Strassen and the transposition that must be conducted for tensor contraction with the packing and micro-kernel operations inside high-performance implementation of GEMM avoids extra workspace requirements, and reduces the cost of additional memory movement. We provided a performance model which can accurately predict the speedup of the resulting family of algorithms for different tensor shapes, sizes, and permutations. We evaluated our families of implementations for various tensor sizes and shapes on synthetic and real-world datasets, both observing significant speedups comparing to the baseline (tblis) and naive implementations (Naive Strassen), particularly for smaller problem sizes (), and irregular shape ( is much smaller comparing to , ). Together, this work demonstrates Strassen’s algorithm can be applied for tensor contraction with practical performance benefit.
There are several avenues for future work:
- •
Higher-level tensor decomposition algorithms [38], such as Tucker decomposition, involve heavy use of tensor contraction. The impact of our performance improvements with Strassen’s algorithm for those algorithms is an interesting question. It may be possible to leverage our performance model to determine the best implementation for the tensor shape these algorithms require.
- •
So far, we target dense tensor contraction, which has numerous applications. However, the structure of the tensor operands may be symmetric [39] or sparse [40], which yields a number of new challenges, like more efficient storage or layout format. How to explore those structure patterns and combine with Strassen’s algorithm can be investigated.
- •
More levels of Strassen’s algorithm may lose precision due to numerical instability issues. It may be possible to combine with the techniques proposed in Extended and Mixed Precision BLAS [41] to get higher speedup and maintain precision.
- •
A number of recent papers explore practical implementations of Strassen-like fast matrix multiplications [9, 8]. How to extend fast matrix multiplication with different partition block sizes for tensor contraction is an open question.
Additional information
Additional information regarding BLIS and related projects can be found at
Acknowledgments
This work was sponsored in part by the National Science Foundation under grant number ACI-1550493, by Intel Corporation through an Intel Parallel Computing Center grant, and by a gift from Qualcomm. Access to the Maverick supercomputers administered by TACC is gratefully acknowledged. DAM is an Arnold O. Beckman Postdoctoral Fellow. We thank Martin Schatz for his help with distributed memory implementations, and the rest of the SHPC team (http://shpc.ices.utexas.edu) for their supports.
*Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. *
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] K. Goto and R. A. van de Geijn, “Anatomy of a high-performance matrix multiplication,” ACM Trans. Math. Soft. , vol. 34, no. 3, p. 12, May 2008.
- 2[2] F. G. Van Zee and R. A. van de Geijn, “BLIS: A framework for rapidly instantiating BLAS functionality,” ACM Trans. Math. Soft. , vol. 41, no. 3, pp. 14:1–14:33, June 2015.
- 3[3] F. G. Van Zee, T. Smith, F. D. Igual, M. Smelyanskiy, X. Zhang, M. Kistler, V. Austel, J. Gunnels, T. M. Low, B. Marker, L. Killough, and R. A. van de Geijn, “The BLIS framework: Experiments in portability,” ACM Transactions on Mathematical Software , vol. 42, no. 2, pp. 12:1–12:19, June 2016.
- 4[4] T. M. Smith, R. A. van de Geijn, M. Smelyanskiy, J. R. Hammond, and F. G. Van Zee, “Anatomy of high-performance many-threaded matrix multiplication,” in 28th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2014) , 2014.
- 5[5] D. A. Matthews, “High-performance tensor contraction without transposition,” Co RR , vol. abs/1607.00291, 2016.
- 6[6] P. Springer and P. Bientinesi, “Design of a high-performance gemm-like tensor-tensor multiplication,” Co RR , vol. abs/1607.00145, 2016.
- 7[7] J. Huang, T. M. Smith, G. M. Henry, and R. A. van de Geijn, “Strassen’s algorithm reloaded,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 16) . IEEE Press, 2016, pp. 59:1–59:12.
- 8[8] J. Huang, L. Rice, D. A. Matthews, and R. van de Geijn, “Generating families of practical fast matrix multiplication algorithms,” in 31th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2017) , 2017.
