TTC: A Tensor Transposition Compiler for Multiple Architectures
Paul Springer, Aravind Sankaran, Paolo Bientinesi

TL;DR
TTC is a domain-specific compiler that generates optimized parallel code for transposing tensors of arbitrary dimension on multiple architectures, significantly outperforming the code produced by general-purpose compilers.
Contribution
The paper introduces TTC, an open source parallel tensor transposition compiler that generates optimized C++/CUDA C code for several architectures, outperforms general-purpose C++ compilers, and handles tensors of arbitrary dimension.
Findings
TTC achieves speedups of up to 8x over Intel's C++ compiler on Haswell.
TTC achieves speedups of up to 32x over Intel's C++ compiler on Knights Corner.
TTC supports multiple leading dimensions for BLAS 3 routines.
Abstract
We consider the problem of transposing tensors of arbitrary dimension and describe TTC, an open source domain-specific parallel compiler. TTC generates optimized parallel C++/CUDA C code that achieves a significant fraction of the system's peak memory bandwidth. TTC exhibits high performance across multiple architectures, including modern AVX-based systems (e.g., Intel Haswell, AMD Steamroller), Intel's Knights Corner as well as different CUDA-based GPUs such as NVIDIA's Kepler and Maxwell architectures. We report speedups of TTC over a meaningful baseline implementation generated by external C++ compilers; the results suggest that a domain-specific compiler can outperform its general purpose counterpart significantly: For instance, comparing with Intel's latest C++ compiler on the Haswell and Knights Corner architectures, TTC yields speedups of up to 8x and 32x,…
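To make the operation concrete, here is a minimal sketch of an out-of-place tensor transposition, assuming dense column-major storage (first index fastest). The function name `transpose_reference` and the `perm` convention (output dimension `j` is input dimension `perm[j]`) are hypothetical choices for illustration; this is not code generated by TTC, nor the baseline used in the paper. It only shows the kind of strided loop nest a transposition compiler would aim to optimize.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Naive reference transposition (illustrative only, not TTC output).
// Assumptions: dense column-major layout; perm[j] names the input
// dimension that becomes output dimension j.
std::vector<float> transpose_reference(const std::vector<float>& in,
                                       const std::vector<std::size_t>& sizes,
                                       const std::vector<std::size_t>& perm) {
    const std::size_t dim = sizes.size();
    assert(perm.size() == dim);

    // Strides of the input tensor: stride of dimension d is the product
    // of the sizes of all faster-varying dimensions.
    std::vector<std::size_t> in_stride(dim, 1);
    for (std::size_t d = 1; d < dim; ++d)
        in_stride[d] = in_stride[d - 1] * sizes[d - 1];

    std::size_t total = 1;
    for (std::size_t s : sizes) total *= s;
    assert(in.size() == total);

    std::vector<float> out(total);
    std::vector<std::size_t> idx(dim, 0);  // multi-index into the output

    // Walk the output contiguously and gather from the input with strides.
    for (std::size_t lin = 0; lin < total; ++lin) {
        std::size_t in_off = 0;
        for (std::size_t j = 0; j < dim; ++j)
            in_off += idx[j] * in_stride[perm[j]];
        out[lin] = in[in_off];

        // Advance the output multi-index like an odometer.
        for (std::size_t j = 0; j < dim; ++j) {
            if (++idx[j] < sizes[perm[j]]) break;
            idx[j] = 0;
        }
    }
    return out;
}
```

Writes to the output are contiguous here while reads from the input are strided, so one side of the copy always has poor spatial locality. Reducing that cost through blocking, vectorization, and parallelization is the kind of optimization a domain-specific generator can apply per permutation and per architecture.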
