Scaling Deep Learning Computation over the Inter-Core Connected Intelligence Processor with T10
Yiqi Liu, Yuqi Xue, Yu Cheng, Lingxiao Ma, Ziming Miao, Jilong Xue, and Jian Huang

TL;DR
T10 is a novel deep learning compiler designed for inter-core connected AI chips, enabling efficient parallel computation and communication, resulting in significant performance gains and scalability improvements.
Contribution
T10 introduces a distributed tensor abstraction and a generalized compute-shift pattern to optimize tensor computations over inter-core connections in AI chips.
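The compute-shift idea can be illustrated with a small simulation. What follows is a minimal sketch, not T10's implementation: it assumes a matrix multiplication C = A × B split across cores arranged in a ring, where each core keeps a fixed block of rows of A and the blocks of B rotate from core to core. The function names `matmul` and `compute_shift_matmul` and the ring layout are hypothetical, chosen only for illustration, and the "cores" are plain Python lists rather than real hardware.

```python
def matmul(a, b):
    """Plain dense matrix multiply on lists of lists (reference result)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def compute_shift_matmul(a, b, num_cores):
    """Simulate C = A @ B on `num_cores` cores connected in a ring.

    Assumed layout (illustrative only): core p permanently holds one block
    of rows of A and starts with the p-th block of rows of B. In every step
    each core multiplies the matching column slice of its A rows by the B
    block it currently holds, then the B blocks shift one position around
    the ring.
    """
    m, k, n = len(a), len(b), len(b[0])
    assert m % num_cores == 0 and k % num_cores == 0
    rows, depth = m // num_cores, k // num_cores

    a_local = [a[p * rows:(p + 1) * rows] for p in range(num_cores)]
    # held[p] = (index of the K-block, the block itself) resident on core p
    held = [(p, b[p * depth:(p + 1) * depth]) for p in range(num_cores)]
    c_local = [[[0] * n for _ in range(rows)] for _ in range(num_cores)]

    for _ in range(num_cores):
        # Compute phase: every core uses only data in its own memory.
        for p in range(num_cores):
            kb, b_blk = held[p]
            a_slice = [row[kb * depth:(kb + 1) * depth] for row in a_local[p]]
            partial = matmul(a_slice, b_blk)
            for i in range(rows):
                for j in range(n):
                    c_local[p][i][j] += partial[i][j]
        # Shift phase: core p receives the block held by core p + 1.
        held = held[1:] + held[:1]

    # Concatenate the per-core row blocks of C in core order.
    return [row for block in c_local for row in block]
```

After `num_cores` steps every core has seen every block of B exactly once, so no core ever needs a full copy of B in its local memory; this is the trade-off between memory footprint and inter-core traffic that the sketch is meant to show.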
Findings
Up to 3.3× performance improvement over existing compilers.
Supports larger models with improved scalability.
Effectively reduces unnecessary inter-core communication.
Abstract
As AI chips incorporate numerous parallelized cores to scale deep learning (DL) computing, inter-core communication has recently been enabled by high-bandwidth, low-latency interconnect links on the chip (e.g., Graphcore IPU). This allows each core to directly access the fast scratchpad memory in other cores, which enables new parallel computing paradigms. However, because current DL compilers lack proper support for scalable inter-core connections, it is hard for developers to exploit the benefits of this new architecture. We present T10, the first DL compiler to exploit the inter-core communication bandwidth and distributed on-chip memory on AI chips. To formulate the computation and communication patterns of tensor operators in this new architecture, T10 introduces a distributed tensor abstraction, rTensor. T10 maps a DNN model to execution plans with a generalized…
