TEMP: A Memory Efficient Physical-aware Tensor Partition-Mapping Framework on Wafer-scale Chips

Huizheng Wang; Taiquan Wei; Zichuan Wang; Dingcheng Jiang; Qize Yang; Jiaxin Liu; Jingxiang Hou; Chao Li; Jinyi Deng; Yang Hu; Shouyi Yin

arXiv:2512.14256 · cs.AR · December 17, 2025

Open Access

TL;DR

TEMP is a framework for training large language models on wafer-scale chips. It optimizes tensor partitioning and mapping to work within the chips' hardware constraints, improving throughput by 1.7x on average over state-of-the-art systems.

Contribution

TEMP introduces the tensor stream partition paradigm (TSPP) and a topology-aware, traffic-conscious mapping approach tailored for wafer-scale chips, addressing their memory and communication challenges.

Findings

01. Achieves 1.7x average throughput improvement over state-of-the-art systems.

02. Effectively manages memory and communication bottlenecks in wafer-scale chip environments.

03. Demonstrates scalability across various large language models.

Abstract

Large language models (LLMs) demand significant memory and computation resources. Wafer-scale chips (WSCs) provide high computation power and die-to-die (D2D) bandwidth but face a unique trade-off between on-chip memory and compute resources due to limited wafer area. Tensor parallelism strategies for WSCs should therefore leverage these communication advantages while maintaining memory efficiency to maximize performance. Existing approaches, however, fail to address these challenges. We propose the tensor stream partition paradigm (TSPP), which reveals an opportunity to leverage WSCs' abundant communication bandwidth to alleviate stringent on-chip memory constraints. However, the 2D mesh topology of WSCs lacks long-distance and flexible interconnects, leading to three challenges: 1) severe tail latency, 2) prohibitive D2D traffic contention, and 3)…

Taxonomy

Topics: Interconnection Networks and Systems · Parallel Computing and Optimization Techniques · VLSI and FPGA Design Techniques