WaferLLM: Large Language Model Inference at Wafer Scale

Congjie He; Yeqi Huang; Pei Mu; Ziming Miao; Jilong Xue; Lingxiao Ma; Fan Yang; Luo Mai

arXiv:2502.04563 · cs.LG · June 2, 2025

TL;DR

WaferLLM is an LLM inference system designed for wafer-scale AI accelerators. It reports substantial speedups and efficiency gains over GPU-based systems.

Contribution

The paper introduces WaferLLM, which it describes as the first wafer-scale LLM inference system, with parallelism strategies and GEMM/GEMV implementations tailored to wafer-scale hardware.

Findings

01. Achieves up to 200× higher accelerator utilization than state-of-the-art methods.

02. Delivers GEMV operations 606× faster and 16× more energy-efficiently than on an NVIDIA A100 GPU.

03. Provides 10-20× speedups over GPU clusters for full LLM inference.

Abstract

Emerging AI accelerators increasingly adopt wafer-scale manufacturing technologies, integrating hundreds of thousands of AI cores in a mesh architecture with large distributed on-chip memory (tens of GB in total) and ultra-high on-chip memory bandwidth (tens of PB/s). However, current LLM inference systems, optimized for shared memory architectures like GPUs, fail to exploit these accelerators fully. We introduce WaferLLM, the first wafer-scale LLM inference system. WaferLLM is guided by a novel PLMR model (pronounced as "Plummer") that captures the unique hardware characteristics of wafer-scale architectures. Leveraging this model, WaferLLM pioneers wafer-scale LLM parallelism, optimizing the utilization of hundreds of thousands of on-chip cores. It also introduces MeshGEMM and MeshGEMV, the first GEMM and GEMV implementations designed to scale effectively on wafer-scale…
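To make the mesh setting in the abstract concrete, the following is a minimal sketch of a matrix-vector product (GEMV) whose matrix is tiled across a simulated 2D grid of cores, each holding only its own tile in local memory. It is a generic blocked GEMV with a per-row reduction, not the paper's MeshGEMV algorithm; the names `split` and `mesh_gemv` and the mesh dimensions are hypothetical and chosen purely for illustration.

```python
# Illustrative only: a generic blocked GEMV on a simulated 2D mesh of cores.
# This is NOT the paper's MeshGEMV; all names here are hypothetical.

def split(n, parts):
    """Return `parts` contiguous (start, stop) ranges that cover range(n)."""
    base, extra = divmod(n, parts)
    ranges, start = [], 0
    for p in range(parts):
        stop = start + base + (1 if p < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def mesh_gemv(A, x, mesh_rows, mesh_cols):
    """Compute y = A @ x with A tiled over a mesh_rows x mesh_cols grid."""
    m, n = len(A), len(x)
    row_ranges = split(m, mesh_rows)
    col_ranges = split(n, mesh_cols)
    y = [0.0] * m
    for r0, r1 in row_ranges:
        # Each simulated core in this mesh row owns one tile of A and the
        # matching slice of x, and computes a partial result locally.
        partials = []
        for c0, c1 in col_ranges:
            partial = [
                sum(A[i][j] * x[j] for j in range(c0, c1))
                for i in range(r0, r1)
            ]
            partials.append(partial)
        # Reduce the partial results along the mesh row. On real hardware
        # this step would be core-to-core communication; here it is a loop.
        for partial in partials:
            for k, value in enumerate(partial):
                y[r0 + k] += value
    return y


A = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12]]
x = [1, 0, -1, 2]
y = mesh_gemv(A, x, mesh_rows=2, mesh_cols=2)
print(y)
```

In this sketch the local multiply-accumulate is trivially parallel across cores, so the reduction along each mesh row is the step that involves communication between cores.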

Code & Models

Repositories

meshinfra/waferllm (Official)

Taxonomy

Topics: Advancements in Photolithography Techniques · Integrated Circuits and Semiconductor Failure Analysis · Silicon and Solar Cell Technologies

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · ADaptive gradient method with the OPTimal convergence rate