An MLIR Lowering Pipeline for Stencils at Wafer-Scale

Nicolai Stawinoga; David Katz; Anton Lydike; Justs Zarins; Nick Brown; George Bisbas; Tobias Grosser

arXiv:2601.17754 · cs.DC · January 27, 2026

Open Access

TL;DR

This paper introduces a compiler pipeline that automatically transforms stencil-based HPC kernels into optimized code for the Cerebras WSE, achieving significant performance gains without application code modifications.

Contribution

The paper presents a novel MLIR-based lowering pipeline that enables automatic targeting of the WSE for stencil computations, bridging the gap between mathematical models and hardware execution.

Findings

01 Performance on WSE3 is 14x faster than 128 Nvidia A100 GPUs.

02 Performance on WSE3 is 20x faster than 128 nodes of a CPU supercomputer.

03 The approach matches or exceeds manually optimized code performance.

Abstract

The Cerebras Wafer-Scale Engine (WSE) delivers performance at an unprecedented scale of over 900,000 compute units, all connected via a single-wafer on-chip interconnect. Initially designed for AI, the WSE architecture is also well-suited for High Performance Computing (HPC). However, its distributed asynchronous programming model diverges significantly from the simple sequential or bulk-synchronous programs that one would typically derive for a given mathematical program description. Targeting the WSE requires a bespoke re-implementation when porting existing code. The absence of WSE support in compilers such as MLIR meant that there was little hope for automating this process. Stencils are ubiquitous in HPC, and in this paper we explore the hypothesis that domain-specific information about stencils can be leveraged by the compiler to automatically target the WSE without requiring…

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Embedded Systems Design Techniques · Interconnection Networks and Systems