Stencil Computations on Tenstorrent Wormhole

Lorenzo Piarulli; Daniele De Sensi

arXiv:2605.07599·cs.DC·May 11, 2026

TL;DR

This paper evaluates the performance of 2D stencil computations on the Tenstorrent Wormhole AI accelerator, revealing current limitations and potential improvements for HPC workloads.

Contribution

The paper introduces two heterogeneous implementations of the 2D 5-point stencil on Wormhole, Axpy and MatMul, and analyzes their performance and energy efficiency against a CPU baseline.

Findings

01 The isolated Wormhole kernel is competitive with CPU execution

02 The Axpy implementation consumes less energy than the CPU baseline for large inputs

03 Profiling identifies key architectural and software bottlenecks

Abstract

As investment in AI-focused accelerators grows and their deployment in supercomputing facilities expands, understanding whether these architectures can efficiently support traditional scientific kernels is critical for the future of High-Performance Computing. We investigate the mapping of 2D 5-point stencil computations onto the Tenstorrent Wormhole, a RISC-V AI dataflow accelerator. We develop two heterogeneous implementations: Axpy, which decomposes the stencil into element-wise submatrix operations, and MatMul, which reformulates it as a matrix multiplication. While the CPU baseline remains 3x faster end-to-end, profiling reveals that the isolated Wormhole kernel is competitive with CPU execution, with the gap driven by PCIe transfers, device initialization, and host-side preprocessing. Despite slower runtime, Axpy achieves lower energy consumption than the CPU baseline for large…
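
To make the two formulations named in the abstract concrete, here is a minimal host-side sketch in plain Python. It is not the authors' Wormhole code and uses no Tenstorrent API. The function names, the weights `wc` and `wn`, and the copied-boundary convention are all assumptions chosen for illustration. The axpy variant accumulates five shifted submatrices with element-wise scale-and-add operations. The matmul variant uses one common reformulation, multiplying the grid by tridiagonal matrices on the left and right; the paper's actual MatMul mapping may differ.

```python
# Illustrative sketch only: a 5-point stencil on a 2D grid stored as nested lists.
# All names and weights here are hypothetical, not taken from the paper.

def stencil_reference(u, wc=0.0, wn=0.25):
    """Direct 5-point update of interior points; boundary values are copied."""
    n, m = len(u), len(u[0])
    out = [row[:] for row in u]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            out[i][j] = (wc * u[i][j]
                         + wn * (u[i - 1][j] + u[i + 1][j]
                                 + u[i][j - 1] + u[i][j + 1]))
    return out

def submatrix(u, r0, c0, rows, cols):
    """Return the rows x cols block of u whose top-left corner is (r0, c0)."""
    return [row[c0:c0 + cols] for row in u[r0:r0 + rows]]

def axpy(a, x, y):
    """Element-wise a * x + y on two equally sized matrices."""
    return [[a * xv + yv for xv, yv in zip(xr, yr)] for xr, yr in zip(x, y)]

def stencil_axpy(u, wc=0.0, wn=0.25):
    """Stencil as a sum of five weighted, shifted submatrices."""
    n, m = len(u), len(u[0])
    rows, cols = n - 2, m - 2
    acc = [[0.0] * cols for _ in range(rows)]
    # (top-left corner of the shifted block, weight): center, up, down, left, right
    shifts = [((1, 1), wc), ((0, 1), wn), ((2, 1), wn), ((1, 0), wn), ((1, 2), wn)]
    for (r0, c0), w in shifts:
        acc = axpy(w, submatrix(u, r0, c0, rows, cols), acc)
    out = [row[:] for row in u]
    for i in range(rows):
        out[i + 1][1:m - 1] = acc[i]
    return out

def matmul(a, b):
    """Naive dense matrix product of nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def tridiag(size, diag, off):
    """Square tridiagonal matrix with constant diagonal and off-diagonals."""
    return [[diag if i == j else off if abs(i - j) == 1 else 0.0
             for j in range(size)] for i in range(size)]

def stencil_matmul(u, wc=0.0, wn=0.25):
    """Stencil as T_n @ U + U @ T_m, evaluated on interior points only."""
    n, m = len(u), len(u[0])
    left = matmul(tridiag(n, wc / 2, wn), u)    # vertical neighbors + half the center
    right = matmul(u, tridiag(m, wc / 2, wn))   # horizontal neighbors + half the center
    out = [row[:] for row in u]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            out[i][j] = left[i][j] + right[i][j]
    return out
```

Calling all three functions on the same small grid should give matching interior values up to floating-point rounding. The matmul form does far more arithmetic than the stencil needs, which is the usual trade-off when targeting hardware built around matrix units.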
