Stencil Computations on Tenstorrent Wormhole
Lorenzo Piarulli, Daniele De Sensi

TL;DR
This paper evaluates the performance of 2D stencil computations on the Tenstorrent Wormhole AI accelerator, revealing current limitations and potential improvements for HPC workloads.
Contribution
The paper introduces two heterogeneous implementations of stencil computations on Wormhole, Axpy and MatMul, and analyzes their performance and energy efficiency against a CPU baseline.
Findings
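The isolated Wormhole kernel is competitive with CPU execution, although the CPU baseline remains 3x faster end-to-end
The Axpy implementation consumes less energy than the CPU baseline for large inputs
Profiling attributes the end-to-end gap to PCIe transfers, device initialization, and host-side preprocessing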
Abstract
As investment in AI-focused accelerators grows and their deployment in supercomputing facilities expands, understanding whether these architectures can efficiently support traditional scientific kernels is critical for the future of High-Performance Computing. We investigate the mapping of 2D 5-point stencil computations onto the Tenstorrent Wormhole, a RISC-V AI dataflow accelerator. We develop two heterogeneous implementations: Axpy, which decomposes the stencil into element-wise submatrix operations, and MatMul, which reformulates it as a matrix multiplication. While the CPU baseline remains 3x faster end-to-end, profiling reveals that the isolated Wormhole kernel is competitive with CPU execution, with the gap driven by PCIe transfers, device initialization, and host-side preprocessing. Despite slower runtime, Axpy achieves lower energy consumption than the CPU baseline for large…
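To make the two formulations named in the abstract concrete, the following is a minimal host-side sketch in plain Python, not the authors' Wormhole code. It shows a 5-point stencil computed three ways: a direct loop, an Axpy-style accumulation over shifted submatrices, and a MatMul-style product of a gathered patch matrix with a weight column. The uniform weights of 0.2, the copied boundary values, the gather layout, and all function names are assumptions chosen for illustration; the paper's actual decomposition may differ.

```python
# Illustrative only: weights, boundary handling, and names are assumptions,
# not taken from the paper.

NEIGHBORS = ((-1, 0), (1, 0), (0, -1), (0, 1))  # up, down, left, right


def stencil_reference(grid, w_center=0.2, w_neighbor=0.2):
    """Direct 5-point stencil; boundary values are copied unchanged."""
    n, m = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            total = sum(grid[i + di][j + dj] for di, dj in NEIGHBORS)
            out[i][j] = w_center * grid[i][j] + w_neighbor * total
    return out


def submatrix(grid, di, dj):
    """Interior-sized copy of grid shifted by (di, dj)."""
    n, m = len(grid), len(grid[0])
    return [[grid[i + di][j + dj] for j in range(1, m - 1)]
            for i in range(1, n - 1)]


def axpy(a, x, y):
    """Element-wise a*x + y on equally sized matrices."""
    return [[a * xv + yv for xv, yv in zip(xr, yr)] for xr, yr in zip(x, y)]


def write_interior(grid, interior):
    """Copy grid and overwrite its interior with the given values."""
    m = len(grid[0])
    out = [row[:] for row in grid]
    for i, row in enumerate(interior, start=1):
        out[i][1:m - 1] = row
    return out


def stencil_axpy(grid, w_center=0.2, w_neighbor=0.2):
    """Stencil as five element-wise axpy steps over shifted submatrices."""
    n, m = len(grid), len(grid[0])
    acc = [[0.0] * (m - 2) for _ in range(n - 2)]
    acc = axpy(w_center, submatrix(grid, 0, 0), acc)
    for di, dj in NEIGHBORS:
        acc = axpy(w_neighbor, submatrix(grid, di, dj), acc)
    return write_interior(grid, acc)


def matmul(a, b):
    """Naive dense matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def stencil_matmul(grid, w_center=0.2, w_neighbor=0.2):
    """Stencil as (points x 5) patch matrix times a (5 x 1) weight column."""
    n, m = len(grid), len(grid[0])
    patches = [[grid[i][j]] + [grid[i + di][j + dj] for di, dj in NEIGHBORS]
               for i in range(1, n - 1) for j in range(1, m - 1)]
    weights = [[w_center]] + [[w_neighbor] for _ in NEIGHBORS]
    flat = [row[0] for row in matmul(patches, weights)]
    width = m - 2
    interior = [flat[r * width:(r + 1) * width] for r in range(n - 2)]
    return write_interior(grid, interior)
```

In this sketch the gather that builds `patches` is the kind of host-side preprocessing a matrix-multiplication formulation would need before the product itself can run, while the Axpy variant only needs shifted views of the same input.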
