Easy Acceleration with Distributed Arrays

Jeremy Kepner; Chansup Byun; LaToya Anderson; William Arcand; David Bestor; William Bergeron; Alex Bonn; Daniel Burrill; Vijay Gadepally; Ryan Haney; Michael Houle; Matthew Hubbell; Hayden Jananthan; Michael Jones; Piotr Luszczek; Lauren Milechin; Guillermo Morales; Julie Mullen; Andrew Prout; Albert Reuther; Antonio Rosa; Charles Yee; Peter Michaleas

arXiv:2508.17493 · cs.DC · October 21, 2025

TL;DR

This paper demonstrates that distributed arrays let high-level programs scale across diverse hardware, achieving linear scaling across compute nodes and more than 1 PB/s of memory bandwidth on supercomputing infrastructure.

Contribution

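The paper provides an empirical evaluation of distributed array performance, measured with the STREAM memory bandwidth benchmark, across CPU cores, CPU nodes, and GPU nodes from multiple hardware generations, showing both how performance scales and how hardware has improved over time.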

Findings

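01 Horizontal scaling across multiple compute nodes was linear.

02 Achieved over 1 PB/s of sustained memory bandwidth on supercomputing infrastructure.

03 Documented memory bandwidth increases of 10x for CPU cores over 20 years, 100x for CPU nodes over 20 years, and 5x for GPU nodes over 5 years.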

Abstract

High level programming languages and GPU accelerators are powerful enablers for a wide range of applications. Achieving scalable vertical (within a compute node), horizontal (across compute nodes), and temporal (over different generations of hardware) performance while retaining productivity requires effective abstractions. Distributed arrays are one such abstraction that enables high level programming to achieve highly scalable performance. Distributed arrays achieve this performance by deriving parallelism from data locality, which naturally leads to high memory bandwidth efficiency. This paper explores distributed array performance using the STREAM memory bandwidth benchmark on a variety of hardware. Scalable performance is demonstrated within and across CPU cores, CPU nodes, and GPU nodes. Horizontal scaling across multiple nodes was linear. The hardware used spans decades and…
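The abstract's central idea, deriving parallelism from data locality, can be illustrated with a small sketch of the STREAM triad (a = b + q*c) in which each process touches only the block of the array it owns. This is not the paper's implementation: the block distribution, the function names `local_range` and `stream_triad_local`, and the sequential loop that stands in for concurrently running processes are all assumptions made for illustration. The byte count follows the standard STREAM convention of 24 bytes per triad element (two 8-byte reads and one 8-byte write).

```python
import time


def local_range(n, num_procs, pid):
    """Half-open index range [start, stop) owned by process `pid`.

    Assumes a simple block distribution of n elements over num_procs
    processes (hypothetical helper, not taken from the paper).
    """
    base, extra = divmod(n, num_procs)
    start = pid * base + min(pid, extra)
    stop = start + base + (1 if pid < extra else 0)
    return start, stop


def stream_triad_local(n, num_procs, pid, q=3.0):
    """Run the STREAM triad a = b + q*c on the locally owned block only.

    Each process allocates and updates just its own slice, so no data
    is exchanged between processes.
    """
    start, stop = local_range(n, num_procs, pid)
    m = stop - start
    b = [1.0] * m
    c = [2.0] * m
    t0 = time.perf_counter()
    a = [bi + q * ci for bi, ci in zip(b, c)]
    elapsed = time.perf_counter() - t0
    # STREAM convention: two 8-byte reads plus one 8-byte write per element.
    bytes_moved = 3 * 8 * m
    return a, bytes_moved, elapsed


if __name__ == "__main__":
    n, num_procs = 1_000_000, 4
    # Sequential stand-in for processes that would run concurrently.
    results = [stream_triad_local(n, num_procs, p) for p in range(num_procs)]
    total_bytes = sum(r[1] for r in results)
    slowest = max(r[2] for r in results)
    if slowest > 0:
        # Aggregate bandwidth if all blocks had been processed in parallel.
        print(f"simulated aggregate bandwidth: {total_bytes / slowest / 1e9:.2f} GB/s")
```

Because every process works on a disjoint block, the aggregate bandwidth in this model is the sum of the per-process bandwidths, which is the behaviour the abstract describes as linear horizontal scaling. Pure-Python lists are far slower than a real STREAM run, so the printed figure is only a demonstration of the accounting, not a meaningful measurement.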
