Network-Accelerated Non-Contiguous Memory Transfers
Salvatore Di Girolamo, Konstantin Taranov, Andreas Kurth, Michael Schaffner, Timo Schneider, Jakub Beránek, Maciej Besta, Luca Benini, Duncan Roweth, Torsten Hoefler

TL;DR
This paper demonstrates that non-contiguous memory transfers in HPC applications can be offloaded to the NIC, achieving up to 10x higher unpack throughput and enabling truly zero-copy communications.
Contribution
It introduces a method to transparently offload non-contiguous memory transfers to NICs using sPIN, enabling network acceleration of MPI datatype processing.
Findings
Up to 10x speedup in unpack throughput for real applications.
Non-contiguous transfers are viable candidates for network acceleration.
Implementation of sPIN within a Portals 4 NIC SST model.
Abstract
Applications often communicate data that is non-contiguous in the send- or the receive-buffer, e.g., when exchanging a column of a matrix stored in row-major order. While non-contiguous transfers are well supported in HPC (e.g., MPI derived datatypes), they can still be up to 5x slower than contiguous transfers of the same size. As we enter the era of network acceleration, we need to investigate which tasks to offload to the NIC: In this work we argue that non-contiguous memory transfers can be transparently network-accelerated, truly achieving zero-copy communications. We implement and extend sPIN, a packet streaming processor, within a Portals 4 NIC SST model, and evaluate strategies for NIC-offloaded processing of MPI datatypes, ranging from datatype-specific handlers to general solutions for any MPI datatype. We demonstrate up to 10x speedup in the unpack throughput of real…
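To make the matrix-column example from the abstract concrete, here is a minimal Python sketch of a strided (vector-like) layout and of unpacking it one packet at a time. It is an illustration under stated assumptions, not the paper's sPIN handlers or the Portals 4 API: the `Vector` class, `pack`, and `unpack_packet` are hypothetical names, and the layout simply mirrors the count/block length/stride idea behind an MPI vector datatype. Each packet carries its offset in the packed stream, so it can be scattered directly into the receive buffer without an intermediate copy, regardless of arrival order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vector:
    """Hypothetical byte-level stand-in for an MPI vector datatype."""
    count: int     # number of blocks
    blocklen: int  # contiguous bytes per block
    stride: int    # bytes between the starts of consecutive blocks


def pack(dtype: Vector, send_buf: bytes, base: int = 0) -> bytes:
    """Gather the strided blocks into one contiguous byte stream (sender side)."""
    out = bytearray()
    for b in range(dtype.count):
        start = base + b * dtype.stride
        out += send_buf[start:start + dtype.blocklen]
    return bytes(out)


def unpack_packet(dtype: Vector, payload: bytes, stream_offset: int,
                  recv_buf: bytearray, base: int = 0) -> None:
    """Scatter one packet straight into its final location in recv_buf.

    stream_offset is the payload's position in the packed stream, so each
    packet is self-describing and packets may be processed in any order.
    """
    pos = 0
    while pos < len(payload):
        block, within = divmod(stream_offset + pos, dtype.blocklen)
        # Copy up to the end of the current block or of the payload.
        n = min(dtype.blocklen - within, len(payload) - pos)
        dst = base + block * dtype.stride + within
        recv_buf[dst:dst + n] = payload[pos:pos + n]
        pos += n


# Assumed example: 4x4 row-major matrix of 2-byte elements; send column 1.
ROWS, COLS, ELEM = 4, 4, 2
matrix = bytes(range(ROWS * COLS * ELEM))
column = Vector(count=ROWS, blocklen=ELEM, stride=COLS * ELEM)
base = 1 * ELEM  # byte offset of column 1 in the first row

packed = pack(column, matrix, base)

# Split the stream into 3-byte "packets" and deliver them out of order.
packets = [(off, packed[off:off + 3]) for off in range(0, len(packed), 3)]
recv = bytearray(len(matrix))
for off, payload in reversed(packets):
    unpack_packet(column, payload, off, recv, base)
```

The 3-byte packet size is deliberately not a multiple of the block length, so a payload can straddle two blocks; the loop in `unpack_packet` handles that by splitting the copy at block boundaries.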
