GICC: A High-Performance Runtime for GPU-Initiated Communication and Coordination in Modern HPC Systems
Baodi Shan, Mauricio Araya-Polo, Barbara Chapman

TL;DR
GICC is a high-performance runtime that enables GPU kernels to directly trigger NIC operations, reducing latency and improving scalability in modern HPC systems with OFI-based interconnects.
Contribution
GICC introduces GPU-initiated communication and asynchronous resource reclamation, enabling efficient, host-free coordination on OFI-based interconnects like Slingshot and InfiniBand.
Findings
GICC reduces per-coordination latency by up to 229x on Slingshot.
GICC improves weak scaling efficiency by up to 25%.
GICC achieves 42% parallel efficiency on an industrial stencil proxy.
Abstract
Distributed GPU applications increasingly rely on kernel-level, cross-node coordination to reduce launch overheads and improve compute-communication overlap, but such support is lacking. On OFI-based interconnects such as HPE Slingshot, which powers six of the top ten systems in the November 2025 Top500, including the top three, GPU kernels cannot autonomously drive distributed coordination: existing runtimes rely on host-driven progress and lack a bounded mechanism for recycling pre-staged NIC work across repeated GPU-triggered operations. On InfiniBand, GPU-initiated communication is possible, but current implementations incur unnecessary synchronization and locking overheads. This paper presents GICC, a framework that enables GPU kernels to directly trigger NIC-level operations without host involvement on the fast path. In stencils, GPU threads initiate halo exchanges as soon as…
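The abstract's point about a "bounded mechanism for recycling pre-staged NIC work" can be illustrated with a toy, host-side Python model. This is only a sketch under stated assumptions, not GICC's implementation: the class `PrestagedSlotPool`, the function `run_exchange`, and the parameter `reclaim_every` are hypothetical names invented for illustration, and no real NIC, OFI, or GPU API is involved. It shows why a fixed pool of pre-staged operations needs timely reclamation if repeated triggers are to proceed without stalling.

```python
from collections import deque


class PrestagedSlotPool:
    """Toy model (hypothetical): a fixed pool of pre-staged work slots."""

    def __init__(self, num_slots):
        self.free = deque(range(num_slots))  # staged and ready to be triggered
        self.in_flight = deque()             # triggered, not yet reclaimed
        self.triggered = 0                   # total successful triggers

    def trigger(self):
        """Fire one pre-staged operation; return its slot id, or None if none is free."""
        if not self.free:
            return None
        slot = self.free.popleft()
        self.in_flight.append(slot)
        self.triggered += 1
        return slot

    def reclaim(self, completed):
        """Re-stage up to `completed` finished slots so they can be triggered again."""
        n = min(completed, len(self.in_flight))
        for _ in range(n):
            self.free.append(self.in_flight.popleft())
        return n


def run_exchange(pool, iterations, reclaim_every):
    """Trigger one operation per iteration; count how often the trigger side stalls."""
    stalls = 0
    for step in range(1, iterations + 1):
        # If no slot is free, the trigger side must wait for a forced reclaim.
        while pool.trigger() is None:
            stalls += 1
            pool.reclaim(len(pool.in_flight))
        # Background reclamation, modeled here as a periodic sweep.
        if step % reclaim_every == 0:
            pool.reclaim(len(pool.in_flight))
    return stalls
```

In this model, calling `run_exchange(PrestagedSlotPool(4), 100, 2)` never stalls because slots are recycled faster than they are consumed, whereas a sweep interval longer than the pool size forces the trigger side to wait. The real system's reclamation policy and data structures may differ substantially.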
