TL;DR
This paper extends the HPX C++ runtime system to seamlessly integrate CUDA GPU processing, enabling asynchronous data transfers and kernel launches across distributed systems for improved resource utilization.
Contribution
It introduces a novel integration of CUDA within HPX, allowing asynchronous GPU operations to be managed within the HPX execution model for distributed applications.
Findings
Asynchronous GPU data transfers and kernel launches are effectively integrated into HPX.
The approach enables full utilization of local and remote GPUs in distributed systems.
Overhead measurements show that the integration adds no computational overhead.
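The pattern behind these findings (an asynchronous transfer, then an asynchronous kernel launch, overlapped with work on the CPU cores) can be illustrated with a minimal sketch in standard C++ futures. This is not the HPX or CUDA API: `async_copy_to_device`, `async_scale_kernel`, `cpu_sum`, `PipelineResult`, and `run_pipeline` are hypothetical names, and the "device" steps are host-side stand-ins running on other threads. It only shows how the operations could be chained so that the CPU keeps working until the result is needed.

```cpp
#include <cstddef>
#include <future>
#include <utility>
#include <vector>

// Hypothetical stand-in for an asynchronous host-to-device transfer.
// Assumption: a real implementation would issue a stream-ordered copy;
// here the data is only handed to another thread and returned.
std::future<std::vector<float>> async_copy_to_device(std::vector<float> host)
{
    return std::async(std::launch::async,
        [h = std::move(host)]() mutable { return std::move(h); });
}

// Hypothetical stand-in for an asynchronous kernel launch.
// It depends on the transfer future, so it starts computing only once
// the "device" data is available, then scales every element.
std::future<std::vector<float>> async_scale_kernel(
    std::future<std::vector<float>> input, float factor)
{
    return std::async(std::launch::async,
        [in = std::move(input), factor]() mutable {
            std::vector<float> data = in.get();  // wait for the transfer
            for (float& x : data) x *= factor;
            return data;
        });
}

// Independent CPU work that runs while the transfer and kernel are in flight.
double cpu_sum(std::size_t n)
{
    double sum = 0.0;
    for (std::size_t i = 1; i <= n; ++i) sum += static_cast<double>(i);
    return sum;
}

struct PipelineResult
{
    std::vector<float> gpu_output;
    double cpu_output;
};

// Chains transfer -> kernel as futures, overlaps them with CPU work,
// and blocks only at the point where the "GPU" result is required.
PipelineResult run_pipeline(std::vector<float> input, float factor, std::size_t n)
{
    auto on_device = async_copy_to_device(std::move(input));
    auto result = async_scale_kernel(std::move(on_device), factor);
    double cpu = cpu_sum(n);  // executes concurrently with the two steps above
    return PipelineResult{result.get(), cpu};
}
```

Calling `run_pipeline({1.0f, 2.0f, 3.0f}, 2.0f, 4)` doubles each input element on the stand-in device path while the calling thread sums 1 through 4. The design point is that the caller never blocks between the copy and the kernel: the dependency is carried by the future itself.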
Abstract
Experience shows that on today's high-performance systems it is difficult to utilize different acceleration cards while also keeping all other parts of the system highly utilized. Future architectures, like exascale clusters, are expected to aggravate this issue as the number of cores is expected to increase and memory hierarchies are expected to become deeper. A key concern for distributed applications is guaranteeing high utilization of all available resources, including local and remote acceleration cards on a cluster, while fully using all available CPU resources and integrating the GPU work into the overall programming model. To integrate CUDA code, we extended HPX, a general-purpose C++ runtime system for parallel and distributed applications of any scale, and enabled asynchronous data transfers from and to the GPU device and the asynchronous…
