Efficient implementation of the overlap operator on multi-GPUs
Andrei Alexandru, Michael Lujan, Craig Pelissier, Ben Gamari, Frank X. Lee

TL;DR
This paper presents efficient methods for implementing the overlap operator in lattice QCD calculations on multi-GPU systems, demonstrating significant performance advantages over CPU clusters.
Contribution
The paper introduces optimized implementation techniques for the overlap operator on multi-GPU architectures, addressing memory and parallelization challenges.
Findings
To match the GPU cluster's performance, the CPU cluster needs 20-30 times more CPU cores than the GPU cluster has GPUs
Efficient multi-GPU implementation reduces computational resources needed for lattice QCD
Demonstrates scalability of the overlap operator on GPU clusters
Abstract
Lattice QCD calculations were one of the first applications to show the potential of GPUs in the area of high performance computing. Our interest is to find ways to effectively use GPUs for lattice calculations using the overlap operator. The large memory footprint of these codes requires the use of multiple GPUs in parallel. In this paper we show the methods we used to implement this operator efficiently. We run our codes both on a GPU cluster and a CPU cluster with similar interconnects. We find that to match performance the CPU cluster requires 20-30 times more CPU cores than GPUs.
Taxonomy
Topics: Advanced Data Storage Technologies · Particle physics theoretical and experimental studies · Quantum Chromodynamics and Particle Interactions
