Performance Portability Strategies for Grid C++ Expression Templates

Peter A. Boyle; M. A. Clark; Carleton DeTar; Meifeng Lin; Verinder Rana; Alejandro Vaquero Avilés-Casco

arXiv:1710.09409 · hep-lat · April 18, 2018

TL;DR

This paper explores strategies for making the Grid C++ expression templates performance-portable across architectures. It focuses on GPU offloading with CUDA, OpenACC, and Just-In-Time (JIT) compilation, and reports both the successes and the issues encountered.

Contribution

It presents new GPU offloading strategies for Grid C++ expression templates, including experimental results and analysis of using CUDA, OpenACC, and JIT compilation.
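To make the term concrete, here is a minimal sketch of the expression-template idea, assuming a toy one-double-per-site field. The names `Field` and `AddExpr` are hypothetical and are not Grid's actual classes. The sketch only illustrates that an expression such as `a + b + c` builds a lightweight tree, and that the whole tree is evaluated in one fused loop at assignment.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical names throughout; this is NOT Grid's actual API.

// Expression node for the element-wise sum of two operands.
// It stores references only, so building it costs no arithmetic.
template <typename L, typename R>
struct AddExpr {
    const L& lhs;
    const R& rhs;
    double operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
    std::size_t size() const { return lhs.size(); }
};

// A toy "lattice" field: one double per site.
struct Field {
    std::vector<double> data;
    explicit Field(std::size_t n, double v = 0.0) : data(n, v) {}
    double operator[](std::size_t i) const { return data[i]; }
    std::size_t size() const { return data.size(); }

    // Assigning from an expression evaluates the whole tree in one loop.
    // Assumption: in a design like this, this loop is the single place
    // where an offload mechanism (a pragma or a kernel launch) would go.
    template <typename E>
    Field& operator=(const E& expr) {
        assert(expr.size() == size());
        for (std::size_t i = 0; i < size(); ++i) data[i] = expr[i];
        return *this;
    }
};

// Field + Field yields an expression node, not a temporary Field.
inline AddExpr<Field, Field> operator+(const Field& a, const Field& b) {
    return {a, b};
}

// (expression) + Field extends the tree by one node.
template <typename L, typename R>
AddExpr<AddExpr<L, R>, Field> operator+(const AddExpr<L, R>& a, const Field& b) {
    return {a, b};
}
```

With this sketch, `r = a + b + c;` runs a single loop over the sites and allocates no intermediate fields. The expression nodes hold references, so an expression should be consumed within the statement that builds it.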

Findings

01. Successful GPU offloading with CUDA and OpenACC
02. Challenges with OpenMP 4.x for GPU offloading
03. Performance results on GPUs with SU(3)×SU(3) streaming test

Abstract

One of the key requirements for the Lattice QCD Application Development as part of the US Exascale Computing Project is performance portability across multiple architectures. Using the Grid C++ expression template as a starting point, we report on the progress made with regard to the Grid GPU offloading strategies. We present both the successes and issues encountered in using CUDA, OpenACC and Just-In-Time compilation. Experimentation and performance on GPUs with an SU(3)×SU(3) streaming test will be reported. We will also report on the challenges of using current OpenMP 4.x for GPU offloading in the same code.
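As an illustration of what an SU(3)×SU(3) streaming test measures, the following is a minimal CPU-only sketch. It assumes one 3×3 complex matrix per lattice site and computes z = x · y site by site. The names `SU3`, `multiply`, and `su3_stream` are hypothetical, and this is not the benchmark code used in the paper. The kernel reads two matrices and writes one per site, so it is typically limited by memory bandwidth rather than arithmetic.

```cpp
#include <array>
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical names; a plain serial sketch, not the paper's benchmark.
using Complex = std::complex<double>;
using SU3 = std::array<std::array<Complex, 3>, 3>;  // one 3x3 complex matrix

// Dense 3x3 complex matrix product c = a * b.
SU3 multiply(const SU3& a, const SU3& b) {
    SU3 c{};  // value-initialized: all entries start at zero
    for (std::size_t i = 0; i < 3; ++i)
        for (std::size_t j = 0; j < 3; ++j)
            for (std::size_t k = 0; k < 3; ++k)
                c[i][j] += a[i][k] * b[k][j];
    return c;
}

// Streaming kernel: z[s] = x[s] * y[s] independently at every site s.
// Assumption: an offloaded version would map this site loop onto GPU
// threads; here it is an ordinary loop.
void su3_stream(const std::vector<SU3>& x,
                const std::vector<SU3>& y,
                std::vector<SU3>& z) {
    assert(x.size() == y.size() && y.size() == z.size());
    for (std::size_t s = 0; s < x.size(); ++s)
        z[s] = multiply(x[s], y[s]);
}

// Memory traffic per site: two matrices read, one matrix written.
constexpr std::size_t bytes_per_site() { return 3 * sizeof(SU3); }
```

Timing `su3_stream` over a large number of sites and dividing `bytes_per_site()` times the site count by the elapsed time gives an effective bandwidth, which is the usual figure of merit for a kernel of this kind.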
