Three Dirac operators on two architectures with one piece of code and no hassle
Stephan Durr

TL;DR
This paper presents a straightforward implementation of three Dirac operator discretizations (staggered, Wilson, Brillouin) on two architectures (KNL and Core i7) using a high-level compiler with OpenMP and SIMD pragmas, reaching high performance without cache-line optimization or assembly tuning.
Contribution
It introduces a simple, portable approach to implement multiple Dirac operators with high efficiency across architectures using high-level compiler directives.
Findings
Achieved single-precision performance of 475, 345, and 790 Gflop/s for the staggered, Wilson, and Brillouin operators, respectively, on one KNL node (N_c=3, N_v=12).
Implemented three discretizations with a unified high-level approach.
Demonstrated portability and high performance without cache-line tuning.
Abstract
A simple-minded approach to implementing three discretizations of the Dirac operator (staggered, Wilson, Brillouin) on two architectures (KNL and Core i7) is presented. The idea is to use a high-level compiler along with OpenMP parallelization and SIMD pragmas, but to stay away from cache-line optimization and/or assembly tuning. The implementation is for N_v right-hand sides, and this extra index is used to fill the SIMD pipeline. On one KNL node, single-precision performance figures for N_c=3, N_v=12 read 475 Gflop/s, 345 Gflop/s, and 790 Gflop/s for the three discretization schemes, respectively.
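The idea of using the right-hand-side index to fill the SIMD pipeline can be illustrated with a minimal sketch in C. This is not the paper's code and not one of the three Dirac operators: it is a toy real-valued nearest-neighbor stencil on a periodic 1D lattice, chosen only to show the loop structure. The function name `apply_hopping`, the parameter `kappa`, and the flat storage layout (right-hand-side index running fastest) are assumptions made for illustration. The outer site loop is threaded with OpenMP, and the inner loop over the N_v right-hand sides carries the SIMD pragma.

```c
#include <assert.h>
#include <stddef.h>

#define NV 12 /* number of right-hand sides, as in N_v=12 from the abstract */

/* Toy operator (hypothetical, for illustration only):
 *   out[x][v] = in[x][v] - kappa * (in[x+1][v] + in[x-1][v])
 * on a periodic 1D lattice with nsites sites.
 * Storage is flat with the right-hand-side index v running fastest,
 * i.e. element (x, v) lives at index x * NV + v. */
void apply_hopping(size_t nsites, float kappa,
                   const float *restrict in, float *restrict out)
{
    assert(nsites > 0);

    /* Thread-level parallelism: each thread handles a range of sites. */
    #pragma omp parallel for
    for (size_t x = 0; x < nsites; ++x) {
        size_t xp = (x + 1) % nsites;          /* forward neighbor  */
        size_t xm = (x + nsites - 1) % nsites; /* backward neighbor */

        /* SIMD parallelism: the same arithmetic is applied to all NV
         * right-hand sides, which are contiguous in memory. */
        #pragma omp simd
        for (int v = 0; v < NV; ++v) {
            out[x * NV + v] = in[x * NV + v]
                            - kappa * (in[xp * NV + v] + in[xm * NV + v]);
        }
    }
}
```

With a compiler that supports OpenMP (for example `gcc -O3 -fopenmp`), both pragmas take effect; without OpenMP support they are ignored and the code still runs serially. Because the inner loop has unit stride and no dependence between right-hand sides, the compiler can vectorize it without any hand-written intrinsics, which is the point of the layout choice.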
Taxonomy
Topics: Medical Imaging Techniques and Applications · Quantum Chromodynamics and Particle Interactions · Particle physics theoretical and experimental studies
