Optimization and parallelization of B-spline based orbital evaluations in QMC on multi/many-core shared memory processors
Amrita Mathuriya, Ye Luo, Anouar Benali, Luke Shulenburger, Jeongnim Kim

TL;DR
This paper presents node-level optimizations and parallelization strategies for B-spline orbital evaluations in Quantum Monte Carlo simulations, significantly improving performance on various modern multi-core architectures.
Contribution
It introduces data layout transformation, cache-aware blocking, and nested threading techniques to enhance B-spline evaluations in QMC, enabling better scalability and efficiency.
Findings
Up to 10x performance improvement on CPUs and accelerators.
Nearly ideal parallel efficiency up to 16 threads on KNL.
Over 4.5x overall speedup of miniQMC application.
Abstract
B-spline based orbital representations are widely used in Quantum Monte Carlo (QMC) simulations of solids, historically taking as much as 50% of the total run time. Random accesses to a large four-dimensional array make it challenging to efficiently utilize caches and wide vector units of modern CPUs. We present node-level optimizations of B-spline evaluations on multi/many-core shared memory processors. To increase SIMD efficiency and bandwidth utilization, we first apply data layout transformation from array-of-structures to structure-of-arrays (SoA). Then by blocking SoA objects, we optimize cache reuse and get sustained throughput for a range of problem sizes. We implement efficient nested threading in B-spline orbital evaluation kernels, paving the way towards enabling strong scaling of QMC simulations. These optimizations are portable on four distinct cache-coherent architectures…
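Illustration
The layout and blocking ideas in the abstract can be sketched in a few lines of C++. This is a minimal sketch under stated assumptions, not the authors' kernel: the names `OrbitalsSoA`, `Weights`, and `evaluate_blocked` are hypothetical, the stencil rows are assumed to be already gathered from the coefficient array with the orbital index running fastest, and blocking is shown as simple tiling over the orbital index rather than as separate per-block coefficient tables.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical SoA output: one contiguous array per component, so the
// innermost loop over orbitals reads and writes unit-stride memory.
struct OrbitalsSoA {
    std::vector<double> value, grad_x, grad_y, grad_z;
    explicit OrbitalsSoA(std::size_t n)
        : value(n), grad_x(n), grad_y(n), grad_z(n) {}
};

// Hypothetical per-stencil-point weights (a tricubic B-spline would have
// 4 x 4 x 4 = 64 of these for each evaluation position).
struct Weights {
    double w, dx, dy, dz;
};

// coefs holds stencil.size() rows of n_orb coefficients, orbital index
// fastest. Orbitals are processed in tiles of `block`, so the output tile
// is revisited for every stencil point while it is still small enough to
// stay in cache (assumption: `block` is tuned to the cache size).
void evaluate_blocked(const std::vector<double>& coefs,
                      const std::vector<Weights>& stencil,
                      std::size_t n_orb, std::size_t block,
                      OrbitalsSoA& out) {
    assert(block > 0);
    assert(coefs.size() == stencil.size() * n_orb);
    for (std::size_t first = 0; first < n_orb; first += block) {
        const std::size_t last = std::min(first + block, n_orb);
        // Reset the outputs of this tile before accumulating.
        for (std::size_t n = first; n < last; ++n)
            out.value[n] = out.grad_x[n] = out.grad_y[n] = out.grad_z[n] = 0.0;
        for (std::size_t p = 0; p < stencil.size(); ++p) {
            const double* c = coefs.data() + p * n_orb;
            const Weights s = stencil[p];
            // Unit-stride loop over orbitals: a candidate for SIMD.
            for (std::size_t n = first; n < last; ++n) {
                out.value[n]  += s.w  * c[n];
                out.grad_x[n] += s.dx * c[n];
                out.grad_y[n] += s.dy * c[n];
                out.grad_z[n] += s.dz * c[n];
            }
        }
    }
}
```

In an array-of-structures layout the four outputs of one orbital would sit next to each other, so the innermost loop would write with a stride of four doubles. The SoA arrays keep each component contiguous, and the tile size gives one knob for trading working-set size against loop overhead. The result does not depend on the choice of `block`.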
