Performance analysis of the Kahan-enhanced scalar product on current multi- and manycore processors
Johannes Hofmann, Dietmar Fey, Michael Riedmann, Jan Eitzinger, Georg Hager, Gerhard Wellein

TL;DR
This paper analyzes the performance of a Kahan-enhanced scalar product on modern multi- and manycore processors, showing it can be nearly as efficient as naive implementations with proper low-level optimizations.
Contribution
It provides a detailed performance analysis and SIMD-optimized implementation of the Kahan scalar product across multiple architectures, extending the ECM model.
Findings
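The Kahan-enhanced scalar product costs almost nothing extra over the naive version once SIMD vectorization and unrolling are applied
Low-level instruction analysis pinpoints the performance bottlenecks for single-core and thread-parallel execution
The extended ECM model predicts performance and saturation behavior across the architectures studied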
Abstract
We investigate the performance characteristics of a numerically enhanced scalar product (dot) kernel loop that uses the Kahan algorithm to compensate for numerical errors, and describe efficient SIMD-vectorized implementations on recent multi- and manycore processors. Using low-level instruction analysis and the execution-cache-memory (ECM) performance model we pinpoint the relevant performance bottlenecks for single-core and thread-parallel execution, and predict performance and saturation behavior. We show that the Kahan-enhanced scalar product comes at almost no additional cost compared to the naive (non-Kahan) scalar product if appropriate low-level optimizations, notably SIMD vectorization and unrolling, are applied. The ECM model is extended appropriately to accommodate not only modern Intel multicore chips but also the Intel Xeon Phi "Knights Corner" coprocessor and an IBM POWER8…
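For readers unfamiliar with the compensation scheme the abstract refers to, the following is a minimal scalar sketch in C of a naive and a Kahan-compensated dot product. It is not the authors' SIMD-vectorized, unrolled implementation; the function names `dot_naive` and `dot_kahan` are hypothetical and chosen for illustration only. The sketch assumes IEEE 754 double precision and a compiler that does not reassociate floating-point operations (for example, no `-ffast-math`), since reassociation can optimize the compensation term away.

```c
#include <assert.h>
#include <stddef.h>

/* Naive scalar product: every addition may lose low-order bits of the
 * product, and those rounding errors accumulate in sum. */
double dot_naive(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}

/* Kahan-compensated scalar product (illustrative sketch, hypothetical name).
 * The variable c carries the rounding error of the previous addition and
 * is subtracted from the next product before it is added to sum. */
double dot_kahan(const double *a, const double *b, size_t n)
{
    assert(n == 0 || (a != NULL && b != NULL));

    double sum = 0.0; /* running total */
    double c = 0.0;   /* running compensation for lost low-order bits */

    for (size_t i = 0; i < n; ++i) {
        double prod = a[i] * b[i];
        double y = prod - c;   /* apply the correction from the last step */
        double t = sum + y;    /* low-order bits of y may be lost here */
        c = (t - sum) - y;     /* recover what was lost (negated) */
        sum = t;
    }
    return sum;
}
```

As a small illustration, with `a = {1.0, 1e-16, 1e-16, 1e-16, 1e-16}` and `b` all ones, each `1e-16` term is below half a unit in the last place of 1.0, so `dot_naive` returns exactly 1.0, whereas `dot_kahan` carries the lost bits forward and returns a value greater than 1.0. The Kahan loop performs four dependent floating-point operations per element instead of one, which is why the abstract stresses SIMD vectorization and unrolling for hiding that extra work.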
