Performance analysis of the Kahan-enhanced scalar product on current multicore processors
Johannes Hofmann, Dietmar Fey, Jan Eitzinger, Georg Hager, Gerhard Wellein

TL;DR
This paper analyzes the performance of a Kahan-enhanced scalar product on recent Intel multicore processors. It shows that, with SIMD vectorization and unrolling, the Kahan version runs at almost the same speed as the naive scalar product, and it examines how architectural changes across four Intel Xeon generations affect performance.
Contribution
It provides a detailed performance analysis and efficient SIMD-vectorized implementations of the Kahan scalar product across four generations of Intel Xeon processors.
Findings
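The Kahan-enhanced scalar product achieves nearly the same performance as the naive implementation when SIMD vectorization and unrolling are applied.
Performance bottlenecks for single-core and thread-parallel execution are identified using low-level instruction analysis and the execution-cache-memory (ECM) model.
Architectural changes across four generations of Intel Xeon processors significantly affect performance and saturation behavior.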
Abstract
We investigate the performance characteristics of a numerically enhanced scalar product (dot) kernel loop that uses the Kahan algorithm to compensate for numerical errors, and describe efficient SIMD-vectorized implementations on recent Intel processors. Using low-level instruction analysis and the execution-cache-memory (ECM) performance model we pinpoint the relevant performance bottlenecks for single-core and thread-parallel execution, and predict performance and saturation behavior. We show that the Kahan-enhanced scalar product comes at almost no additional cost compared to the naive (non-Kahan) scalar product if appropriate low-level optimizations, notably SIMD vectorization and unrolling, are applied. We also investigate the impact of architectural changes across four generations of Intel Xeon processors.
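The compensation idea the abstract refers to can be illustrated with a minimal scalar sketch. This is not the paper's SIMD implementation: the function names `naive_dot`, `kahan_dot`, and `kahan_dot_lanes` are hypothetical, the code uses plain Python double-precision floats, and the lane-wise variant only mimics independent SIMD or unrolled accumulators under that assumption.

```python
def naive_dot(a, b):
    """Plain scalar product: one multiply and one add per element."""
    s = 0.0
    for x, y in zip(a, b):
        s += x * y
    return s


def kahan_dot(a, b):
    """Scalar product with Kahan compensation of the running sum."""
    s = 0.0  # running sum
    c = 0.0  # compensation term: rounding error of the previous addition
    for x, y in zip(a, b):
        corrected = x * y - c    # feed the previously lost error back in
        t = s + corrected        # low-order bits of `corrected` may be lost here
        c = (t - s) - corrected  # recover the lost part (zero in exact arithmetic)
        s = t
    return s


def kahan_dot_lanes(a, b, lanes=4):
    """Kahan dot with `lanes` independent accumulators (assumed SIMD-like layout)."""
    s = [0.0] * lanes
    c = [0.0] * lanes
    for i, (x, y) in enumerate(zip(a, b)):
        k = i % lanes            # element i goes to lane i mod lanes
        corrected = x * y - c[k]
        t = s[k] + corrected
        c[k] = (t - s[k]) - corrected
        s[k] = t
    # Horizontal reduction: combine lane sums and their pending corrections,
    # again using compensated summation.
    partials = s + [-ck for ck in c]
    return kahan_dot(partials, [1.0] * len(partials))


# Ten terms of 1e-16 are each too small to change a running sum of 1.0.
a = [1.0] + [1e-16] * 10
b = [1.0] * 11
print(naive_dot(a, b))
print(kahan_dot(a, b))
print(kahan_dot_lanes(a, b))
```

In this example the naive loop returns exactly 1.0, because every small product is rounded away, while both compensated variants recover a result close to 1 + 1e-15. The extra subtractions and additions per element are the overhead that the paper reports can be hidden through SIMD vectorization and unrolling.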
