Impacts of floating-point non-associativity on reproducibility for HPC and deep learning applications
Sanjif Shanmugavelu, Mathieu Taillefumier, Christopher Culver, Oscar Hernandez, Mark Coletti, Ada Sedova

TL;DR
This paper investigates how floating-point non-associativity affects reproducibility in HPC and deep learning applications, and evaluates deterministic approaches, including hardware solutions, as ways to improve reliability.
Contribution
It provides a comprehensive analysis of floating-point non-associativity effects in modern parallel computing and assesses deterministic hardware as a solution for reproducibility issues.
Findings
Non-associativity causes significant run-to-run variability in HPC and deep learning.
Replacing atomic operations with deterministic alternatives impacts performance and productivity.
Hardware-based determinism, as provided by Groq accelerators, enhances reproducibility and correctness.
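The first two findings can be illustrated with a minimal sketch. It is not taken from the paper: the helper `sum_in_order`, the sample values, and the use of a sorted order as the "deterministic alternative" are all assumptions chosen for illustration. The sketch shows that regrouping three terms changes the rounded result, that summing the same values in different arrival orders (as an unordered, atomic-style reduction might) yields several distinct totals, and that imposing one canonical order removes the variability.

```python
from itertools import permutations


def sum_in_order(values):
    """Accumulate left to right, as a serial loop or a chain of atomic adds would."""
    total = 0.0
    for v in values:
        total += v
    return total


# Three-term case: the grouping alone changes the rounded result.
a, b, c = 0.1, 0.2, 0.3
left_grouped = (a + b) + c
right_grouped = a + (b + c)

# Illustrative values (an assumption, chosen so that rounding is visible).
values = [1e16, 1.0, -1e16, 1.0]

# Every arrival order that an unordered (atomic-style) reduction could see.
unordered_results = {sum_in_order(p) for p in permutations(values)}

# Deterministic variant (hypothetical): impose one canonical order first.
fixed_order_results = {sum_in_order(sorted(p)) for p in permutations(values)}

print("grouping changes result:", left_grouped != right_grouped)
print("distinct unordered totals:", len(unordered_results))
print("distinct fixed-order totals:", len(fixed_order_results))
```

Note that the fixed order gives the same bits on every run but not necessarily the most accurate total; reproducibility and accuracy are separate properties. The extra sort also hints at why deterministic alternatives can carry a performance cost.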
Abstract
Run-to-run variability in parallel programs caused by floating-point non-associativity has been known to significantly affect reproducibility in iterative algorithms, due to accumulating errors. Non-reproducibility can critically affect the efficiency and effectiveness of correctness testing for stochastic programs. Recently, the sensitivity of deep learning training and inference pipelines to floating-point non-associativity has been found to sometimes be extreme. It can prevent certification for commercial applications, accurate assessment of robustness and sensitivity, and bug detection. New approaches in scientific computing applications have coupled deep learning models with high-performance computing, leading to an aggravation of debugging and testing challenges. Here we perform an investigation of the statistical properties of floating-point non-associativity within modern…
Taxonomy
Topics: Advanced Data Storage Technologies · Parallel Computing and Optimization Techniques · Numerical Methods and Algorithms
