Making root cause analysis feasible for large code bases: a solution approach for a climate model
Daniel J. Milroy, Allison H. Baker, Dorit M. Hammerling, Youngsung Kim, Elizabeth R. Jessup, Thomas Hauser

TL;DR
This paper introduces a scalable approach to root cause analysis in large code bases like the CESM climate model, enabling developers to trace output discrepancies to their sources efficiently.
Contribution
The work presents a novel technique combining graph analysis, program slicing, and ranking to reduce the search space for root cause analysis in large-scale scientific codes.
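To make the combination of graph analysis, slicing, and ranking concrete, here is a minimal sketch on a toy variable-dependency graph. Everything in it is an assumption for illustration: the variable names, the `DEPENDS_ON` structure, and the helpers `backward_slice` and `rank_candidates` are hypothetical, and the ranking rule (how many affected outputs a variable can influence) is a simple stand-in, not necessarily the metric the paper uses.

```python
from collections import deque

# Hypothetical toy graph: each key is computed from the variables in its list.
DEPENDS_ON = {
    "flux": ["temp", "wind"],
    "temp": ["solar", "albedo"],
    "wind": ["pressure"],
    "cloud": ["temp", "humidity"],
    "precip": ["cloud", "wind"],
    "ice": ["albedo"],
}


def backward_slice(depends_on, targets):
    """Return every variable that can influence any target (targets included)."""
    seen = set(targets)
    queue = deque(targets)
    while queue:
        var = queue.popleft()
        for parent in depends_on.get(var, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def rank_candidates(depends_on, targets):
    """Rank sliced variables by how many affected outputs they can influence.

    This count is an assumed, simplified ranking criterion.
    """
    counts = {}
    for target in targets:
        for var in backward_slice(depends_on, [target]):
            counts[var] = counts.get(var, 0) + 1
    # Highest count first; ties broken alphabetically for determinism.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Outputs flagged as statistically inconsistent (assumed for this example).
affected = ["flux", "precip"]
sliced = backward_slice(DEPENDS_ON, affected)
ranking = rank_candidates(DEPENDS_ON, affected)
```

In this toy case the slice drops `ice`, which cannot influence either flagged output, and the ranking puts variables shared by both outputs ahead of those that feed only one. That is the sense in which slicing shrinks the search space and ranking orders what remains.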
Findings
Effective reduction of search space for root cause analysis.
Successful identification of error sources in CESM simulations.
Improved debugging efficiency for large scientific code bases.
Abstract
For large-scale simulation codes with huge and complex code bases, where bit-for-bit comparisons are too restrictive, finding the source of statistically significant discrepancies (e.g., from a previous version, alternative hardware or supporting software stack) in output is non-trivial at best. Although there are many tools for program comprehension through debugging or slicing, few (if any) scale to a model as large as the Community Earth System Model (CESM; trademarked), which consists of more than 1.5 million lines of Fortran code. Currently for the CESM, we can easily determine whether a discrepancy exists in the output using a by now well-established statistical consistency testing tool. However, this tool provides no information as to the possible cause of the detected discrepancy, leaving developers in a seemingly impossible (and frustrating) situation. Therefore, our aim in…
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Scientific Computing and Data Management · Distributed and Parallel Computing Systems
