Big Data Meets HPC Log Analytics: Scalable Approach to Understanding Systems at Extreme Scale
Byung H. Park, Saurabh Hukerikar, Ryan Adamson, Christian Engelmann

TL;DR
This paper presents a scalable log analytics framework combining NoSQL and Apache Spark to analyze large HPC logs, enabling system health insights and fault diagnosis at extreme scale.
Contribution
It introduces a novel distributed analytics framework tailored for HPC log data, leveraging NoSQL and Spark for scalable, high-throughput processing.
Findings
Effective extraction of system behavior insights from Titan supercomputer logs
Framework demonstrates scalability and high availability for large-scale HPC data
Enables detailed fault and failure analysis in complex HPC systems
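As a rough illustration of the kind of fault analysis listed above, the sketch below turns unstructured log lines into structured events and counts them per node and severity. It is not the paper's framework and does not use Spark or a NoSQL store. It only mirrors, in plain Python, the map-then-reduce-by-key pattern that such a pipeline would typically apply at scale. The log format, the regular expression, the function names `parse_line` and `count_events`, and the sample lines are all invented for this example.

```python
import re
from collections import Counter

# Hypothetical log format, invented for illustration only:
#   "<timestamp> <node> <SEVERITY>: <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<node>\S+)\s+(?P<severity>[A-Z]+):\s+(?P<msg>.*)$"
)

def parse_line(line):
    """Map step: turn one raw line into a structured event dict.

    Returns None when the line does not match the assumed format.
    """
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None

def count_events(lines):
    """Reduce step: count parsed events per (node, severity) key."""
    counts = Counter()
    for line in lines:
        event = parse_line(line)
        if event is not None:
            counts[(event["node"], event["severity"])] += 1
    return counts

# Made-up sample lines, not taken from any real system log.
SAMPLE = [
    "2017-01-01T00:00:01 node0001 ERROR: machine check exception",
    "2017-01-01T00:00:02 node0001 ERROR: machine check exception",
    "2017-01-01T00:00:03 node0002 WARN: link degraded",
    "garbled line without structure",
]

counts = count_events(SAMPLE)
for (node, severity), n in sorted(counts.items()):
    print(node, severity, n)
```

In a distributed setting, the parse step would correspond to a per-record map and the counting step to a keyed reduction, with unparseable lines filtered out or routed elsewhere for inspection.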
Abstract
Today's high-performance computing (HPC) systems are heavily instrumented, generating logs that record abnormal events (critical conditions, faults, errors, and failures), system resource utilization, and the resource usage of user applications. Once fully analyzed and correlated, these logs can yield detailed information about system health, the root causes of failures, and an application's interactions with the system, providing valuable insights to domain scientists and system administrators. However, processing HPC logs requires a deep understanding of hardware and software components at multiple layers of the system stack. Moreover, most log data is unstructured and voluminous, which makes manual inspection difficult for system users and administrators. With rapid increases in the scale and complexity of HPC systems, log data…
