Big Data Meets HPC Log Analytics: Scalable Approach to Understanding Systems at Extreme Scale

Byung H. Park; Saurabh Hukerikar; Ryan Adamson; Christian Engelmann

arXiv:1708.06884 · cs.DC · August 24, 2017

TL;DR

This paper presents a scalable log analytics framework combining NoSQL and Apache Spark to analyze large HPC logs, enabling system health insights and fault diagnosis at extreme scale.

Contribution

It introduces a novel distributed analytics framework tailored for HPC log data, leveraging NoSQL and Spark for scalable, high-throughput processing.

Findings

01

Effective extraction of system behavior insights from Titan supercomputer logs

02

Framework demonstrates scalability and high availability for large-scale HPC data

03

Enables detailed fault and failure analysis in complex HPC systems

Abstract

Today's high-performance computing (HPC) systems are heavily instrumented, generating logs that contain information about abnormal events (such as critical conditions, faults, errors, and failures), system resource utilization, and the resource usage of user applications. These logs, once fully analyzed and correlated, can produce detailed information about system health and the root causes of failures, and can characterize an application's interactions with the system, providing valuable insights to domain scientists and system administrators. However, processing HPC logs requires a deep understanding of hardware and software components at multiple layers of the system stack. Moreover, most log data is unstructured and voluminous, making it difficult for system users and administrators to inspect the data manually. With rapid increases in the scale and complexity of HPC systems, log data…
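To make the "unstructured and voluminous" point concrete, here is a minimal single-machine sketch of the first step any such analysis needs: turning raw log lines into structured records and counting events per time bucket. It is not the paper's implementation. The line format, the node names, the event types, and the sample lines are all invented for illustration, and plain Python stands in for the distributed store and Spark jobs the paper describes.

```python
import re
from collections import Counter
from datetime import datetime

# Hypothetical log line format, invented for this sketch:
#   "<ISO timestamp> <node id> <EVENT_TYPE>: <free-text message>"
LINE_RE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<node>\S+)\s+(?P<event>[A-Z_]+):\s*(?P<msg>.*)$"
)


def parse_line(line):
    """Turn one raw log line into a structured record, or None if it does not parse."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    try:
        ts = datetime.fromisoformat(m.group("ts"))
    except ValueError:
        # The line had the right shape but not a valid ISO timestamp.
        return None
    return {
        "hour": ts.strftime("%Y-%m-%dT%H"),  # hourly bucket used as a grouping key
        "node": m.group("node"),
        "event": m.group("event"),
        "msg": m.group("msg"),
    }


def count_events(lines):
    """Count events per (hour, event type), skipping unparseable lines."""
    counts = Counter()
    for line in lines:
        rec = parse_line(line)
        if rec is not None:
            counts[(rec["hour"], rec["event"])] += 1
    return counts


# Made-up sample lines; they are not taken from any real system log.
SAMPLE = [
    "2017-08-24T10:01:07 node0012 MCE: machine check on bank 4",
    "2017-08-24T10:17:45 node0931 MCE: machine check on bank 1",
    "2017-08-24T11:02:13 node0012 LUSTRE_ERR: connection lost",
    "garbled line without structure",
]

if __name__ == "__main__":
    for key, n in sorted(count_events(SAMPLE).items()):
        print(key, n)
```

Keying records by an (hour, event type) pair is one plausible way to group events so that related occurrences can be counted and compared across nodes. At scale, the same grouping would presumably run as a distributed aggregation over partitioned data rather than a loop over a list.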
