Literature Study on Operational Data Analytics Frameworks in Large-scale Computing Infrastructures
Shekhar Suman, Xiaoyu Chu, Alexandru Iosup

TL;DR
This paper reviews existing operational data analytics frameworks for large-scale computing infrastructures, proposes a comprehensive new framework, and compares it to state-of-the-art solutions to highlight its novelty and benefits.
Contribution
It introduces a holistic ODA framework tailored for large-scale infrastructures, extending previous models and providing a comparative analysis with current frameworks.
Findings
Operational efficiencies achieved by state-of-the-art ODA frameworks
The proposed holistic framework covers multiple infrastructure layers
Comparison highlights the novelty and advantages of the new framework
Abstract
By 2025, zettabytes of data are generated every year. Modern large-scale computing infrastructures, such as High-Performance Computing (HPC) systems, continue to grow in size and complexity, raising concerns about their manageability and sustainability. For this reason, these systems are equipped with fine-grained monitoring and Operational Data Analytics (ODA) capabilities to optimise their efficiency. In this literature study, we list the fundamental pillars of large-scale computing infrastructures that enable their ODA capabilities, and survey the popular ODA frameworks operating in such environments (predominantly HPC). Based on that, we propose a more holistic ODA framework matching the various layers of a large-scale graph-processing distributed ecosystem proposed by Sherif Sakr et al., that extends the ODA…
Taxonomy
Topics: Cloud Computing and Resource Management · Graph Theory and Algorithms · Software System Performance and Reliability
