Hadoop on HPC: Integrating Hadoop and Pilot-based Dynamic Resource Management
Andre Luckow, Ioannis Paraskevakos, George Chantzialexiou, and Shantenu Jha

TL;DR
This paper explores integrating Hadoop with high-performance computing environments through resource management middleware, enabling scientific applications to combine traditional computing with Hadoop-based data analysis.
Contribution
It proposes extensions to the Pilot-Abstraction to unify resource management for HPC and Hadoop, facilitating integrated scientific workflows.
Findings
Extended Pilot-Abstraction supports HPC-Hadoop integration
Enables coupling of simulation and data analytics stages
Provides practical solutions for hybrid environment management
Abstract
High-performance computing platforms such as supercomputers have traditionally been designed to meet the compute demands of scientific applications. Consequently, they have been architected as producers and not consumers of data. The Apache Hadoop ecosystem has evolved to meet the requirements of data processing applications and has addressed many of the limitations of HPC platforms. There exists a class of scientific applications, however, that need the collective capabilities of traditional high-performance computing environments and the Apache Hadoop ecosystem. For example, the scientific domains of bio-molecular dynamics, genomics and network science need to couple traditional computing with Hadoop/Spark-based analysis. We investigate the critical question of how to present the capabilities of both computing environments to such scientific applications. Whereas this question needs…
