The Flux Operator
Vanessa Sochat, Aldo Culquicondor, Antonio Ojea, and Daniel Milroy

TL;DR
This paper introduces the Flux Operator, enabling hierarchical HPC workload management within Kubernetes, bridging cloud and HPC environments for scalable, efficient job scheduling and resource management.
Contribution
It presents the design and implementation of the Flux Operator, integrating Flux Framework's capabilities into Kubernetes for improved workload orchestration.
Findings
The Flux Operator achieves scalable hierarchical resource management.
Its performance is comparable or superior to that of the MPI Operator.
It facilitates convergence of cloud and HPC workload management.
Abstract
Converged computing brings together the best of both worlds for high performance computing (HPC) and cloud-native communities. In fact, the economic impact of cloud computing, and the need for portability, flexibility, and manageability, make it not merely important, but inevitable. Navigating this uncharted territory requires not just innovation in the technology space, but also effort toward collaboration and sharing of ideas. With these goals in mind, this work first tackles the central component of running batch workflows, whether in cloud or HPC: the workload manager. For cloud, Kubernetes has become the de facto tool for this kind of batch orchestration. For HPC, the next-generation HPC workload manager Flux Framework is analogous, combining fully hierarchical resource management and graph-based scheduling to support intelligent scheduling and job management. Convergence of these managers…
Taxonomy
Topics: Cloud Computing and Resource Management · Distributed and Parallel Computing Systems · Scientific Computing and Data Management
