Deploying a sharded MongoDB cluster as a queued job on a shared HPC architecture

Aaron Saxton; Stephen Squaire

arXiv:2209.15390·cs.DC·October 3, 2022

Open Access

TL;DR

This paper demonstrates deploying a sharded MongoDB cluster as a queued job on an HPC system, so that the data store runs concurrently with a data science workload, and measures ingest and query performance across configurations on the Blue Waters supercomputer.

Contribution

It introduces a method to run a MongoDB sharded cluster within an HPC job queue, integrating data store deployment with HPC scheduling for data science tasks.

Findings

01

Successful deployment of MongoDB cluster on HPC system

02

Performance measurements for data ingest and queries

03

Insights into configuration impacts on HPC performance

Abstract

Data stores are the foundation on which data science, in all its variations, is built. They provide a queryable interface to structured and unstructured data. Data science often starts by leveraging these query features to perform initial data preparation. However, most data stores are designed to run continuously to service disparate user requests with little or no downtime. Many HPC architectures process user requests through a job queue scheduler and maintain a shared filesystem to store a job's persistent data. We deploy a MongoDB sharded cluster with a run script that is designed to run a data science workload concurrently. As our test piece, we run data ingest and data queries to measure performance with different configurations on the Blue Waters supercomputer.
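
To make the deployment idea concrete, the following is a minimal sketch of how a run script might divide the nodes of one queued job among the three MongoDB sharded-cluster roles (config server, `mongos` router, shard) and build the launch command for each node. It is not the authors' script. The function names, the node names, the `/scratch/job` path, the replica-set names, and the one-process-per-node layout are all assumptions made for illustration; only the `mongod` and `mongos` command-line flags are standard MongoDB options.

```python
from typing import Dict, List


def assign_roles(nodes: List[str], n_config: int = 1,
                 n_routers: int = 1) -> Dict[str, List[str]]:
    """Split a job's node list into config, router, and shard nodes.

    Hypothetical helper: the split sizes are illustrative defaults.
    """
    if len(nodes) < n_config + n_routers + 1:
        raise ValueError("need at least one node left over for a shard")
    return {
        "config": nodes[:n_config],
        "router": nodes[n_config:n_config + n_routers],
        "shard": nodes[n_config + n_routers:],
    }


def build_commands(roles: Dict[str, List[str]],
                   scratch: str = "/scratch/job",  # assumed path on the shared filesystem
                   port: int = 27017) -> Dict[str, List[str]]:
    """Return one launch command (as an argv list) per node."""
    # mongos locates the config servers via "<replSetName>/<host:port,...>".
    config_hosts = ",".join(f"{host}:{port}" for host in roles["config"])
    commands: Dict[str, List[str]] = {}

    for host in roles["config"]:
        commands[host] = ["mongod", "--configsvr", "--replSet", "cfg",
                          "--dbpath", f"{scratch}/{host}",
                          "--port", str(port), "--bind_ip_all"]

    # Each shard is its own single-member replica set in this sketch.
    for index, host in enumerate(roles["shard"]):
        commands[host] = ["mongod", "--shardsvr", "--replSet", f"shard{index}",
                          "--dbpath", f"{scratch}/{host}",
                          "--port", str(port), "--bind_ip_all"]

    for host in roles["router"]:
        commands[host] = ["mongos", "--configdb", f"cfg/{config_hosts}",
                          "--port", str(port), "--bind_ip_all"]

    return commands


if __name__ == "__main__":
    # Node names are placeholders for whatever the scheduler allocates.
    roles = assign_roles(["n0", "n1", "n2", "n3"])
    for host, argv in build_commands(roles).items():
        print(host, " ".join(argv))
```

A real run script would also have to launch each command on its node through the scheduler's launcher, initiate every replica set, register the shards with the router, and shut the cluster down cleanly before the job's wall-clock limit so the data written to the shared filesystem remains usable by a later job. Those steps are omitted here.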

Taxonomy

Topics: Cloud Computing and Resource Management · Advanced Data Storage Technologies · Distributed and Parallel Computing Systems