PADLL: Taming Metadata-intensive HPC Jobs Through Dynamic, Application-agnostic QoS Control
Ricardo Macedo, Mariana Miranda, Yusuke Tanimura, Jason Haga, Amit Ruhela, Stephen Lien Harrell, Richard Todd Evans, José Pereira, João Paulo

TL;DR
PADLL is a middleware solution that manages data and metadata workflows in HPC storage systems, ensuring QoS, fairness, and prioritization for metadata-intensive jobs through dynamic, application-agnostic control.
Contribution
The paper introduces PADLL, a storage middleware that applies Software-Defined Storage principles to enforce QoS policies in HPC environments, covering both data and metadata workflows.
Findings
Enforces complex QoS policies for concurrent jobs.
Ensures fairness and prioritization in metadata operations.
Demonstrated with synthetic benchmarks, real-world applications, and production traces.
Abstract
Modern I/O applications that run on HPC infrastructures are increasingly becoming read and metadata intensive. However, having multiple concurrent applications submitting large amounts of metadata operations can easily saturate the shared parallel file system's metadata resources, leading to overall performance degradation and I/O unfairness. We present PADLL, an application and file system agnostic storage middleware that enables QoS control of data and metadata workflows in HPC storage systems. It adopts ideas from Software-Defined Storage, building data plane stages that mediate and rate limit POSIX requests submitted to the shared file system, and a control plane that holistically coordinates how all I/O workflows are handled. We demonstrate its performance and feasibility under multiple QoS policies using synthetic benchmarks, real-world applications, and traces collected from a…
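The rate-limiting idea in the abstract can be illustrated with a small sketch. The abstract does not say which algorithm PADLL uses, so the token bucket below is an assumed choice, and every name here (TokenBucket, Stage, fair_share) is hypothetical rather than part of PADLL's API. The sketch shows a data plane stage that throttles requests per operation class, and a control-plane helper that splits a global metadata budget equally across jobs.

```python
class TokenBucket:
    """Token bucket: refills at `rate` tokens/second, holds at most `burst` tokens."""

    def __init__(self, rate, burst, now=0.0):
        self.rate = float(rate)
        self.burst = float(burst)
        self.tokens = float(burst)  # start full so an initial burst is allowed
        self.last = float(now)

    def _refill(self, now):
        # Credit tokens for the time elapsed since the last update, capped at burst.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now

    def set_rate(self, rate, now):
        # Settle the balance at the old rate before switching to the new one.
        self._refill(now)
        self.rate = float(rate)

    def try_acquire(self, now, cost=1.0):
        # Return True and consume `cost` tokens if the request may proceed now.
        self._refill(now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class Stage:
    """Hypothetical data plane stage: one bucket per operation class."""

    def __init__(self, limits):
        # limits maps an operation class (e.g. "metadata") to (rate, burst).
        self.buckets = {
            op_class: TokenBucket(rate, burst)
            for op_class, (rate, burst) in limits.items()
        }

    def submit(self, op_class, now):
        # Classes without a configured limit pass through unthrottled.
        bucket = self.buckets.get(op_class)
        return True if bucket is None else bucket.try_acquire(now)


def fair_share(total_rate, job_ids):
    """Hypothetical control-plane policy: split a global budget equally."""
    share = total_rate / len(job_ids)
    return {job: share for job in job_ids}
```

In this sketch, a control plane would call fair_share whenever a job starts or finishes and push the result to each job's stage through set_rate. Timestamps are passed in explicitly so the behaviour is deterministic; a real stage would read a monotonic clock instead.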
Taxonomy
Topics: Advanced Data Storage Technologies · Distributed and Parallel Computing Systems · Cloud Computing and Resource Management
