CARAT: Client-Side Adaptive RPC and Cache Co-Tuning for Parallel File Systems
Md Hasanur Rashid, Nathan R. Tallent, Forrest Sheng Bao, Dong Dai

TL;DR
CARAT is an ML-guided framework that enables scalable, online, client-side tuning of RPC and cache parameters in parallel file systems, significantly improving performance in dynamic HPC environments.
Contribution
CARAT introduces a scalable online tuning approach for parallel file systems (PFS) in which each client tunes itself independently using only locally observable metrics, unlike prior approaches that depend on global state or on known application I/O patterns.
Findings
Achieves up to 3x performance improvement over static configurations.
Demonstrates effectiveness across diverse I/O patterns and workloads.
Demonstrates that the framework is scalable and lightweight enough for practical deployment.
Abstract
Tuning parallel file systems (PFS) in High-Performance Computing (HPC) systems remains challenging due to complex I/O paths, diverse I/O patterns, and dynamic system conditions. While existing autotuning frameworks have shown promising results in tuning PFS parameters based on applications' I/O patterns, they lack scalability, adaptivity, and the ability to operate online. In this work, focusing on scalable online tuning, we present CARAT, an ML-guided framework that co-tunes the client-side RPC and caching parameters of a PFS using only locally observable metrics. Unlike global or pattern-dependent approaches, CARAT enables each client to make independent and intelligent tuning decisions online, responding to real-time changes in both application I/O behavior and system state. We prototyped CARAT on Lustre and evaluated it extensively across dynamic I/O patterns, real-world HPC…
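As a rough illustration of the per-client decision loop the abstract describes, the Python sketch below has a single client score nearby configurations with a predictor fed only by local metrics, then move to the best one. Everything specific here is an assumption rather than something taken from the paper: the two parameter names (borrowed from the Lustre client tunables `max_rpcs_in_flight` and `max_dirty_mb`), their candidate values, the metric names, and `toy_predictor`, a hand-written stand-in for a trained ML model. The greedy neighbor search is likewise a simple placeholder, not CARAT's actual algorithm.

```python
from typing import Callable, Dict, List

Config = Dict[str, int]
Metrics = Dict[str, float]

# Hypothetical search space: one RPC parameter and one cache parameter.
# The names mirror Lustre client tunables; the value grids are made up.
SEARCH_SPACE: Dict[str, List[int]] = {
    "max_rpcs_in_flight": [8, 16, 32, 64],
    "max_dirty_mb": [32, 64, 128, 256],
}


def neighbors(config: Config) -> List[Config]:
    """Return configs that differ from `config` by one grid step in one parameter."""
    result = []
    for name, values in SEARCH_SPACE.items():
        i = values.index(config[name])
        for j in (i - 1, i + 1):
            if 0 <= j < len(values):
                candidate = dict(config)
                candidate[name] = values[j]
                result.append(candidate)
    return result


def tune_step(config: Config, metrics: Metrics,
              predict: Callable[[Metrics, Config], float]) -> Config:
    """Move to the neighbor with the highest predicted score, or stay put."""
    best, best_score = config, predict(metrics, config)
    for candidate in neighbors(config):
        score = predict(metrics, candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best


def toy_predictor(metrics: Metrics, config: Config) -> float:
    """Stand-in for a trained model: higher is better (purely illustrative).

    Assumes, for the sake of the example only, that large I/O favors more
    in-flight RPCs and that write-heavy phases favor a larger dirty cache.
    """
    target_rpcs = 64 if metrics["avg_io_size_kb"] >= 1024 else 8
    target_dirty = 256 if metrics["write_fraction"] > 0.5 else 32
    return -(abs(config["max_rpcs_in_flight"] - target_rpcs)
             + abs(config["max_dirty_mb"] - target_dirty))


if __name__ == "__main__":
    config = {"max_rpcs_in_flight": 8, "max_dirty_mb": 32}
    # Local metrics a client might sample each interval (hypothetical names).
    metrics = {"avg_io_size_kb": 4096.0, "write_fraction": 0.9}
    for _ in range(10):
        config = tune_step(config, metrics, toy_predictor)
    print(config)
```

Because each step reads only this client's own metrics and changes only this client's own settings, no coordination between clients is needed. In a real deployment the chosen values would then be applied to the file system client, a step this sketch leaves out.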
Taxonomy
Topics: Advanced Data Storage Technologies · Cloud Computing and Resource Management · Distributed and Parallel Computing Systems
