Mitigating Shared Storage Congestion Using Control Theory
Thomas Collignon (1, 2, 3), Kouds Halitim (4, 5), Raphaël Bleuse (4, 5), Sophie Cerf (4, 5), Bogdan Robu (6), Éric Rutten (4, 5), Lionel Seinturier (7, 2, 8, 1), Alexandre van Kempen (3) ((1) SPIRALS - Self-adaptation for distributed services, large software systems

TL;DR
This paper introduces a control theory-based method to dynamically regulate I/O rates in shared HPC environments, effectively reducing congestion and improving overall performance stability.
Contribution
It presents a novel self-adaptive control approach that uses runtime metrics to mitigate shared storage congestion in HPC systems.
Findings
Reduces total runtime by up to 20%.
Lowers tail latency under a representative workload.
Maintains stable performance during congestion.
Abstract
Efficient data access in High-Performance Computing (HPC) systems is essential to the performance of intensive computing tasks. Traditional optimizations of the I/O stack aim to improve peak performance but are often workload specific and require deep expertise, making them difficult to generalize or re-use. In shared HPC environments, resource congestion can lead to unpredictable performance, causing slowdowns and timeouts. To address these challenges, we propose a self-adaptive approach based on Control Theory to dynamically regulate client-side I/O rates. Our approach leverages a small set of runtime system load metrics to reduce congestion and enhance performance stability. We implement a controller in a multi-node cluster and evaluate it on a real testbed under a representative workload. Experimental results demonstrate that our method effectively mitigates I/O congestion, reducing…
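To make the idea of regulating client-side I/O rates from runtime load metrics more concrete, here is a minimal sketch of a feedback loop. It is not the authors' controller: the abstract does not state the control law, so the proportional-integral (PI) form, the gains, the rate bounds, the setpoint, and the linear toy model of the storage load (`toy_storage_load`) are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class PIController:
    """Discrete-time PI controller with output clamping (illustrative only)."""
    kp: float            # proportional gain (hypothetical value)
    ki: float            # integral gain (hypothetical value)
    setpoint: float      # target value of the load metric
    u_min: float         # lowest client I/O rate limit the controller may issue
    u_max: float         # highest client I/O rate limit the controller may issue
    integral: float = 0.0

    def update(self, measured: float, dt: float = 1.0) -> float:
        """Return a new rate limit from the latest load measurement."""
        error = self.setpoint - measured
        candidate = self.integral + error * dt
        u = self.kp * error + self.ki * candidate
        clamped = min(self.u_max, max(self.u_min, u))
        # Anti-windup: keep the accumulated error only when the output
        # was not saturated by the rate bounds.
        if clamped == u:
            self.integral = candidate
        return clamped


def toy_storage_load(rate_limit: float) -> float:
    """Hypothetical plant: load grows linearly with the allowed I/O rate."""
    background = 20.0    # load caused by other tenants (assumed constant)
    gain = 0.5           # load units per unit of client I/O rate (assumed)
    return gain * rate_limit + background


def simulate(steps: int = 100):
    """Run the closed loop and return the rate limits and measured loads."""
    ctrl = PIController(kp=0.2, ki=0.5, setpoint=80.0, u_min=10.0, u_max=200.0)
    rate = ctrl.u_max    # start unthrottled, i.e. in a congested state
    rates, loads = [], []
    for _ in range(steps):
        load = toy_storage_load(rate)   # measure the runtime load metric
        rate = ctrl.update(load)        # compute the next client rate limit
        rates.append(rate)
        loads.append(toy_storage_load(rate))
    return rates, loads


if __name__ == "__main__":
    rates, loads = simulate()
    print(f"final rate limit: {rates[-1]:.1f}, final load: {loads[-1]:.1f}")
```

In this toy setting the loop starts above the load target, throttles the clients, and then relaxes the limit until the measured load settles near the setpoint. On a real system the plant function would be replaced by measurements from the storage stack, and the gains would need to be identified and tuned for that system.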
Taxonomy
Topics: Distributed and Parallel Computing Systems · Cloud Computing and Resource Management · Advanced Data Storage Technologies
