IOAgent: Democratizing Trustworthy HPC I/O Performance Diagnosis Capability via LLMs
Chris Egersdoerfer, Arnav Sareen, Jean Luca Bez, Suren Byna, Dongkuan Xu, Dong Dai

TL;DR
IOAgent combines large language models with integrated HPC I/O domain knowledge to automate trustworthy diagnosis of I/O performance issues, making accurate diagnosis accessible to domain scientists.
Contribution
The paper introduces IOAgent, an LLM-based system that diagnoses HPC I/O performance issues by integrating domain knowledge and that serves as an interactive, explainable diagnosis tool.
Findings
IOAgent matches or outperforms state-of-the-art diagnosis tools.
It is effective with both proprietary and open-source LLMs.
The system is evaluated on a new open test suite, TraceBench.
Abstract
As the complexity of the HPC storage stack rapidly grows, domain scientists face increasing challenges in effectively utilizing HPC storage systems to achieve their desired I/O performance. To identify and address I/O issues, scientists largely rely on I/O experts to analyze their I/O traces and provide insights into potential problems. However, with a limited number of I/O experts and the growing demand for data-intensive applications, inaccessibility has become a major bottleneck, hindering scientists from maximizing their productivity. Rapid advances in LLMs make it possible to build an automated tool that brings trustworthy I/O performance diagnosis to domain scientists. However, key challenges remain, such as the inability to handle long context windows, a lack of accurate domain knowledge about HPC I/O, and the generation of hallucinations during complex interactions. In this…
Taxonomy
Topics: Advanced Data Storage Technologies · Distributed and Parallel Computing Systems · Parallel Computing and Optimization Techniques
