Improving MPI Error Detection and Repair with Large Language Models and Bug References
Scott Piersall, Yang Gao, Shenyang Liu, Liqiang Wang

TL;DR
This paper enhances large language models for MPI error detection and repair by integrating Few-Shot Learning, Chain-of-Thought reasoning, and Retrieval Augmented Generation, significantly improving accuracy over baseline models.
Contribution
The paper introduces a bug detection and repair technique for MPI programs that pairs bug references with Few-Shot Learning, Chain-of-Thought reasoning, and Retrieval Augmented Generation, raising error detection accuracy from 44% to 77%.
Findings
Error detection accuracy improved from 44% to 77%.
The bug referencing technique generalizes well to other LLMs.
The enhanced methods outperform applying ChatGPT directly.
Abstract
Message Passing Interface (MPI) is a foundational technology in high-performance computing (HPC), widely used for large-scale simulations and distributed training (e.g., in machine learning frameworks such as PyTorch and TensorFlow). However, maintaining MPI programs remains challenging because of the complex interplay among processes and the intricacies of message passing and synchronization. With the advancement of large language models like ChatGPT, it is tempting to adopt such technology for automated error detection and repair. Yet, our studies reveal that directly applying large language models (LLMs) yields suboptimal results, largely because these models lack essential knowledge about correct and incorrect usage, particularly the bugs found in MPI programs. In this paper, we design a bug detection and repair technique alongside Few-Shot Learning (FSL), Chain-of-Thought (CoT)…
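To make the idea of combining bug references with few-shot examples and step-by-step reasoning more concrete, here is a minimal Python sketch of how such a prompt might be assembled. It is an illustration only, not the authors' pipeline: the `BugReference` type, the two-entry `CORPUS`, the token-overlap retrieval (a crude stand-in for a real retriever), and the prompt wording are all assumptions made for this example, and no LLM is actually called.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class BugReference:
    """Hypothetical bug reference: a faulty snippet, a diagnosis, and a fix."""
    buggy_code: str
    explanation: str
    fixed_code: str


# Tiny illustrative corpus (assumed for this sketch, not taken from the paper).
CORPUS = [
    BugReference(
        buggy_code=(
            "MPI_Recv(buf, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &status);\n"
            "MPI_Send(buf, 1, MPI_INT, peer, 0, MPI_COMM_WORLD);"
        ),
        explanation="Both ranks block in MPI_Recv before either sends, so the exchange deadlocks.",
        fixed_code=(
            "MPI_Sendrecv(out, 1, MPI_INT, peer, 0, in, 1, MPI_INT, peer, 0,\n"
            "             MPI_COMM_WORLD, &status);"
        ),
    ),
    BugReference(
        buggy_code="if (rank == 0) MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);",
        explanation="MPI_Bcast is a collective and must be called by every rank in the communicator.",
        fixed_code="MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);",
    ),
]

# Program under review: same receive-before-send pattern as the first reference.
QUERY = (
    "MPI_Recv(data, n, MPI_DOUBLE, other, 7, MPI_COMM_WORLD, &st);\n"
    "MPI_Send(data, n, MPI_DOUBLE, other, 7, MPI_COMM_WORLD);"
)


def tokenize(code: str) -> set[str]:
    """Extract identifiers; a crude substitute for an embedding model."""
    return set(re.findall(r"[A-Za-z_]\w*", code))


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(query_code: str, corpus: list[BugReference], k: int = 1) -> list[BugReference]:
    """Return the k references whose buggy code best overlaps the query."""
    query_tokens = tokenize(query_code)
    ranked = sorted(
        corpus,
        key=lambda ref: jaccard(query_tokens, tokenize(ref.buggy_code)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query_code: str, references: list[BugReference]) -> str:
    """Combine retrieved references (few-shot) with a step-by-step instruction."""
    parts = ["You are reviewing an MPI program for errors."]
    for i, ref in enumerate(references, start=1):
        parts.append(
            f"Example {i} (buggy):\n{ref.buggy_code}\n"
            f"Diagnosis: {ref.explanation}\n"
            f"Fix:\n{ref.fixed_code}"
        )
    parts.append(f"Program under review:\n{query_code}")
    parts.append(
        "Reason step by step about each MPI call, then state whether the "
        "program contains an error and propose a fix."
    )
    return "\n\n".join(parts)


if __name__ == "__main__":
    print(build_prompt(QUERY, retrieve(QUERY, CORPUS, k=1)))
```

Running the script prints a prompt in which the retrieved deadlock reference precedes the program under review. In practice the retrieval step would likely use a larger bug corpus and a stronger similarity measure than identifier overlap.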
