Demystifying the Silence of Correctness Bugs in PyTorch Compiler
Meiziniu Li, Dongze Li, Jianmeng Liu, Shing-Chi Cheung

TL;DR
This paper presents an empirical study of correctness bugs in the PyTorch compiler, introduces a tailored testing technique called AlignGuard, and reports successful detection of 23 new bugs confirmed or fixed by the PyTorch team.
Contribution
The paper presents the first systematic analysis of correctness bugs in torch.compile and proposes AlignGuard, an LLM-based testing method for detecting them.
Findings
19.2% of high-priority issues in the PyTorch community are correctness bugs caused by torch.compile, the second-most-common category behind program crashes (19.57%).
AlignGuard detected 23 new correctness bugs, with 14 marked as high-priority.
All detected bugs were confirmed or fixed by the PyTorch team.
Abstract
Performance optimization of AI infrastructure is key to the fast adoption of large language models (LLMs). The PyTorch compiler (torch.compile), a core optimization tool for deep learning (DL) models (including LLMs), has received due attention. However, torch.compile is prone to correctness bugs, which cause incorrect outputs of compiled DL models without triggering exceptions, crashes, or warnings. These bugs pose a serious threat to the reliability of downstream LLM applications. Data from the PyTorch community shows that 19.2% of high-priority issues are incorrect outputs of compiled DL models induced by torch.compile bugs, the second-most-common bug category (only behind program crashes at 19.57%). However, no systematic study has been conducted to specifically characterize and thereby detect these bugs. In this paper, we present the first empirical study of the correctness bugs in…
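To make "incorrect outputs without exceptions, crashes, or warnings" concrete, here is a minimal sketch of a generic differential check between an eager function and a candidate such as its torch.compile counterpart. It is not AlignGuard, whose design is not described in this summary. The names `silently_diverges` and `toy_fn`, and the tolerance values, are hypothetical choices made for illustration.

```python
import torch


def silently_diverges(reference_fn, candidate_fn, example_inputs,
                      rtol=1e-4, atol=1e-5):
    """Return True if candidate_fn runs without error but disagrees with reference_fn.

    Hypothetical helper: the tolerances are illustrative assumptions,
    not values taken from the paper.
    """
    # Reference result from ordinary eager execution.
    expected = reference_fn(*example_inputs)
    # Candidate result. An exception here propagates to the caller,
    # so it would be a crash bug rather than a silent one.
    actual = candidate_fn(*example_inputs)
    # A silent correctness bug shows up only as a numerical mismatch.
    return not torch.allclose(expected, actual, rtol=rtol, atol=atol,
                              equal_nan=True)


def toy_fn(x):
    # Small stand-in for a model; any function returning a tensor works.
    return torch.nn.functional.gelu(x) * 2.0 + x.sum(dim=-1, keepdim=True)


# Intended use (requires a working torch.compile backend):
#   x = torch.randn(4, 8)
#   silently_diverges(toy_fn, torch.compile(toy_fn), (x,))
```

A result of True from the commented call would indicate the kind of silent mismatch the abstract describes. In practice the tolerances need tuning, since compiled kernels can legitimately differ from eager execution by small floating-point error.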
