Real Faults in Deep Learning Fault Benchmarks: How Real Are They?
Gunel Jahangirova, Nargiz Humbatova, Jinhan Kim, Shin Yoo, and Paolo Tonella

TL;DR
This paper critically examines the realism and reproducibility of faults in DL benchmarks. It finds that only a small fraction of the faults are realistic and that many cannot be reproduced, which undermines evaluations that rely on these benchmarks.
Contribution
The study manually analyses 490 faults from five benchmarks, 314 of which were eligible for the study, assessing their realism and reproducibility and exposing limitations in current DL fault benchmarks.
Findings
Only 18.5% of faults meet realism criteria
Faults were reproducible in only 52% of cases
Most faults do not accurately reflect real-world issues
Abstract
As the adoption of Deep Learning (DL) systems continues to rise, an increasing number of approaches are being proposed to test these systems, localise faults within them, and repair those faults. The best attestation of effectiveness for such techniques is an evaluation that showcases their capability to detect, localise and fix real faults. To facilitate these evaluations, the research community has collected multiple benchmarks of real faults in DL systems. In this work, we perform a manual analysis of 490 faults from five different benchmarks and identify that 314 of them are eligible for our study. Our investigation focuses specifically on how well the bugs correspond to the sources they were extracted from, which fault types are represented, and whether the bugs are reproducible. Our findings indicate that only 18.5% of the faults satisfy our realism conditions. Our attempts to…
Taxonomy
Topics: Fault Detection and Control Systems · Anomaly Detection Techniques and Applications · Software Engineering Research
