TL;DR
This paper critically examines the effectiveness of machine learning techniques in reducing false alarms from static analysis tools, revealing data leakage issues and limitations in evaluation methods that overstate their real-world performance.
Contribution
It uncovers subtle data leakage and evaluation issues in prior studies, providing guidelines for more realistic assessment of false alarm detection methods.
Findings
Identified data leakage in previous studies' experimental procedures.
Showed warning labels are inconsistent with human judgment.
Demonstrated that evaluation metrics may overestimate real-world performance.
Abstract
Automatic static analysis tools (ASATs), such as Findbugs, have a high false alarm rate. The large number of false alarms produced poses a barrier to adoption. Researchers have proposed the use of machine learning to prune false alarms and present only actionable warnings to developers. The state-of-the-art study has identified a set of "Golden Features" based on metrics computed over the characteristics and history of the file, code, and warning. Recent studies show that machine learning using these features is extremely effective and that they achieve almost perfect performance. We perform a detailed analysis to better understand the strong performance of the "Golden Features". We found that several studies used an experimental procedure that results in data leakage and data duplication, which are subtle issues with significant implications. Firstly, the ground-truth labels have…
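As an illustration of the data duplication issue the abstract mentions, the sketch below flags test instances whose feature vector also appears in the training set. It is not the paper's procedure. The function name `duplicated_test_indices` and the feature values are hypothetical, and exact-match comparison of feature tuples is an assumption made only to show the idea.

```python
from typing import Hashable, List, Sequence


def duplicated_test_indices(
    train_rows: Sequence[Sequence[Hashable]],
    test_rows: Sequence[Sequence[Hashable]],
) -> List[int]:
    """Return indices of test rows whose feature vector also occurs in train.

    Assumption: two warnings count as duplicates when all of their
    feature values match exactly.
    """
    seen = {tuple(row) for row in train_rows}
    return [i for i, row in enumerate(test_rows) if tuple(row) in seen]


# Hypothetical feature vectors: (file age, warning priority, pattern id)
train = [(3, 0.5, "pattern_a"), (1, 0.2, "pattern_b")]
test = [(3, 0.5, "pattern_a"), (7, 0.9, "pattern_c")]

leaked = duplicated_test_indices(train, test)
print(f"{len(leaked)} of {len(test)} test rows also appear in training: {leaked}")
```

A high share of duplicated rows would suggest that a classifier can score well by memorizing training instances, so reported test performance may not carry over to unseen warnings.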
