Test Suites as a Source of Training Data for Static Analysis Alert Classifiers

Lori Flynn; William Snavely; Zachary Kurtz

arXiv:2105.03523 · cs.SE · May 11, 2021

TL;DR

This paper proposes static analysis test suites as a new source of training data for machine learning classifiers that triage static analysis alerts. In a case study on the Juliet C/C++ test suite, the resulting classifiers reached 90.2% precision and 88.2% recall.

Contribution

The paper introduces the use of static analysis test suites as training data for alert classifiers, addressing the scarcity of labeled training data for such classifiers.

Findings

1. Classifiers achieved 90.2% precision and 88.2% recall.

2. Using test suite data can effectively pre-train static analysis alert classifiers.

3. The approach shows promise for data-limited static analysis contexts.
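
For readers less familiar with the two metrics quoted in the findings, the following is a minimal sketch of how precision and recall are computed from true and predicted labels. The function name and the toy labels are illustrative assumptions; they are not the paper's data, and the summary above does not say which class the paper treats as "positive".

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy labels for illustration only (not results from the paper).
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
precision, recall = precision_recall(y_true, y_pred)
```

Precision is the share of predicted positives that are correct; recall is the share of actual positives that the classifier finds.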

Abstract

Flaw-finding static analysis tools typically generate large volumes of code flaw alerts including many false positives. To save on human effort to triage these alerts, a significant body of work attempts to use machine learning to classify and prioritize alerts. Identifying a useful set of training data, however, remains a fundamental challenge in developing such classifiers in many contexts. We propose using static analysis test suites (i.e., repositories of "benchmark" programs that are purpose-built to test coverage and precision of static analysis tools) as a novel source of training data. In a case study, we generated a large quantity of alerts by executing various static analyzers on the Juliet C/C++ test suite, and we automatically derived ground truth labels for these alerts by referencing the Juliet test suite metadata. Finally, we used this data to train classifiers to predict…
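
The labeling step described in the abstract can be illustrated with a rough sketch. Everything here is an assumption made for illustration: the `Alert` fields, the `known_flaws` layout, the file name, and the exact-match rule are hypothetical, and the excerpt does not describe how the authors actually matched alerts to the Juliet metadata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    tool: str  # analyzer that raised the alert (hypothetical field)
    path: str  # source file the alert points at
    line: int  # reported line number
    cwe: str   # flaw category claimed by the analyzer


def label_alerts(alerts, known_flaws):
    """Pair each alert with a ground-truth label.

    `known_flaws` is a set of (path, line, cwe) tuples standing in for
    test-suite metadata (assumed layout). True means the alert matches a
    documented flaw; False means it is treated as a false positive.
    """
    return [(a, (a.path, a.line, a.cwe) in known_flaws) for a in alerts]


# Hypothetical metadata and alerts, for illustration only.
known_flaws = {("cwe121_example_01.c", 42, "CWE-121")}
alerts = [
    Alert("tool_a", "cwe121_example_01.c", 42, "CWE-121"),
    Alert("tool_b", "cwe121_example_01.c", 77, "CWE-121"),
]
labeled = label_alerts(alerts, known_flaws)
```

Treating every unmatched alert as a false positive is a simplification; a real pipeline would likely need a third "unknown" outcome for alerts the metadata cannot confirm either way.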
