Noisy Label Learning for Security Defects
Roland Croft, M. Ali Babar, Huaming Chen

TL;DR
This paper introduces robust noisy label learning methods for security defect prediction, addressing label noise issues in vulnerability datasets to improve predictive performance.
Contribution
It proposes a two-stage learning method for vulnerability prediction, based on noise cleaning, that first identifies noisy samples and then remediates them before the model is trained.
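The two-stage noise cleaning idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the k-nearest-neighbour disagreement filter, the function names `flag_suspect_negatives` and `remediate`, the parameter `k`, and the toy feature vectors are all assumptions chosen for illustration. The one detail taken from the paper is that the non-vulnerable class (label 0) is treated as the noisy one, so only those labels are candidates for cleaning.

```python
from collections import Counter
from math import dist


def flag_suspect_negatives(features, labels, k=3):
    """Stage 1 (identify): return indices of samples labelled 0 whose
    k nearest neighbours are mostly labelled 1."""
    suspects = []
    for i, (x, y) in enumerate(zip(features, labels)):
        if y != 0:
            continue  # assumption: only non-vulnerable labels are noisy
        # Rank every other sample by Euclidean distance and keep the k closest.
        neighbours = sorted(
            (j for j in range(len(features)) if j != i),
            key=lambda j: dist(x, features[j]),
        )[:k]
        votes = Counter(labels[j] for j in neighbours)
        if votes[1] > votes[0]:
            suspects.append(i)
    return suspects


def remediate(features, labels, suspects, strategy="relabel"):
    """Stage 2 (remediate): relabel suspects as vulnerable, or drop them."""
    suspect_set = set(suspects)
    if strategy == "relabel":
        new_labels = [1 if i in suspect_set else y for i, y in enumerate(labels)]
        return list(features), new_labels
    if strategy == "remove":
        kept = [i for i in range(len(labels)) if i not in suspect_set]
        return [features[i] for i in kept], [labels[i] for i in kept]
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical toy data: two clusters of module feature vectors.
# The last module sits in the vulnerable cluster but is labelled 0.
features = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
    (1.0, 1.0), (1.1, 1.0), (1.0, 1.1), (1.1, 1.1),
]
labels = [0, 0, 0, 0, 1, 1, 1, 0]

suspects = flag_suspect_negatives(features, labels, k=3)
clean_features, clean_labels = remediate(features, labels, suspects)
```

On this toy data the only flagged sample is the last one, and the cleaned labels would then be passed to whatever classifier is being trained. Relabelling keeps the training set size intact, while `strategy="remove"` trades data volume for label purity; which is preferable depends on how scarce vulnerable examples are.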
Findings
Improved the AUC and recall of baselines by up to 8.9% and 23.4%, respectively, with the proposed method.
Demonstrated effectiveness of noisy label learning in security analytics.
Discussed challenges in achieving performance upper bounds with label noise.
Abstract
Data-driven software engineering processes, such as vulnerability prediction, heavily rely on the quality of the data used. In this paper, we observe that it is infeasible to obtain a noise-free security defect dataset in practice. Despite the vulnerable class, the non-vulnerable modules are difficult to be verified and determined as truly exploit free given the limited manual efforts available. It results in uncertainty, introduces labeling noise in the datasets and affects conclusion validity. To address this issue, we propose novel learning methods that are robust to label impurities and can leverage the most from limited label data: noisy label learning. We investigate various noisy label learning methods applied to software vulnerability prediction. Specifically, we propose a two-stage learning method based on noise cleaning to identify and remediate the noisy samples, which…
Taxonomy
Topics: Software Engineering Research · Software Reliability and Analysis Research · Advanced Malware Detection Techniques
