Reducing the Cost of Training Security Classifier (via Optimized Semi-Supervised Learning)
Rui Shu, Tianpei Xia, Huy Tu, Laurie Williams, Tim Menzies

TL;DR
This paper introduces Dapper, an adaptive semi-supervised learning framework that reduces labeling costs for security classifiers by optimizing pseudo-labeling, classifier parameters, and oversampling techniques, achieving high performance with minimal labeled data.
Contribution
Dapper is a novel framework that combines semi-supervised learning, hyperparameter optimization, and data oversampling to improve security classification with limited labeled data.
Findings
Achieves performance comparable to or better than fully supervised training while using only 10% of the labeled data.
Effectively handles class imbalance with adaptive SMOTE integration.
Demonstrates robustness across multiple security datasets.
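To make the SMOTE finding concrete, here is a minimal sketch of the interpolation idea behind SMOTE-style oversampling: each synthetic minority sample is placed on the line segment between a minority point and one of its nearest minority neighbours. The function name `smote_like_oversample`, the parameters `k` and `n_new`, and the toy data are illustrative assumptions, not taken from the paper. The sketch does not reproduce how Dapper decides when to apply oversampling.

```python
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    Hypothetical helper for illustration only; not the paper's implementation.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base` (excluding itself),
        # ranked by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic


# Toy minority class with two features (made-up values).
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_like_oversample(minority, n_new=6)
```

Because every synthetic point is a convex combination of two real minority samples, the new points stay inside the region the minority class already occupies instead of being exact duplicates.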
Abstract
Background: Most of the existing machine learning models for security tasks, such as spam detection, malware detection, or network intrusion detection, are built on supervised machine learning algorithms. In such a paradigm, models need a large amount of labeled data to learn the useful relationships between selected features and the target class. However, such labeled data can be scarce and expensive to acquire.
Goal: To help security practitioners train useful security classification models when few labeled training data and many unlabeled training data are available.
Method: We propose an adaptive framework called Dapper, which optimizes 1) semi-supervised learning algorithms to assign pseudo-labels to unlabeled data in a propagation paradigm and 2) the machine learning classifier (i.e., random forest). When the dataset class is highly imbalanced, Dapper then adaptively integrates…
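The two-stage method described in the abstract can be sketched roughly as follows: a propagation-style semi-supervised learner assigns pseudo-labels to the unlabeled rows, and a random forest is then trained on the fully labeled result. This is a sketch under stated assumptions, not the authors' implementation. It uses scikit-learn's `LabelPropagation` as one example of a propagation algorithm (the excerpt does not say which algorithms Dapper uses), fixes the hyperparameters instead of optimizing them as Dapper does, and runs on synthetic data. The function name `pseudo_label_and_train` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.semi_supervised import LabelPropagation


def pseudo_label_and_train(X, y_partial, n_neighbors=7, n_estimators=100, seed=0):
    """Pseudo-label unlabeled rows, then fit a random forest on all rows.

    y_partial marks unlabeled rows with -1 (scikit-learn's convention).
    The hyperparameter values are fixed, assumed defaults; Dapper tunes them.
    """
    propagator = LabelPropagation(kernel="knn", n_neighbors=n_neighbors)
    propagator.fit(X, y_partial)
    y_pseudo = propagator.transduction_  # one label per training row

    clf = RandomForestClassifier(n_estimators=n_estimators, random_state=seed)
    clf.fit(X, y_pseudo)
    return clf, y_pseudo


# Synthetic stand-in for a security dataset; keep labels for only 10% of rows.
X, y = make_classification(n_samples=400, n_features=10, n_informative=5,
                           random_state=0)
rng = np.random.default_rng(0)
labeled_idx = rng.choice(len(y), size=len(y) // 10, replace=False)
y_partial = np.full_like(y, -1)
y_partial[labeled_idx] = y[labeled_idx]

clf, y_pseudo = pseudo_label_and_train(X, y_partial)
```

In this sketch the originally labeled rows keep their labels, because `LabelPropagation` clamps them during propagation. Only the rows marked -1 receive inferred labels.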
Taxonomy
Topics: Spam and Phishing Detection · Network Security and Intrusion Detection · Advanced Malware Detection Techniques
