TL;DR
This paper evaluates several machine learning classifiers and imbalance-handling techniques on cybersecurity datasets, and concludes that each dataset needs its own testing to reach the best performance in imbalanced classification tasks.
Contribution
It compares classifiers and sampling methods on imbalanced cybersecurity datasets and emphasizes the need for approaches tailored to each dataset.
Findings
Imbalance-handling techniques have mixed effects: in some cases they improve performance, in others they degrade it.
Different classifiers perform best on different datasets, indicating no one-size-fits-all solution.
Testing multiple classifiers and techniques is recommended for each new cybersecurity dataset.
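The recommendation above can be sketched as a small grid that tries each classifier with each resampling option. This is only an illustration under stated assumptions, not the paper's experimental setup: the data is synthetic (`make_classification` with about 5% positives), the helper `resample` is hypothetical, only three of the classifiers named in the abstract are included, and F1 on the minority class is an assumed metric.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def resample(X, y, mode, rng):
    """Hypothetical helper: return (X, y) unchanged, over-sampled, or under-sampled.

    Assumes binary labels where class 1 is the minority.
    """
    if mode == "none":
        return X, y
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    if mode == "over":
        # Draw minority rows with replacement until both classes have equal size.
        minority = rng.choice(minority, size=len(majority), replace=True)
    elif mode == "under":
        # Keep a random subset of majority rows equal in size to the minority.
        majority = rng.choice(majority, size=len(minority), replace=False)
    else:
        raise ValueError(f"unknown mode: {mode}")
    idx = rng.permutation(np.concatenate([minority, majority]))
    return X[idx], y[idx]


# Synthetic stand-in for an imbalanced cybersecurity dataset (assumption).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Factories so that every run starts from a fresh, unfitted model.
classifiers = {
    "logistic_regression": lambda: LogisticRegression(max_iter=1000),
    "decision_tree": lambda: DecisionTreeClassifier(random_state=0),
    "random_forest": lambda: RandomForestClassifier(n_estimators=50, random_state=0),
}

rng = np.random.default_rng(0)
results = {}
for clf_name, make_clf in classifiers.items():
    for mode in ("none", "over", "under"):
        # Resample the training split only; the test split keeps its imbalance.
        X_res, y_res = resample(X_train, y_train, mode, rng)
        model = make_clf().fit(X_res, y_res)
        results[(clf_name, mode)] = f1_score(y_test, model.predict(X_test))

# Print combinations from best to worst score.
for key, score in sorted(results.items(), key=lambda kv: -kv[1]):
    print(key, round(score, 3))
```

Resampling is applied to the training split only, so the scores reflect the class ratio a deployed model would face. The ranking from this toy run says nothing about real datasets; the point is that the loop has to be rerun for each new one.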
Abstract
Cybersecurity has become essential worldwide and at all levels, concerning individuals, institutions, and governments. A basic principle in cybersecurity is to be always alert. Therefore, automation is imperative in processes where the volume of daily operations is large. Several cybersecurity applications can be addressed as binary classification problems, including anomaly detection, fraud detection, intrusion detection, spam detection, or malware detection. We present three experiments. In the first experiment, we evaluate single classifiers including Random Forests, Light Gradient Boosting Machine, eXtreme Gradient Boosting, Logistic Regression, Decision Tree, and Gradient Boosting Decision Tree. In the second experiment, we test different sampling techniques including over-sampling, under-sampling, Synthetic Minority Over-sampling Technique, and Self-Paced Ensembling. In the last…
Taxonomy
Methods: Balanced Selection · Logistic Regression
