Cyber Security Data Science: Machine Learning Methods and their Performance on Imbalanced Datasets

Mateo Lopez-Ledezma; Gissel Velarde

arXiv:2505.04204·cs.LG·May 8, 2025

TL;DR

This paper evaluates various machine learning classifiers and imbalance handling techniques on cybersecurity datasets, highlighting the importance of dataset-specific testing for optimal performance in imbalanced classification tasks.

Contribution

It provides a comprehensive comparison of classifiers and sampling methods for imbalanced cybersecurity datasets, emphasizing the need for tailored approaches.

Findings

1. Imbalance-handling techniques have mixed effects, sometimes improving and sometimes degrading performance.

2. Different classifiers perform best on different datasets, indicating no one-size-fits-all solution.

3. Testing multiple classifiers and techniques is recommended for each new cybersecurity dataset.

Abstract

Cybersecurity has become essential worldwide and at all levels, concerning individuals, institutions, and governments. A basic principle in cybersecurity is to be always alert. Therefore, automation is imperative in processes where the volume of daily operations is large. Several cybersecurity applications can be addressed as binary classification problems, including anomaly detection, fraud detection, intrusion detection, spam detection, or malware detection. We present three experiments. In the first experiment, we evaluate single classifiers including Random Forests, Light Gradient Boosting Machine, eXtreme Gradient Boosting, Logistic Regression, Decision Tree, and Gradient Boosting Decision Tree. In the second experiment, we test different sampling techniques including over-sampling, under-sampling, Synthetic Minority Over-sampling Technique, and Self-Paced Ensembling. In the last…
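The two simplest sampling techniques named in the abstract, random over-sampling and random under-sampling, can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the authors' implementation. The function names, the toy data, and the 95/5 class split are assumptions made for the example. SMOTE and Self-Paced Ensembling are not covered here.

```python
import random
from collections import Counter


def random_over_sample(X, y, seed=0):
    """Duplicate rows of smaller classes at random until all classes match the largest one."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        # Draw with replacement so small classes can grow to the target size.
        for _ in range(target - n):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


def random_under_sample(X, y, seed=0):
    """Keep a random subset of each class so all classes match the smallest one."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for label in counts:
        rows = [x for x, lab in zip(X, y) if lab == label]
        # Draw without replacement: no row is kept twice.
        for x in rng.sample(rows, target):
            X_out.append(x)
            y_out.append(label)
    return X_out, y_out


# Hypothetical toy data: 95 benign (0) and 5 malicious (1) samples, one feature each.
X = [[float(i)] for i in range(100)]
y = [0] * 95 + [1] * 5

X_over, y_over = random_over_sample(X, y)
X_under, y_under = random_under_sample(X, y)
print(Counter(y_over))
print(Counter(y_under))
```

In practice, resampling of this kind is applied to the training split only, so that the test split keeps the original class distribution and the reported scores reflect the imbalance the classifier will face.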

Code & Models

Repositories

MateoLopez00/Imbalanced-Learning-Empirical-Evaluation (Official)

Taxonomy

Methods: Balanced Selection · Logistic Regression