Enhancing IoT Cyber Attack Detection in the Presence of Highly Imbalanced Data
Md. Ehsanul Haque, Md. Saymon Hosen Polash, Md Al-Imran, Sanjida Simla, Md Alomgir Hossain, Sarwar Jahan

TL;DR
This paper proposes hybrid sampling techniques combined with machine learning models to improve IoT cyber attack detection accuracy in highly imbalanced datasets, achieving near-perfect classification performance.
Contribution
It introduces hybrid sampling methods tailored for IoT security datasets and evaluates their effectiveness with multiple ML models, highlighting the superior performance of Random Forest and Soft Voting ensemble.
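As a rough illustration of what a soft voting ensemble does, the sketch below averages the class probabilities predicted by several models and picks the class with the highest mean. The function name `soft_vote`, the optional weights, and the example probabilities are hypothetical and not taken from the paper; the authors' actual ensemble configuration is not given in this summary.

```python
from typing import Optional, Sequence


def soft_vote(
    probas: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> int:
    """Return the class index with the highest (weighted) mean probability.

    probas[m][c] is model m's predicted probability for class c.
    """
    if weights is None:
        weights = [1.0] * len(probas)  # equal weight per model by default
    total = sum(weights)
    n_classes = len(probas[0])
    averaged = [
        sum(w * p[c] for w, p in zip(weights, probas)) / total
        for c in range(n_classes)
    ]
    return max(range(n_classes), key=averaged.__getitem__)


# Illustrative probabilities for classes [normal, attack] from three models.
example = [[0.60, 0.40], [0.55, 0.45], [0.10, 0.90]]
predicted = soft_vote(example)
```

In this made-up example two of the three models lean slightly towards "normal", so a hard majority vote would pick class 0, while soft voting picks class 1 because the third model is far more confident.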
Findings
Random Forest achieved a Kappa score of 0.9903 and accuracy of 0.9961.
Soft Voting ensemble achieved an accuracy of 0.9952 and AUC of 0.9997.
Hybrid sampling significantly improves attack detection in imbalanced IoT datasets.
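The general idea of hybrid sampling can be sketched as follows: large classes are undersampled and small classes are oversampled until every class reaches a common size. This is a minimal standard-library sketch that assumes plain random under- and oversampling as stand-ins; the specific hybrid techniques used in the paper are not named in this summary, and `hybrid_resample` and `target_per_class` are hypothetical names.

```python
import random
from collections import defaultdict


def hybrid_resample(X, y, target_per_class, seed=0):
    """Resample so that every class has exactly target_per_class rows.

    Classes larger than the target are randomly undersampled without
    replacement; smaller classes keep all original rows and are padded
    with random duplicates (random oversampling).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for features, label in zip(X, y):
        by_class[label].append(features)

    X_out, y_out = [], []
    for label, rows in by_class.items():
        if len(rows) >= target_per_class:
            chosen = rng.sample(rows, target_per_class)
        else:
            extra = rng.choices(rows, k=target_per_class - len(rows))
            chosen = rows + extra
        X_out.extend(chosen)
        y_out.extend([label] * target_per_class)
    return X_out, y_out
```

Resampling of this kind is normally applied to the training split only, so that the evaluation data keeps its original class distribution.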
Abstract
The rapid growth of Internet of Things (IoT) networks has sharply increased cyber risk, creating a need for effective intrusion detection systems (IDS) that work well with highly imbalanced datasets. Traditional machine learning models struggle to identify attacks when normal traffic far outnumbers attack traffic, which can lead to a high rate of missed threats. For example, the dataset used in this study is strongly imbalanced, with 94,659 instances of the majority class and only 28 instances of the minority class, making rare attacks difficult to detect accurately. This research addresses these challenges with hybrid sampling techniques designed to improve attack detection accuracy on imbalanced IoT data. After applying these techniques, we evaluate the performance of several machine learning models…
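To see why such an imbalance makes plain accuracy misleading, the short calculation below considers only the two class counts quoted in the abstract and a trivial baseline that always predicts the majority class. Treating the problem as two-class is an assumption made purely for illustration (the full dataset may contain more classes), and `cohen_kappa` is a hypothetical helper written for this sketch.

```python
def cohen_kappa(tp, fp, fn, tn):
    """Cohen's kappa from a 2x2 confusion matrix."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    return (observed - expected) / (1 - expected)


majority, minority = 94_659, 28  # class counts quoted in the abstract

# Baseline that labels every sample as the majority class:
# all minority samples are missed (fn) and no positives are predicted.
accuracy = majority / (majority + minority)
kappa = cohen_kappa(tp=0, fp=0, fn=minority, tn=majority)
imbalance_ratio = majority / minority

print(f"imbalance ratio ~ {imbalance_ratio:.0f}:1")
print(f"baseline accuracy = {accuracy:.4f}")  # 0.9997
print(f"baseline kappa    = {kappa:.4f}")     # 0.0000
```

Under these assumptions the baseline reaches about 99.97% accuracy while detecting none of the 28 minority samples, and its Kappa score is 0, which is one reason chance-corrected metrics such as Kappa are reported alongside accuracy.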
Taxonomy
Methods: Logistic Regression · Feature Selection
