Data Curation and Quality Assurance for Machine Learning-based Cyber Intrusion Detection
Haihua Chen, Ngan Tran, Anand Sagar Thumati, Jay Bhuyan, Junhua Ding

TL;DR
This paper emphasizes the importance of data quality in machine learning-based intrusion detection, analyzing datasets and models to improve system performance through data quality assessment.
Contribution
It introduces a data quality evaluation framework for intrusion detection datasets and demonstrates its impact on model performance.
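This summary does not enumerate the framework's quality dimensions, so the following is only a rough sketch of what a few generic dataset checks could look like, not the paper's method. The function name `quality_report`, the record schema (`syscalls`, `label`), and the three indicators chosen (duplicate rate, missing-value rate, class imbalance ratio) are all illustrative assumptions.

```python
from collections import Counter


def quality_report(rows, label_key="label"):
    """Compute a few generic quality indicators for labeled records.

    rows: list of dicts, one per record (hypothetical schema, not from the paper).
    """
    n = len(rows)
    if n == 0:
        raise ValueError("rows must not be empty")

    # Duplicate rate: share of records that exactly repeat an earlier record.
    seen = set()
    duplicates = 0
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)

    # Missing rate: share of field values that are None or an empty string.
    cells = sum(len(row) for row in rows)
    missing = sum(
        1 for row in rows for value in row.values() if value is None or value == ""
    )

    # Imbalance ratio: majority class count divided by minority class count.
    counts = Counter(row.get(label_key) for row in rows)
    imbalance = max(counts.values()) / min(counts.values())

    return {
        "duplicate_rate": duplicates / n,
        "missing_rate": missing / cells if cells else 0.0,
        "imbalance_ratio": imbalance,
        "class_counts": dict(counts),
    }


# Toy records standing in for host-based traces (made up for illustration).
records = [
    {"syscalls": "open read close", "label": "normal"},
    {"syscalls": "open read close", "label": "normal"},  # exact duplicate
    {"syscalls": "execve mmap", "label": "attack"},
    {"syscalls": None, "label": "normal"},  # missing feature value
]
report = quality_report(records)
```

Indicators like these could be computed before training, so that differences in detection accuracy across datasets can be compared against differences in data quality rather than attributed to the model alone.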
Findings
BERT and GPT outperform other models across datasets
Dataset quality varies significantly, affecting detection accuracy
Proposed data quality dimensions guide dataset improvement
Abstract
Intrusion detection is an essential task in the cyber threat environment. Machine learning and deep learning techniques have been applied for intrusion detection. However, most of the existing research focuses on the model work but ignores the fact that poor data quality has a direct impact on the performance of a machine learning system. More attention should be paid to the data work when building a machine learning-based intrusion detection system. This article first summarizes existing machine learning-based intrusion detection systems and the datasets used for building these systems. Then the data preparation workflow and quality requirements for intrusion detection are discussed. To figure out how data and models affect machine learning performance, we conducted experiments on 11 HIDS datasets using seven machine learning models and three deep learning models. The experimental…
Taxonomy
Topics: Network Security and Intrusion Detection · Advanced Malware Detection Techniques · Anomaly Detection Techniques and Applications
Methods: Attention Is All You Need · Linear Layer · Linear Warmup With Linear Decay · Cosine Annealing · WordPiece · Byte Pair Encoding · Attention Dropout · Dropout · Weight Decay
