TL;DR
This paper systematically reviews semi-supervised learning for cyberthreat detection, analyzing the utility of unlabelled data, proposing evaluation requirements, and benchmarking existing methods to identify performance tradeoffs.
Contribution
It formalizes evaluation requirements for semi-supervised learning in cyberthreat detection and provides the first benchmark assessment of nine methods across multiple datasets.
Findings
Unlabelled data can provide statistically significant performance improvements.
The nine benchmarked methods vary in effectiveness, leaving room for improvement.
The proposed framework aids in assessing the benefits of unlabelled data.
Abstract
Machine learning (ML) has become an important paradigm for cyberthreat detection (CTD) in recent years. Substantial research effort has been invested in developing specialized algorithms for CTD tasks. From an operational perspective, however, the progress of ML-based CTD is hindered by the difficulty of obtaining the large sets of labelled data needed to train ML detectors. A potential solution to this problem is semi-supervised learning (SsL), which combines small labelled datasets with large amounts of unlabelled data. This paper aims to systematize existing work on SsL for CTD and, in particular, to understand the utility of unlabelled data in such systems. To this end, we analyze the cost of labelling in various CTD tasks and develop a formal cost model for SsL in this context. Building on this foundation, we formalize a set of requirements for…
