SoK: The Impact of Unlabelled Data in Cyberthreat Detection

Giovanni Apruzzese; Pavel Laskov; Aliya Tastemirova

arXiv:2205.08944 · cs.CR · June 28, 2022

TL;DR

This paper systematically reviews semi-supervised learning for cyberthreat detection, analyzing the utility of unlabelled data, proposing evaluation requirements, and benchmarking existing methods to identify performance tradeoffs.

Contribution

The paper formalizes evaluation requirements for semi-supervised learning in cyberthreat detection and provides the first benchmark assessment of nine methods across multiple datasets.

Findings

01 Unlabelled data can provide statistically significant performance improvements.

02 Current methods show varying effectiveness, with room for improvement.

03 The proposed framework aids in assessing the benefits of unlabelled data.

Abstract

Machine learning (ML) has become an important paradigm for cyberthreat detection (CTD) in recent years. Substantial research effort has been invested in developing specialized algorithms for CTD tasks. From an operational perspective, however, the progress of ML-based CTD is hindered by the difficulty of obtaining the large sets of labelled data needed to train ML detectors. A potential solution to this problem is semi-supervised learning (SsL), which combines small labelled datasets with large amounts of unlabelled data. This paper aims to systematize existing work on SsL for CTD and, in particular, to understand the utility of unlabelled data in such systems. To this end, we analyze the cost of labelling in various CTD tasks and develop a formal cost model for SsL in this context. Building on this foundation, we formalize a set of requirements for…
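To make the idea of combining a small labelled set with a large unlabelled pool concrete, the following is a minimal sketch of one generic SsL technique, self-training with pseudo-labels. It is an illustration only and is not taken from the paper: the nearest-centroid detector, the synthetic two-feature "flows", the `threshold` and `rounds` parameters, and all function names are assumptions chosen to keep the example self-contained.

```python
import random

def centroid(points):
    """Mean of a list of equal-length feature vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fit(X, y):
    """Toy nearest-centroid detector: one centroid per class (0 = benign, 1 = malicious)."""
    return {c: centroid([x for x, lab in zip(X, y) if lab == c]) for c in set(y)}

def predict_with_margin(model, x):
    """Return (predicted label, confidence margin in [0, 1))."""
    d = {c: sq_dist(x, mu) for c, mu in model.items()}
    best, second = sorted(d, key=d.get)[:2]
    margin = (d[second] - d[best]) / (d[second] + d[best] + 1e-12)
    return best, margin

def self_train(X_lab, y_lab, X_unlab, threshold=0.5, rounds=5):
    """Self-training: repeatedly pseudo-label confident unlabelled samples and refit."""
    X, y, pool = list(X_lab), list(y_lab), list(X_unlab)
    model = fit(X, y)
    for _ in range(rounds):
        confident, rest = [], []
        for x in pool:
            label, margin = predict_with_margin(model, x)
            (confident if margin >= threshold else rest).append((x, label))
        if not confident:
            break  # nothing left that the model is confident about
        X += [x for x, _ in confident]
        y += [lab for _, lab in confident]
        pool = [x for x, _ in rest]
        model = fit(X, y)
    return model

def accuracy(model, X, y):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(predict_with_margin(model, x)[0] == lab for x, lab in zip(X, y)) / len(y)

def make_flows(n, center, rng):
    """Synthetic 2-feature samples drawn around (center, center); purely illustrative."""
    return [[rng.gauss(center, 1.0), rng.gauss(center, 1.0)] for _ in range(n)]

rng = random.Random(0)
X_lab = make_flows(3, 0.0, rng) + make_flows(3, 6.0, rng)        # small labelled set
y_lab = [0] * 3 + [1] * 3
X_unlab = make_flows(200, 0.0, rng) + make_flows(200, 6.0, rng)  # large unlabelled pool
X_test = make_flows(200, 0.0, rng) + make_flows(200, 6.0, rng)
y_test = [0] * 200 + [1] * 200

baseline = fit(X_lab, y_lab)                     # labelled data only
ssl_model = self_train(X_lab, y_lab, X_unlab)    # labelled + unlabelled data
print("baseline:", accuracy(baseline, X_test, y_test))
print("self-trained:", accuracy(ssl_model, X_test, y_test))
```

The sketch deliberately trains a labelled-only baseline alongside the self-trained model: whether the unlabelled pool actually helps is an empirical question, and on easy synthetic data like this the two may perform almost identically.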
