Clustering Malware at Scale: A First Full-Benchmark Study

Martin Mocko; Jakub Ševcech; Daniela Chudá

arXiv:2511.23198 · cs.CR · December 3, 2025

Open Access

TL;DR

This study presents the first comprehensive benchmark of malware clustering on large public datasets, with benign samples included in the clustering task, and finds that simple algorithms such as K-Means and BIRCH perform best.

Contribution

The paper presents a full-benchmark study of malware clustering on large public datasets and extends the task to include benign samples, providing new insights into clustering performance.

Findings

01 K-Means and BIRCH outperform other algorithms

02 Including benign samples does not significantly affect clustering quality

03 Clustering quality varies across datasets and industry samples

Abstract

Recent years have shown that malware attacks still happen with high frequency. Malware experts seek to categorize and classify incoming samples to confirm their trustworthiness or prove their maliciousness. One of the ways in which groups of malware samples can be identified is through malware clustering. Despite the efforts of the community, malware clustering which incorporates benign samples has been under-explored. Moreover, despite the availability of larger public benchmark malware datasets, malware clustering studies have avoided fully utilizing these datasets in their experiments, often resorting to small datasets with only a few families. Additionally, the current state-of-the-art solutions for malware clustering remain unclear. In our study, we evaluate malware clustering quality and establish the state-of-the-art on Bodmas and Ember - two large public benchmark malware…
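One simple way to score clustering quality when benign samples are included is to treat "benign" as one more ground-truth label alongside the malware families. The sketch below computes cluster purity in plain Python. Purity is chosen here only for illustration and is not necessarily a metric the paper reports; the function name `cluster_purity`, the labels, and the cluster assignments are hypothetical toy values, not data from Bodmas or Ember.

```python
from collections import Counter, defaultdict


def cluster_purity(true_labels, cluster_ids):
    """Return the fraction of samples that carry the majority label of their cluster.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(true_labels) != len(cluster_ids):
        raise ValueError("label and cluster sequences must have equal length")
    if not true_labels:
        raise ValueError("at least one sample is required")

    # Count how often each ground-truth label occurs inside each cluster.
    members = defaultdict(Counter)
    for label, cluster in zip(true_labels, cluster_ids):
        members[cluster][label] += 1

    # Sum the size of the largest label group in every cluster.
    majority_total = sum(
        counts.most_common(1)[0][1] for counts in members.values()
    )
    return majority_total / len(true_labels)


# Toy example (assumed values): two malware families plus a "benign" class.
labels = ["emotet", "emotet", "zeus", "zeus", "benign", "benign"]
clusters = [0, 0, 1, 1, 1, 2]

purity = cluster_purity(labels, clusters)
print(f"purity = {purity:.3f}")
```

In this toy assignment one benign sample lands in the cluster dominated by the second family, so five of the six samples match their cluster's majority label. Purity rewards over-segmentation (one cluster per sample scores 1.0), so in practice it is usually read alongside a metric that also penalizes splitting a family across clusters.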

Taxonomy

Topics: Advanced Malware Detection Techniques · Network Security and Intrusion Detection · Spam and Phishing Detection