Identifying meaningful clusters in malware data

Renato Cordeiro de Amorim; Carlos David Lopez Ruiz

arXiv:2008.01175·cs.CR·April 26, 2021

TL;DR

This paper presents an iterative feature relevance-based data pre-processing method that enhances cluster separation in malware data, resulting in clearer clusters and higher silhouette scores.

Contribution

The paper introduces a novel iterative pre-processing technique that improves clustering quality by emphasizing more relevant features in malware data.

Findings

01. Clusters became more distinct after applying the method.

02. Silhouette width increased, indicating better-separated clusters.

03. The approach effectively separates overlapping malware groups.
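For readers unfamiliar with the measure named in the findings, the silhouette width of a point compares a, its mean distance to the other members of its own cluster, with b, its smallest mean distance to any other cluster, as s = (b - a) / max(a, b). The following is a minimal sketch of the average silhouette width, assuming Euclidean distance and a hypothetical function name `silhouette_width`; it illustrates the standard definition and is not the authors' code.

```python
from math import dist


def silhouette_width(points, labels):
    """Average silhouette width over all points, using Euclidean distance."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette width needs at least two clusters")

    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes s = 0 by convention
        # a: mean distance to the other members of the point's own cluster
        # (the point's distance to itself is 0, so only the divisor changes)
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(point, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)
```

Values close to 1 suggest compact, well-separated clusters, while values near or below 0 suggest overlapping clusters or misassigned points.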

Abstract

Finding meaningful clusters in drive-by-download malware data is a particularly difficult task. Malware data tends to contain overlapping clusters with wide variations of cardinality. This happens because there can be considerable similarity between malware samples (some are even said to belong to the same family), and these tend to appear in bursts. Clustering algorithms are usually applied to normalised data sets. However, the process of normalisation aims at setting features with different range values to have a similar contribution to the clustering. It does not favour more meaningful features over those that are less meaningful, an effect one should perhaps expect of the data pre-processing stage. In this paper we introduce a method to deal precisely with the problem above. This is an iterative data pre-processing method that helps increase the separation between…
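The contrast the abstract draws between normalisation and feature relevance can be illustrated with a small sketch. It assumes range (min-max) normalisation and a hand-picked, hypothetical weight vector; the helper names `range_normalise` and `apply_weights` are illustrative only. The paper's own iterative procedure is not described in this excerpt, so this is not a reproduction of it.

```python
from math import dist


def range_normalise(rows):
    """Rescale every feature to [0, 1] so all features contribute comparably."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    # Fall back to 1.0 for constant features to avoid dividing by zero.
    spans = [(max(c) - min(c)) or 1.0 for c in columns]
    return [
        tuple((v - lo) / span for v, lo, span in zip(row, lows, spans))
        for row in rows
    ]


def apply_weights(rows, weights):
    """Rescale each feature by a (hypothetical) relevance weight."""
    return [tuple(v * w for v, w in zip(row, weights)) for row in rows]


# Toy data: feature 0 separates two groups (rows 0-1 vs rows 2-3),
# while feature 1 is noise with a much larger range.
rows = [(0.0, 100.0), (1.0, 900.0), (9.0, 200.0), (10.0, 800.0)]

normalised = range_normalise(rows)
weighted = apply_weights(normalised, (1.0, 0.1))  # assumed weights

# After normalisation alone, the noisy feature still dominates:
# rows 0 and 1 (same group) are farther apart than rows 0 and 2.
print(dist(normalised[0], normalised[1]), dist(normalised[0], normalised[2]))

# Down-weighting the noisy feature restores the expected ordering.
print(dist(weighted[0], weighted[1]), dist(weighted[0], weighted[2]))
```

In this toy case normalisation equalises the ranges but leaves the uninformative feature with as much influence as the informative one; only the rescaling by weights brings same-group points closer together than points from different groups.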
