On Normalized Compression Distance and Large Malware
Rebecca Schuller Borbely

TL;DR
This paper investigates the limitations of Normalized Compression Distance (NCD) in classifying large malware files, revealing that popular compression algorithms often lack necessary theoretical properties, and proposes variants to address this issue.
Contribution
The paper identifies practical limitations of NCD with large files and introduces variants of NCD to improve malware classification accuracy.
Findings
Many popular compression algorithms do not appear to satisfy the theoretical properties that NCD's justification relies on.
These theoretical shortcomings become a practical problem when classifying malware with large file sizes.
Proposed NCD variants mitigate problems with large malware files.
Abstract
Normalized Compression Distance (NCD) is a popular tool that uses compression algorithms to cluster and classify data in a wide range of applications. Existing discussions of NCD's theoretical merit rely on certain theoretical properties of compression algorithms. However, we demonstrate that many popular compression algorithms don't seem to satisfy these theoretical properties. We explore the relationship between some of these properties and file size, demonstrating that this theoretical problem is actually a practical problem for classifying malware with large file sizes, and we then introduce some variants of NCD that mitigate this problem.
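As a rough illustration of the file-size effect described in the abstract, the sketch below computes NCD using its usual definition, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed length. It is not the paper's experimental setup. The choice of zlib, the helper names csize and ncd, and the synthetic random inputs are all assumptions made for this example. One property commonly required of the compressor is idempotency, C(xx) ≈ C(x), which would make NCD(x, x) close to 0. DEFLATE, the algorithm behind zlib, only looks back 32 KiB, so for inputs much larger than that window the second copy of x cannot be matched against the first.

```python
import random
import zlib


def csize(data: bytes) -> int:
    """Compressed length in bytes, using zlib at its highest level (an assumed choice)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance with zlib standing in for the compressor C."""
    cx, cy = csize(x), csize(y)
    cxy = csize(x + y)  # compress the concatenation xy
    return (cxy - min(cx, cy)) / max(cx, cy)


# Synthetic, incompressible test inputs (hypothetical data, seeded for repeatability).
rng = random.Random(0)
small = bytes(rng.getrandbits(8) for _ in range(1_000))    # well inside the 32 KiB window
large = bytes(rng.getrandbits(8) for _ in range(200_000))  # far larger than the window

# An idempotent compressor would give NCD(x, x) near 0 in both cases.
print("NCD(small, small) =", ncd(small, small))
print("NCD(large, large) =", ncd(large, large))
```

Under these assumptions, the distance from the small input to itself should come out close to 0, while the distance from the large input to itself should come out close to 1, even though the two arguments are identical. Compressors with larger windows or different designs may behave differently, so the numbers here say nothing about the specific algorithms or NCD variants evaluated in the paper.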
Taxonomy
Topics: Advanced Malware Detection Techniques · Algorithms and Data Compression · Spam and Phishing Detection
