KiloGrams: Very Large N-Grams for Malware Classification
Edward Raff, William Fleming, Richard Zak, Hyrum Anderson, Bill Finlayson, Charles Nicholas, Mark McLean

TL;DR
This paper introduces a fast method for extracting very large n-grams (n ≥ 1024) for malware classification, demonstrating their usefulness in creating interpretable features and industry-compatible signatures.
Contribution
It presents a novel, efficient approach to find top-k frequent large n-grams, enabling their use in malware detection and signature generation.
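For context, the problem being solved can be sketched with a naive exact baseline. This is not the paper's algorithm: it is the straightforward counting approach whose memory use grows with the number of distinct n-grams, which is the cost a faster method has to avoid. The function name `top_k_ngrams` and the tiny corpus are hypothetical, chosen only for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_k_ngrams(files: Iterable[bytes], n: int, k: int) -> List[Tuple[bytes, int]]:
    """Naive baseline: count every byte n-gram exactly, then keep the k most frequent."""
    counts = Counter()
    for data in files:
        # Slide a window of n bytes over the file; files shorter than n contribute nothing.
        for i in range(len(data) - n + 1):
            counts[data[i:i + n]] += 1
    return counts.most_common(k)


# Hypothetical two-file corpus; real inputs would be whole executables.
corpus = [b"\x4d\x5a\x90\x00\x4d\x5a", b"\x4d\x5a\x90\x00"]
top = top_k_ngrams(corpus, n=2, k=1)
```

Every distinct n-gram gets its own dictionary entry here, so the table can become very large on a real corpus, and each stored key is n bytes long.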
Findings
Large n-grams retain predictive power for malware classification.
The method is 60 times faster for small n and scalable to n ≥ 1024.
Large n-grams improve interpretability and signature creation.
Abstract
N-grams have been a common tool for information retrieval and machine learning applications for decades. In nearly all previous works, only a few values of n are tested, with n > 6 being exceedingly rare. Larger values of n are not tested due to computational burden or the fear of overfitting. In this work, we present a method to find the top-k most frequent n-grams that is 60× faster for small n, and can tackle large n ≥ 1024. Despite the unprecedented size of n considered, we show how these features still have predictive ability for malware classification tasks. More important, large n-grams provide benefits in producing features that are interpretable by malware analysis, and can be used to create general purpose signatures compatible with industry standard tools like Yara. Furthermore, the counts of common n-grams in a file may be added as features to…
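To make the Yara connection concrete, here is a minimal sketch of how a single byte n-gram could be written out as a Yara rule using a hex string. This only illustrates the rule format; it is not the paper's signature-generation procedure, and the helper name `ngram_to_yara_rule`, the rule name, and the example bytes are all hypothetical.

```python
def ngram_to_yara_rule(ngram: bytes, rule_name: str) -> str:
    """Format one byte n-gram as a Yara rule that matches files containing it."""
    # Yara hex strings are space-separated byte values inside braces.
    hex_string = " ".join(f"{b:02X}" for b in ngram)
    return (
        f"rule {rule_name}\n"
        "{\n"
        "    strings:\n"
        f"        $ngram = {{ {hex_string} }}\n"
        "    condition:\n"
        "        $ngram\n"
        "}\n"
    )


# Hypothetical 4-byte n-gram; an n-gram of 1024 bytes would be formatted the same way.
rule = ngram_to_yara_rule(b"\x4d\x5a\x90\x00", "example_ngram")
print(rule)
```

A practical signature would likely combine several n-grams in the condition; a single string is used here to keep the format visible.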
Taxonomy
Topics: Advanced Malware Detection Techniques · Anomaly Detection Techniques and Applications · Machine Learning and Data Classification
