Zipf-Gramming: Scaling Byte N-Grams Up to Production Sized Malware Corpora

Edward Raff; Ryan R. Curtin; Derek Everett; Robert J. Joyce; James Holt

arXiv:2511.13808 · cs.CR · November 19, 2025

TL;DR

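This paper introduces Zipf-Gramming, a method for extracting the top-k most frequent byte n-grams from terabyte-scale malware corpora. It runs up to 35× faster than the previous best alternative, and the larger training sets it makes practical yield up to a 30% improvement in AUC at detecting new malware.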

Contribution

We develop a novel Zipfian distribution-based top-k n-gram extraction algorithm that scales to terabyte-sized datasets, enabling more effective malware detection models.
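
As background for the term "Zipfian", the standard textbook form of Zipf's law is given below. The symbols $r$ (frequency rank of an n-gram), $s$ (exponent), and $C$ (normalizing constant) are assumptions chosen for illustration; the paper's own parameterization is not reproduced on this page.

```latex
f(r) = \frac{C}{r^{s}}, \qquad r = 1, 2, 3, \dots, \quad s > 0
```

Under this form, frequency falls off as a power of rank, so a small number of top-ranked n-grams accounts for a large share of all occurrences.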

Findings

01 Up to 35× faster top-k n-gram extraction than the previous best alternative

02 Up to 30% improvement in AUC at detecting new malware

03 Scales to terabyte-sized malware corpora

Abstract

A classifier using byte n-grams as features is the only approach we have found fast enough to meet requirements in size (sub 2 MB), speed (multiple GB/s), and latency (sub 10 ms) for deployment in numerous malware detection scenarios. However, we've consistently found that 6-8 grams achieve the best accuracy on our production deployments but have been unable to deploy regularly updated models due to the high cost of finding the top-k most frequent n-grams over terabytes of executable programs. Because the Zipfian distribution well models the distribution of n-grams, we exploit its properties to develop a new top-k n-gram extractor that is up to 35× faster than the previous best alternative. Using our new Zipf-Gramming algorithm, we are able to scale up our production training set and obtain up to 30% improvement in AUC at detecting new malware. We show theoretically and…
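
To make the task in the abstract concrete, here is a minimal sketch of exact top-k byte n-gram extraction by brute-force counting. It is a naive baseline for illustration only, not the Zipf-Gramming algorithm, whose details do not appear on this page. The function name `top_k_byte_ngrams` and the toy `corpus` are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_k_byte_ngrams(blobs: Iterable[bytes], n: int, k: int) -> List[Tuple[bytes, int]]:
    """Exact top-k byte n-grams by brute-force counting (naive baseline, not Zipf-Gramming)."""
    counts: Counter = Counter()
    for blob in blobs:
        # Slide a window of n bytes over each file's contents.
        for i in range(len(blob) - n + 1):
            counts[blob[i:i + n]] += 1
    # Return the k n-grams with the highest occurrence counts.
    return counts.most_common(k)


# Toy stand-ins for executables (hypothetical data); a real corpus would be read from disk.
corpus = [b"\x90\x90\x90\x90\xcc", b"\x90\x90\x90\xc3"]
print(top_k_byte_ngrams(corpus, n=2, k=2))
```

The counter holds one entry per distinct n-gram, and the number of possible distinct n-grams grows as 256^n, so this approach becomes memory-bound well before terabyte scale for 6-8 grams. That is the kind of cost the abstract describes as the obstacle to regularly updated models.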

Taxonomy

Topics: Advanced Malware Detection Techniques · Software Testing and Debugging Techniques · Software Engineering Research