Assemblage: Automatic Binary Dataset Construction for Machine Learning
Chang Liu, Rebecca Saul, Yihao Sun, Edward Raff, Maya Fuchs, Townsend Southard Pantano, James Holt, Kristopher Micinski

TL;DR
Assemblage is a scalable, cloud-based system that automatically constructs large, high-quality datasets of Windows and Linux binaries by crawling, configuring, and building code, supporting better machine learning models for binary analysis tasks.
Contribution
Assemblage introduces a reproducible, extensible platform for automatic binary dataset creation, addressing the limitations of existing corpora for binary analysis research.
Findings
Produced 890k Windows PE binaries and 428k Linux ELF binaries.
Enabled training of modern binary analysis models with high-quality datasets.
Demonstrated the importance of robust corpora for machine learning in binary analysis.
Abstract
Binary code is pervasive, and binary analysis is a key task in reverse engineering, malware classification, and vulnerability discovery. Unfortunately, while there exist large corpora of malicious binaries, obtaining high-quality corpora of benign binaries for modern systems has proven challenging (e.g., due to licensing issues). Consequently, machine learning based pipelines for binary analysis utilize either costly commercial corpora (e.g., VirusTotal) or open-source binaries (e.g., coreutils) available in limited quantities. To address these issues, we present Assemblage: an extensible cloud-based distributed system that crawls, configures, and builds Windows PE binaries to obtain high-quality binary corpuses suitable for training state-of-the-art models in binary analysis. We have run Assemblage on AWS over the past year, producing 890k Windows PE and 428k Linux ELF binaries across…
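The crawl, configure, and build stages named in the abstract can be pictured with a minimal single-process sketch. This is an illustration under stated assumptions, not Assemblage's actual architecture or API: every name here (`BuildConfig`, `BuildTask`, `crawl`, `configure`, `build`, `run_pipeline`) is hypothetical, the URLs are placeholders, and the build step only records metadata rather than invoking a compiler or a cloud worker.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, Iterable, Iterator, List


@dataclass(frozen=True)
class BuildConfig:
    """One hypothetical build configuration (field names are illustrative)."""
    platform: str      # e.g. "windows-pe" or "linux-elf"
    optimization: str  # e.g. "O0" or "O2"


@dataclass(frozen=True)
class BuildTask:
    """A repository paired with the configuration it should be built under."""
    repo_url: str
    config: BuildConfig


def crawl(seed_urls: Iterable[str]) -> Iterator[str]:
    """Stand-in crawler: yield each candidate repository URL once."""
    seen = set()
    for url in seed_urls:
        if url not in seen:
            seen.add(url)
            yield url


def configure(repo_url: str, platforms: List[str],
              optimizations: List[str]) -> List[BuildTask]:
    """Expand one repository into one task per (platform, optimization) pair."""
    return [BuildTask(repo_url, BuildConfig(p, o))
            for p, o in product(platforms, optimizations)]


def build(task: BuildTask) -> Dict[str, str]:
    """Stand-in builder: return a metadata record instead of compiling."""
    return {
        "repo": task.repo_url,
        "platform": task.config.platform,
        "optimization": task.config.optimization,
        "status": "queued",  # a real worker would report success or failure
    }


def run_pipeline(seed_urls: Iterable[str], platforms: List[str],
                 optimizations: List[str]) -> List[Dict[str, str]]:
    """Chain the three stages and collect one record per build task."""
    records = []
    for url in crawl(seed_urls):
        for task in configure(url, platforms, optimizations):
            records.append(build(task))
    return records


if __name__ == "__main__":
    seeds = [
        "https://example.com/a.git",
        "https://example.com/a.git",  # duplicate, dropped by crawl()
        "https://example.com/b.git",
    ]
    for record in run_pipeline(seeds, ["windows-pe", "linux-elf"], ["O0", "O2"]):
        print(record)
```

In a distributed setting the inner loop would presumably hand tasks to a queue consumed by separate workers; the sketch keeps everything in one process so the data flow between stages stays visible.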
Taxonomy
Topics: Big Data Technologies and Applications
