ASSEMBLAGE-DEEPHISTORY: A Cross-Build Binary Dataset with Temporal Coverage
Chang Liu, Noah Fleischmann, Nicolò Altamura, Edward Raff, James Holt, Kristopher Micinski

TL;DR
This paper introduces ASSEMBLAGE-DEEPHISTORY, a dataset of 73,610 binaries from 248 open-source projects. Each binary's metadata records its compilation context, package version, and vulnerable functions, which supports analysis of cross-build variation, vulnerabilities, and historical build patterns across projects and compilers.
Contribution
The paper combines cross-build diversity, version history, and CVE labels in a single queryable, multi-compiler, multi-year binary dataset, and demonstrates the dataset with several analyses.
Findings
LLM benchmark for binary vulnerability reasoning
Clustering of package versions using embeddings and fuzzy hashes
Bayesian analysis of factors influencing binary similarity
Abstract
Existing binary corpora typically capture only one or two axes of binary variation: they either provide cross-compiler builds without a temporal axis, or CVE labels for single-build binaries. None combine cross-build diversity, cross-version history, and CVE labels into a queryable structure. We present ASSEMBLAGE-DEEPHISTORY, which consolidates these dimensions into a unified framework where every binary's compilation context, source code, vulnerable functions, and package version are stored as first-class metadata. ASSEMBLAGE-DEEPHISTORY comprises 73,610 binaries spanning 248 open-source projects, compiled across GCC, Clang, and MSVC at multiple optimization levels on Linux and Windows, with multi-year historical builds. Each binary is indexed in a database that links it to its source code, functions, debug info, variant builds, historical versions, and vulnerable functions. Three…
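To make the idea of a queryable index concrete, the following is a minimal sketch of how binaries, build context, and vulnerable functions might be linked in a relational store. It is not the dataset's actual schema: the table names, column names, the `demo_project` rows, and the placeholder CVE identifier are all assumptions made for illustration, and SQLite is used only because it ships with Python.

```python
import sqlite3

# Hypothetical schema: every table and column name here is an illustrative
# assumption, not the layout used by ASSEMBLAGE-DEEPHISTORY.
SCHEMA = """
CREATE TABLE binaries (
    id        INTEGER PRIMARY KEY,
    project   TEXT NOT NULL,
    version   TEXT NOT NULL,
    compiler  TEXT NOT NULL,
    opt_level TEXT NOT NULL,
    platform  TEXT NOT NULL
);
CREATE TABLE functions (
    id        INTEGER PRIMARY KEY,
    binary_id INTEGER NOT NULL REFERENCES binaries(id),
    name      TEXT NOT NULL,
    cve_id    TEXT  -- NULL when the function carries no CVE label
);
"""


def build_demo_db():
    """Create an in-memory database filled with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO binaries VALUES (?, ?, ?, ?, ?, ?)",
        [
            (1, "demo_project", "1.0", "gcc", "O2", "linux"),
            (2, "demo_project", "1.0", "clang", "O0", "linux"),
            (3, "demo_project", "1.1", "gcc", "O2", "linux"),
        ],
    )
    conn.executemany(
        "INSERT INTO functions VALUES (?, ?, ?, ?)",
        [
            # "CVE-0000-0001" is a placeholder, not a real identifier.
            (1, 1, "parse_header", "CVE-0000-0001"),
            (2, 2, "parse_header", "CVE-0000-0001"),
            (3, 3, "parse_header", None),  # patched in version 1.1
        ],
    )
    return conn


def builds_with_vulnerable_function(conn, project, function_name):
    """List (version, compiler, opt_level) for builds where the named
    function carries a CVE label."""
    rows = conn.execute(
        """
        SELECT b.version, b.compiler, b.opt_level
        FROM binaries AS b
        JOIN functions AS f ON f.binary_id = b.id
        WHERE b.project = ? AND f.name = ? AND f.cve_id IS NOT NULL
        ORDER BY b.version, b.compiler
        """,
        (project, function_name),
    )
    return rows.fetchall()
```

A single join of this kind answers a question that spans all three axes at once: which versions and which compiler settings produced a binary containing a labeled vulnerable function.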
