Signature Limits: An Entire Map of Clone Features and their Discovery in Nearly Linear Time
William Casey, Aaron Shelmire

TL;DR
This paper presents a practical method for mapping software code clones in binary data using enhanced suffix data structures, aiding malware analysis and provenance detection with efficient algorithms and similarity measures.
Contribution
Introduces a novel methodology employing enhanced suffix data structures for complete clone feature mapping and provenance relation discovery in binary artifacts.
Findings
Effective clone feature enumeration in malware data
Discovery of provenance relations using new similarity coefficients
Practical approach demonstrated on real malware datasets
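For orientation, the similarity coefficients mentioned above build on the standard Jaccard index, |A ∩ B| / |A ∪ B|. The sketch below is a generic illustration only: the paper's two specific coefficients are not defined on this page, so the feature choice (sets of fixed-length byte n-grams) and the helper names `byte_ngrams` and `jaccard` are assumptions, not the authors' definitions.

```python
def byte_ngrams(data: bytes, n: int = 4) -> set:
    """Return the set of all length-n byte substrings of data (hypothetical feature choice)."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Standard Jaccard index |A & B| / |A | B|; defined here as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Usage: compare two byte strings by their 4-gram sets.
left = byte_ngrams(b"push ebp; mov ebp, esp")
right = byte_ngrams(b"push ebp; mov esp, ebp")
score = jaccard(left, right)  # a value in [0.0, 1.0]
```

A score near 1.0 means the two artifacts share most of their n-grams; how such a score relates to common provenance is a question the paper addresses with its own coefficients.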
Abstract
We address the problem of creating entire and complete maps of software code clones (copy features in data) in a corpus of binary artifacts of unknown provenance. We report on a practical methodology, which employs enhanced suffix data structures and partial orderings of clones to compute a compact representation of the most interesting clone features in data. The enumeration of clone features is useful for malware triage and prioritization when human exploration, testing, and verification are the most costly factors. We further show that the enhanced arrays may be used for discovery of provenance relations in data, and we introduce two distinct Jaccard similarity coefficients to measure code similarity in binary artifacts. We illustrate the use of these tools on real malware data including a retro-diction experiment for measuring and enumerating evidence supporting common provenance in {\it…
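To make the idea of finding clones with suffix structures concrete, here is a minimal sketch of a suffix array with an LCP (longest common prefix) array, used to locate the longest repeated byte sequence in a single buffer. It is not the authors' implementation: the function names are hypothetical, the suffix array is built by naive sorting (far from the nearly linear time the title refers to), and only one buffer is handled rather than a corpus. The LCP step uses Kasai's algorithm, which is linear once the suffix array exists.

```python
def suffix_array(s: bytes) -> list:
    """Naive suffix array: sort suffix start offsets by suffix content (illustration only)."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def lcp_array(s: bytes, sa: list) -> list:
    """Kasai's algorithm: lcp[k] is the common-prefix length of suffixes sa[k-1] and sa[k]."""
    n = len(s)
    rank = [0] * n
    for k, i in enumerate(sa):
        rank[i] = k
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]  # suffix that precedes suffix i in sorted order
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1  # the next suffix shares at least h-1 bytes with its predecessor
        else:
            h = 0
    return lcp


def longest_repeat(s: bytes):
    """Return (clone_bytes, [offset_a, offset_b]) for one longest repeated substring."""
    if len(s) < 2:
        return b"", []
    sa = suffix_array(s)
    lcp = lcp_array(s, sa)
    k = max(range(len(s)), key=lambda i: lcp[i])
    if lcp[k] == 0:
        return b"", []
    return s[sa[k]:sa[k] + lcp[k]], sorted((sa[k - 1], sa[k]))


clone, offsets = longest_repeat(b"xxABCDyyABCDzz")
```

Adjacent entries in the suffix array with a large LCP value mark two places where the same bytes occur, which is the basic signal a clone map is built from. Reporting only the single longest repeat is a simplification chosen to keep the sketch short.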
