Synthetic Datasets for Program Similarity Research
Alexander Interrante-Grant, Michael Wang, Lisa Baer, Ryan Whelan, Tim Leek

TL;DR
This paper introduces HELIX, a framework for generating large, realistic synthetic datasets for program similarity research, addressing the scarcity and quality issues of existing datasets.
Contribution
The paper presents HELIX and Blind HELIX, novel tools for creating and extracting synthetic program similarity datasets with practical ground truth labels.
Findings
HELIX can generate large, realistic datasets for program similarity research.
Program similarity tools perform differently on HELIX datasets than on handcrafted datasets.
Blind HELIX automates extraction of dataset components from library code.
Abstract
Program similarity has become an increasingly popular area of research with various security applications such as plagiarism detection, author identification, and malware analysis. However, program similarity research faces a few unique dataset quality problems in evaluating the effectiveness of novel approaches. First, few high-quality datasets for binary program similarity exist and are widely used in this domain. Second, there are potentially many different, disparate definitions of what makes one program similar to another and in many cases there is often a large semantic gap between the labels provided by a dataset and any useful notion of behavioral or semantic similarity. In this paper, we present HELIX - a framework for generating large, synthetic program similarity datasets. We also introduce Blind HELIX, a tool built on top of HELIX for extracting HELIX components from library…
Taxonomy
Topics: Software Testing and Debugging Techniques · Advanced Malware Detection Techniques · Teaching and Learning Programming
