EMBERSim: A Large-Scale Databank for Boosting Similarity Search in Malware Analysis
Dragos Georgian Corlatescu, Alexandru Dinu, Mihaela Gaman, Paul Sumedrea

TL;DR
EMBERSim is an augmented version of the EMBER malware dataset that includes similarity information and class tags, aiming to facilitate research in malware similarity detection and improve robustness against detection bypass techniques.
Contribution
We introduce EMBERSim, a large-scale malware databank with similarity tags and class labels, enhancing research capabilities in malware similarity and detection.
Findings
Published EMBERSim with similarity-informed tags
Enriched EMBERSim with malware class labels from VirusTotal
Shared implementation of class scoring and leaf similarity methods
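
The findings above mention leaf similarity without defining it on this page. As a rough illustration only, the sketch below assumes the common reading of the term: two samples are scored by the fraction of trees in a tree ensemble in which they land in the same leaf. The function names (`leaf_similarity`, `top_k_similar`) and the hard-coded leaf indices are hypothetical and are not taken from the authors' released implementation.

```python
from typing import List, Sequence, Tuple


def leaf_similarity(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of trees in which two samples fall into the same leaf.

    `a` and `b` hold one leaf index per tree for each sample (assumed input
    format; a trained tree ensemble would normally supply these indices).
    """
    if len(a) != len(b):
        raise ValueError("both samples must have one leaf index per tree")
    if len(a) == 0:
        raise ValueError("at least one tree is required")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def top_k_similar(
    query: Sequence[int], database: List[Sequence[int]], k: int
) -> List[Tuple[int, float]]:
    """Return (row index, similarity) for the k database rows closest to query."""
    scored = [(leaf_similarity(query, row), idx) for idx, row in enumerate(database)]
    # Highest similarity first; ties are broken by the lower row index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(idx, score) for score, idx in scored[:k]]


# Hypothetical leaf indices for three files across four trees.
database = [
    [3, 7, 1, 4],
    [3, 7, 2, 4],
    [0, 5, 2, 9],
]
query = [3, 7, 1, 4]
nearest = top_k_similar(query, database, k=2)
```

Under this assumed definition the score ranges from 0 (no shared leaves) to 1 (identical leaf in every tree), so it can serve as a ranking key for similarity search over both clean and malicious files.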
Abstract
In recent years there has been a shift from heuristics-based malware detection towards machine learning, which proves to be more robust in the current heavily adversarial threat landscape. While we acknowledge machine learning to be better equipped to mine for patterns in the increasingly high amounts of similar-looking files, we also note a remarkable scarcity of the data available for similarity-targeted research. Moreover, we observe that the focus in the few related works falls on quantifying similarity in malware, often overlooking the clean data. This one-sided quantification is especially dangerous in the context of detection bypass. We propose to address the deficiencies in the space of similarity research on binary files, starting from EMBER - one of the largest malware classification data sets. We enhance EMBER with similarity information as well as malware class tags, to…
Taxonomy
Topics: Advanced Malware Detection Techniques · Network Security and Intrusion Detection · Anomaly Detection Techniques and Applications
