A Unified Evaluation of Learning-Based Similarity Techniques for Malware Detection
Udbhav Prasad, Aniesh Chawla

TL;DR
This paper systematically compares various learning-based similarity techniques for malware detection using a unified framework, revealing that combining different methods yields better results than relying on a single approach.
Contribution
It provides the first reproducible benchmark of diverse learning-based similarity methods for malware detection under a unified evaluation framework.
Findings
No single technique outperforms others across all metrics.
Different methods exhibit distinct strengths and trade-offs.
Combining multiple techniques improves malware detection effectiveness.
Abstract
Cryptographic digests (e.g., MD5, SHA-256) are designed to provide exact identity. Any single-bit change in the input produces a completely different hash, which is ideal for integrity verification but limits their usefulness in many real-world tasks like threat hunting, malware analysis and digital forensics, where adversaries routinely introduce minor transformations. Similarity-based techniques address this limitation by enabling approximate matching, allowing related byte sequences to produce measurably similar fingerprints. Modern enterprises manage tens of thousands of endpoints with billions of files, making the effectiveness and scalability of the proposed techniques more important than ever in security applications. Security researchers have proposed a range of approaches, including similarity digests and locality-sensitive hashes (e.g., ssdeep, sdhash, TLSH), as well as more…
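The contrast the abstract draws between exact-identity digests and approximate matching can be illustrated with a small sketch. This is not any of the techniques the paper evaluates: it uses SHA-256 from Python's standard library for the exact digest and, as a deliberately simple stand-in for a similarity method, a Jaccard score over byte 4-gram sets. The function names, the sample buffer, and the n-gram size are all assumptions chosen for illustration.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def bit_difference(a: bytes, b: bytes) -> int:
    """Count the differing bits between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def ngram_set(data: bytes, n: int = 4) -> set:
    """Collect the distinct byte n-grams of data (illustrative feature set)."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def jaccard(a: bytes, b: bytes, n: int = 4) -> float:
    """Jaccard similarity of the two n-gram sets, in the range [0, 1]."""
    sa, sb = ngram_set(a, n), ngram_set(b, n)
    if not sa and not sb:
        return 1.0  # two inputs with no n-grams are treated as identical
    return len(sa & sb) / len(sa | sb)


# Hypothetical sample: a 1 KiB buffer and a copy with a single bit flipped.
original = bytes(range(256)) * 4
tampered = bytearray(original)
tampered[100] ^= 0x01
modified = bytes(tampered)

digest_a = hashlib.sha256(original).digest()
digest_b = hashlib.sha256(modified).digest()

# The exact digests no longer match, while the n-gram similarity stays high.
print("SHA-256 equal:", sha256_hex(original) == sha256_hex(modified))
print("Digest bits changed (of 256):", bit_difference(digest_a, digest_b))
print("4-gram Jaccard similarity:", round(jaccard(original, modified), 3))
```

The single flipped bit makes the two SHA-256 digests unrelated, while the n-gram score stays close to 1 because only the few n-grams overlapping the changed byte differ. Tools such as ssdeep, sdhash, and TLSH produce compact fingerprints rather than comparing raw n-gram sets, so this sketch conveys only the general idea of approximate matching.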
Taxonomy
Topics: Advanced Malware Detection Techniques · Digital and Cyber Forensics · Network Security and Intrusion Detection
