AsmRAG: LLM-Driven Malware Detection by Retrieving Functionally Similar Assembly Code
ElMouatez Billah Karbab

TL;DR
AsmRAG is a retrieval-based malware detection framework that embeds assembly functions with a code-specialized LLM. It returns interpretable evidence, resists obfuscation, and reaches a 96% detection F1-score on 40k binaries.
Contribution
The paper presents AsmRAG, a novel assembly-level retrieval-augmented generation system that improves the interpretability and robustness of malware detection relative to traditional classifiers.
Findings
Achieves a 96% detection F1-score on 40k binaries.
Maintains robustness against metamorphic obfuscation.
Provides verifiable forensic evidence for malware analysis.
Abstract
Deep learning malware detectors achieve high classification accuracy but suffer from severe interpretability limitations, typically returning probabilistic verdicts that lack forensic context. We introduce AsmRAG, a framework performing malware analysis through Assembly-Level Retrieval-Augmented Generation. Unlike classifiers built on global statistical features, AsmRAG reformulates detection as an evidence-based retrieval task. The system uses a code-specialized Large Language Model (LLM) to analyze assembly functions and convert them into semantic embeddings. This process constructs a searchable knowledge base resilient to syntactic obfuscation. For inference, we propose a Density-Weighted Anchor Selection mechanism that isolates the primary unit of malicious logic within a binary to extract verifiable forensic evidence and resist evasion attempts. Testing on a curated dataset of 40k…
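The retrieval idea in the abstract can be illustrated with a minimal sketch. It assumes each assembly function has already been turned into an embedding vector (here, hand-written toy vectors stand in for LLM output) and that the knowledge base stores labeled embeddings. The scoring rule below, the fraction of malicious neighbors multiplied by their mean similarity, is only one plausible reading of "density-weighted" and is not taken from the paper. The names `top_k`, `anchor_score`, and `select_anchor` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query, kb, k):
    """Return the k most similar knowledge-base entries as (similarity, label)."""
    scored = [(cosine(query, emb), label) for emb, label in kb]
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:k]

def anchor_score(neighbors):
    """Assumed scoring rule: malicious-neighbor fraction times their mean similarity."""
    malicious = [sim for sim, label in neighbors if label == "malicious"]
    if not neighbors or not malicious:
        return 0.0
    density = len(malicious) / len(neighbors)
    return density * (sum(malicious) / len(malicious))

def select_anchor(function_embeddings, kb, k=3):
    """Pick the function whose neighborhood looks most densely malicious.

    Returns (index, score, neighbors); the neighbors serve as retrieved evidence.
    """
    best = (None, -1.0, [])
    for i, emb in enumerate(function_embeddings):
        neighbors = top_k(emb, kb, k)
        score = anchor_score(neighbors)
        if score > best[1]:
            best = (i, score, neighbors)
    return best

# Toy knowledge base: two malicious and two benign function embeddings.
kb = [
    ([1.0, 0.0], "malicious"),
    ([0.9, 0.1], "malicious"),
    ([0.0, 1.0], "benign"),
    ([0.1, 0.9], "benign"),
]
# Toy binary with two functions; the second sits close to the malicious cluster.
functions = [[0.0, 1.0], [1.0, 0.05]]
idx, score, evidence = select_anchor(functions, kb, k=2)
print(idx, round(score, 3), [label for _, label in evidence])
```

In this toy setup the second function is selected as the anchor, and its retrieved neighbors are the evidence a reviewer could inspect. A real system would replace the brute-force scan in `top_k` with an approximate nearest-neighbor index and would pass the retrieved functions to an LLM for explanation.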
