TL;DR
This paper benchmarks NPM malicious package detection tools on a common dataset, identifies the structural factors that determine their performance, and proposes effective tool combinations.
Contribution
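It introduces a dataset of 6,420 malicious and 7,288 benign packages annotated with 11 behavior categories and 8 evasion techniques, evaluates 8 detection tools across 13 variants, and inspects each tool's source code to explain the structural mechanisms behind its detection performance.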
Findings
GuardDog achieves an F1 score of 93.32%, the best precision-recall balance among the evaluated tools.
Behavioral chains significantly improve malicious intent detection accuracy.
Strategic tool combinations can reach over 96% accuracy and 95% F1 score.
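To make the metrics in these findings concrete, the following is a minimal sketch of how accuracy and F1 can be computed for individual tools and for a combination of their verdicts. It is illustrative only: the tool names (`tool_a`, `tool_b`, `tool_c`), the toy labels, and the majority-vote and any-flag rules are assumptions made for this example, not the combination strategy or the data used in the paper.

```python
from typing import Dict, List


def accuracy(y_true: List[int], y_pred: List[int]) -> float:
    """Fraction of packages whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true: List[int], y_pred: List[int]) -> float:
    """Harmonic mean of precision and recall for the malicious class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def combine_majority(verdicts: Dict[str, List[int]]) -> List[int]:
    """Flag a package only if more than half of the tools flag it (assumed rule)."""
    n_tools = len(verdicts)
    return [1 if 2 * sum(col) > n_tools else 0 for col in zip(*verdicts.values())]


def combine_any(verdicts: Dict[str, List[int]]) -> List[int]:
    """Flag a package if at least one tool flags it (favors recall over precision)."""
    return [1 if any(col) else 0 for col in zip(*verdicts.values())]


# Hypothetical toy data: 1 = malicious, 0 = benign.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
verdicts = {
    "tool_a": [1, 1, 1, 0, 1, 0, 0, 0],
    "tool_b": [1, 1, 0, 0, 0, 0, 0, 0],
    "tool_c": [1, 0, 1, 1, 0, 0, 1, 0],
}

if __name__ == "__main__":
    for name, preds in verdicts.items():
        print(name, round(accuracy(labels, preds), 3), round(f1_score(labels, preds), 3))
    for rule in (combine_majority, combine_any):
        combined = rule(verdicts)
        print(rule.__name__, round(accuracy(labels, combined), 3),
              round(f1_score(labels, combined), 3))
```

On this toy data the majority vote removes the false positives that individual tools raise, while the any-flag rule recovers every malicious package at the cost of more false alarms. This is the kind of precision-recall trade-off a combination strategy has to balance.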
Abstract
The NPM ecosystem has become a primary target for software supply chain attacks, yet existing detection tools are evaluated in isolation on incompatible datasets, making cross-tool comparison unreliable. We conduct a benchmark-driven empirical analysis of NPM malware detection, building a dataset of 6,420 malicious and 7,288 benign packages annotated with 11 behavior categories and 8 evasion techniques, and evaluating 8 tools across 13 variants. Unlike prior work, we complement quantitative evaluation with source-code inspection of each tool to expose the structural mechanisms behind its performance. Our analysis reveals five key findings. Tool precision-recall positions are structurally determined by how each tool resolves the ambiguity between what code can do and what it intends to do, with GuardDog achieving the best balance at 93.32% F1. A single API call carries no directional…
