Understanding NPM Malicious Package Detection: A Benchmark-Driven Empirical Analysis

Wenbo Guo; Zhongwen Chen; Zhengzi Xu; Chengwei Liu; Ming Kang; Shiwen Song; Chengyue Liu; Yijia Xu; Weisong Sun; Yang Liu

arXiv:2603.27549·cs.SE·March 31, 2026

TL;DR

This paper benchmarks 8 NPM malicious package detection tools (13 variants) on a shared dataset of 6,420 malicious and 7,288 benign packages, identifies the structural factors that determine each tool's performance, and proposes effective tool combinations.

Contribution

The paper introduces a dataset annotated with 11 behavior categories and 8 evasion techniques, evaluates all tools on it, and inspects each tool's source code to explain the structural and behavioral mechanisms behind its detection effectiveness.

Findings

01 GuardDog achieves an F1 score of 93.32%, the best among the evaluated tools.

02 Behavioral chains significantly improve the accuracy of malicious intent detection.

03 Strategic tool combinations can reach over 96% accuracy and a 95% F1 score.
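
To make the third finding concrete, the sketch below shows how per-tool verdicts could be merged and how accuracy and F1 are then computed for the merged verdict. It is a minimal illustration, not the paper's procedure: the function names (combine_any, combine_majority, score), the tool names, and the six-package toy data are all hypothetical, and the union and majority rules are generic choices rather than the combinations the authors evaluated.

```python
from typing import Dict, List


def combine_any(verdicts: Dict[str, List[bool]]) -> List[bool]:
    """Union rule (assumed): flag a package if at least one tool flags it."""
    return [any(column) for column in zip(*verdicts.values())]


def combine_majority(verdicts: Dict[str, List[bool]]) -> List[bool]:
    """Majority rule (assumed): flag a package if more than half the tools flag it."""
    n_tools = len(verdicts)
    return [sum(column) * 2 > n_tools for column in zip(*verdicts.values())]


def score(predicted: List[bool], actual: List[bool]) -> Dict[str, float]:
    """Compute precision, recall, F1, and accuracy for binary verdicts."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    tn = sum(1 for p, a in pairs if not p and not a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Hypothetical toy data: True means "malicious". Not taken from the paper.
labels = [True, True, True, False, False, False]
toy_verdicts = {
    "tool_a": [True, True, False, False, False, False],
    "tool_b": [True, False, True, True, False, False],
    "tool_c": [False, True, True, False, False, True],
}

union_scores = score(combine_any(toy_verdicts), labels)
majority_scores = score(combine_majority(toy_verdicts), labels)
```

On this toy data the union rule catches every malicious package but inherits each tool's false positives, while the majority rule suppresses false positives that only one tool raises. Which rule works best on real packages depends on how the tools' errors overlap.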

Abstract

The NPM ecosystem has become a primary target for software supply chain attacks, yet existing detection tools are evaluated in isolation on incompatible datasets, making cross-tool comparison unreliable. We conduct a benchmark-driven empirical analysis of NPM malware detection, building a dataset of 6,420 malicious and 7,288 benign packages annotated with 11 behavior categories and 8 evasion techniques, and evaluating 8 tools across 13 variants. Unlike prior work, we complement quantitative evaluation with source-code inspection of each tool to expose the structural mechanisms behind its performance. Our analysis reveals five key findings. Tool precision-recall positions are structurally determined by how each tool resolves the ambiguity between what code can do and what it intends to do, with GuardDog achieving the best balance at 93.32% F1. A single API call carries no directional…
