Mind the Gap: Evaluating LLMs for High-Level Malicious Package Detection vs. Fine-Grained Indicator Identification
Ahmed Ryan, Ibrahim Khalil, Abdullah Al Jahid, Md Erfan, Sungbin Park, Akond Ashfaque Ur Rahman, and Md Rayhanur Rahman

TL;DR
This study systematically evaluates 13 LLMs for malicious package detection in open-source repositories, revealing high effectiveness at the package level but significant challenges in identifying specific malicious indicators.
Contribution
The paper provides a comprehensive assessment of LLMs' capabilities in malicious package detection and highlights the granularity gap between package-level binary detection and fine-grained indicator identification.
Findings
GPT-4.1 achieves near-perfect binary detection (F1 ≈ 0.99).
Detection accuracy drops by about 41% when identifying specific malicious indicators.
Model size and context width have negligible impact on detection performance.
Abstract
The prevalence of malicious packages in open-source repositories, such as PyPI, poses a critical threat to the software supply chain. While Large Language Models (LLMs) have emerged as a promising tool for automated security tasks, their effectiveness in detecting malicious packages and indicators remains underexplored. This paper presents a systematic evaluation of 13 LLMs for detecting malicious software packages. Using a curated dataset of 4,070 packages (3,700 benign and 370 malicious), we evaluate model performance across two tasks: binary classification (package detection) and multi-label classification (identification of specific malicious indicators). We further investigate the impact of prompting strategies, temperature settings, and model specifications on detection accuracy. We find a significant "granularity gap" in LLMs' capabilities. While GPT-4.1 achieves near-perfect…
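To make the difference between the two tasks concrete, the following minimal sketch scores a toy set of predictions both ways: once as a binary malicious/benign decision and once as a multi-label indicator assignment. The indicator names, the toy data, and the choice of micro-averaged F1 for the multi-label task are assumptions for illustration; the summary above does not state which multi-label metric the authors use.

```python
from typing import List, Set


def binary_f1(y_true: List[int], y_pred: List[int]) -> float:
    """F1 for the positive (malicious = 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1_multilabel(true_sets: List[Set[str]], pred_sets: List[Set[str]]) -> float:
    """Micro-averaged F1 over (package, indicator) pairs (assumed metric)."""
    tp = sum(len(t & p) for t, p in zip(true_sets, pred_sets))
    fp = sum(len(p - t) for t, p in zip(true_sets, pred_sets))
    fn = sum(len(t - p) for t, p in zip(true_sets, pred_sets))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical indicator labels; an empty set means "benign".
true_indicators = [{"exfiltration", "obfuscation"}, {"remote_exec"}, {"obfuscation"}, set()]
pred_indicators = [{"exfiltration"}, {"obfuscation"}, {"obfuscation", "remote_exec"}, set()]

# Package-level view: a package is malicious if it has any indicator.
true_binary = [int(bool(s)) for s in true_indicators]
pred_binary = [int(bool(s)) for s in pred_indicators]

package_f1 = binary_f1(true_binary, pred_binary)
indicator_f1 = micro_f1_multilabel(true_indicators, pred_indicators)

print(f"package-level F1:   {package_f1:.2f}")
print(f"indicator-level F1: {indicator_f1:.2f}")
```

In this toy case every malicious package is flagged correctly, yet only half of the indicator assignments are right, which is the shape of the granularity gap the abstract describes.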
Taxonomy
Topics: Advanced Malware Detection Techniques · Software Engineering Research · Spam and Phishing Detection
