Mind the Gap: Evaluating LLMs for High-Level Malicious Package Detection vs. Fine-Grained Indicator Identification

Ahmed Ryan; Ibrahim Khalil; Abdullah Al Jahid; Md Erfan; Sungbin Park; Akond Ashfaque Ur Rahman; and Md Rayhanur Rahman

arXiv:2602.16304 · cs.CR · March 3, 2026

Open Access

TL;DR

This study systematically evaluates 13 LLMs for malicious package detection in open-source repositories, revealing high effectiveness at the package level but significant challenges in identifying specific malicious indicators.

Contribution

The paper assesses how well LLMs detect malicious packages and highlights the granularity gap between binary package-level detection and fine-grained indicator identification.

Findings

01. GPT-4.1 achieves near-perfect binary detection (F1 ≈ 0.99).

02. Detection accuracy drops by about 41% when identifying specific malicious indicators.

03. Model size and context window size have negligible impact on detection performance.

Abstract

The prevalence of malicious packages in open-source repositories, such as PyPI, poses a critical threat to the software supply chain. While Large Language Models (LLMs) have emerged as a promising tool for automated security tasks, their effectiveness in detecting malicious packages and indicators remains underexplored. This paper presents a systematic evaluation of 13 LLMs for detecting malicious software packages. Using a curated dataset of 4,070 packages (3,700 benign and 370 malicious), we evaluate model performance across two tasks: binary classification (package detection) and multi-label classification (identification of specific malicious indicators). We further investigate the impact of prompting strategies, temperature settings, and model specifications on detection accuracy. We find a significant "granularity gap" in LLMs' capabilities. While GPT-4.1 achieves near-perfect…
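
To make the two tasks in the abstract concrete, here is a minimal sketch of how a package-level (binary) F1 and an indicator-level (multi-label) F1 could be computed side by side. It is not the authors' evaluation code: the choice of micro-averaged F1 for the multi-label task, the indicator names, and the toy labels are all assumptions made for illustration, and the numbers it produces are not results from the paper.

```python
from typing import Sequence, Set


def binary_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for the malicious class (label 1) in package-level detection."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(true_sets: Sequence[Set[str]], pred_sets: Sequence[Set[str]]) -> float:
    """Micro-averaged F1 over indicator labels (assumed metric, see lead-in)."""
    tp = fp = fn = 0
    for truth, pred in zip(true_sets, pred_sets):
        tp += len(truth & pred)   # indicators correctly reported
        fp += len(pred - truth)   # indicators reported but not present
        fn += len(truth - pred)   # indicators present but missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical toy data: four packages, the first two malicious.
pkg_true = [1, 1, 0, 0]
pkg_pred = [1, 1, 0, 0]

# Hypothetical indicator names; the paper's label set is not given here.
ind_true = [{"exfiltration", "obfuscation"}, {"remote_download"}, set(), set()]
ind_pred = [{"exfiltration"}, {"obfuscation"}, set(), set()]

package_f1 = binary_f1(pkg_true, pkg_pred)
indicator_f1 = micro_f1(ind_true, ind_pred)

print(f"package-level F1:   {package_f1:.2f}")
print(f"indicator-level F1: {indicator_f1:.2f}")
```

In this toy case every package is classified correctly, yet only one of three true indicators is recovered and one is reported in error, so the indicator-level score is much lower than the package-level score. That is the shape of the "granularity gap" the abstract describes: a model can flag the right packages while still misattributing why they are malicious.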

Taxonomy

Topics: Advanced Malware Detection Techniques · Software Engineering Research · Spam and Phishing Detection