An Analysis of Malicious Packages in Open-Source Software in the Wild

Xiaoyan Zhou; Ying Zhang; Wenjia Niu; Jiqiang Liu; Haining Wang; Qiang Li

arXiv:2404.04991 · cs.CR · April 18, 2025 · 2 cites

TL;DR

This paper constructs a large dataset of malicious open-source packages, analyzes malware diversity and reuse, and highlights the importance of diverse data sources and security reports for understanding OSS malware threats.

Contribution

It introduces the largest malware dataset for OSS, proposes a knowledge graph for malware analysis, and provides insights into malware reuse, dependency hiding, and data source importance.

Findings

01. Low malware diversity due to code reuse

02. Dependency-hidden malware has shorter active periods

03. Security reports are crucial for understanding malware context

Abstract

The open-source software (OSS) ecosystem suffers from security threats caused by malware. However, OSS malware research has three limitations: a lack of high-quality datasets, a lack of malware diversity, and a lack of attack campaign contexts. In this paper, we first build the largest dataset of 24,356 malicious packages from online sources, then propose a knowledge graph to represent the OSS malware corpus and conduct malware analysis in the wild. Our main findings include (1) it is essential to collect malicious packages from various online sources because their data overlapping degrees are small; (2) despite the sheer volume of malicious packages, many reuse similar code, leading to a low diversity of malware; (3) only 28 malicious packages were repeatedly hidden via dependency libraries of 1,354 malicious packages, and dependency-hidden malware has a shorter active time; (4) security…
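
Finding (1) rests on measuring how much the online sources overlap. The abstract does not say how the overlapping degree is computed, so the following is only a minimal sketch that assumes a Jaccard index over package identifiers. The source names, the `(registry, name, version)` key, and the sample packages are all hypothetical and are not taken from the paper's dataset.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Packages common to both sources, divided by packages found in either."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_overlap(sources: dict) -> dict:
    """Jaccard overlap for every unordered pair of source names."""
    return {
        (x, y): jaccard(sources[x], sources[y])
        for x, y in combinations(sorted(sources), 2)
    }


# Hypothetical toy data: each package is keyed by (registry, name, version).
sources = {
    "source_a": {
        ("npm", "pkg-one", "1.0.0"),
        ("pypi", "pkg-two", "0.1"),
        ("pypi", "pkg-three", "2.0"),
    },
    "source_b": {
        ("pypi", "pkg-two", "0.1"),
        ("rubygems", "pkg-four", "1.0"),
    },
    "source_c": {
        ("npm", "pkg-five", "3.0"),
    },
}

overlaps = pairwise_overlap(sources)
for pair, score in overlaps.items():
    print(pair, round(score, 2))
```

Under this assumed measure, scores near zero for most pairs would mean that each source contributes packages the others miss, which is the situation the abstract describes.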

Code & Models

Repositories

datadog/malicious-software-packages-dataset
none · Official

Models

🤗 schirrmacher/malwi
model · 3 downloads

Taxonomy

Topics: Advanced Malware Detection Techniques · Digital and Cyber Forensics · Network Security and Intrusion Detection