A Machine Learning-Based Approach For Detecting Malicious PyPI Packages

Haya Samaana (1), Diego Elias Costa (2), Emad Shihab (2), Ahmad Abdellatif (3) ((1) An Najah National University, Nablus, Palestine; (2) Concordia University, Montreal, Quebec, Canada; (3) University of Calgary, Calgary, Alberta, Canada)

arXiv:2412.05259·cs.SE·December 9, 2024

Open Access

TL;DR

This paper presents a machine learning-based method that uses static analysis of package metadata, code, and textual features to effectively detect malicious packages in the PyPI ecosystem, enhancing security and reducing manual review.
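As a rough illustration of what static analysis of code and metadata can look like, the sketch below parses a source string with Python's `ast` module and counts a few patterns, without ever executing the package. The watch lists (`SUSPICIOUS_CALLS`, `SUSPICIOUS_MODULES`), the function names, and the metadata keys are assumptions made for this example; the summary does not list the paper's actual feature set.

```python
import ast

# Hypothetical watch lists for illustration only; not the paper's features.
SUSPICIOUS_CALLS = {"eval", "exec", "compile", "__import__"}
SUSPICIOUS_MODULES = {"subprocess", "socket", "base64", "ctypes"}


def extract_code_features(source: str) -> dict:
    """Count suspicious calls and imports by walking the syntax tree."""
    tree = ast.parse(source)  # parses only; the code is never run
    features = {"suspicious_calls": 0, "suspicious_imports": 0}
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            # Direct calls by bare name, e.g. exec(...)
            if node.func.id in SUSPICIOUS_CALLS:
                features["suspicious_calls"] += 1
        elif isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in SUSPICIOUS_MODULES:
                    features["suspicious_imports"] += 1
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.module.split(".")[0] in SUSPICIOUS_MODULES:
                features["suspicious_imports"] += 1
    return features


def extract_metadata_features(metadata: dict) -> dict:
    """Turn a metadata dict (assumed keys) into simple numeric features."""
    description = metadata.get("description") or ""
    return {
        "description_length": len(description),
        "has_homepage": int(bool(metadata.get("home_page"))),
        "has_author_email": int(bool(metadata.get("author_email"))),
    }


sample = "import base64, subprocess\nexec(base64.b64decode('cHJpbnQoMSk='))\n"
code_features = extract_code_features(sample)
meta_features = extract_metadata_features({"description": "", "home_page": None})
```

Feature dictionaries like these could be concatenated into one vector per package and passed to a classifier. The counts are deliberately crude: a legitimate package may import `subprocess`, so such features are signals for a model to weigh, not verdicts.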

Contribution

It introduces a novel data-driven approach employing a stacking ensemble classifier for identifying malicious Python packages, achieving high accuracy in real-world evaluations.

Findings

01. Achieved an F1-measure of 0.94 in detecting malicious packages

02. The approach can be integrated into existing vetting pipelines

03. It can flag entire packages, not just individual malicious functions
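For readers unfamiliar with the metric in the first finding, the F1-measure is the harmonic mean of precision and recall. The sketch below computes it from confusion-matrix counts. The counts in the example are invented to show the arithmetic and are not taken from the paper's evaluation.

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when undefined."""
    predicted_pos = true_pos + false_pos
    actual_pos = true_pos + false_neg
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only (not from the paper): 94 malicious packages
# caught, 6 benign packages wrongly flagged, 6 malicious packages missed.
example_f1 = f1_score(true_pos=94, false_pos=6, false_neg=6)
```

With these made-up counts, precision and recall are both 0.94, so the F1-measure is also 0.94. Many different combinations of precision and recall yield the same F1 value, which is why the score alone does not reveal the balance between false alarms and missed detections.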

Abstract

Background. In modern software development, the use of external libraries and packages is increasingly prevalent, streamlining the software development process and enabling developers to deploy feature-rich systems with little coding. While this reliance on reusing code offers substantial benefits, it also introduces serious risks for deployed software in the form of malicious packages: harmful and vulnerable code disguised as useful libraries. Aims. Popular ecosystems, such as PyPI, receive thousands of new package contributions every week, and distinguishing safe contributions from harmful ones presents a significant challenge. There is a dire need for reliable methods to detect and address the presence of malicious packages in these environments. Method. To address these challenges, we propose a data-driven approach that uses machine learning and static analysis to examine the…

Taxonomy

Topics: Advanced Malware Detection Techniques · Digital Media Forensic Detection · Digital and Cyber Forensics