A Machine Learning-Based Approach For Detecting Malicious PyPI Packages
Haya Samaana (1), Diego Elias Costa (2), Emad Shihab (2), Ahmad Abdellatif (3) ((1) An Najah National University, Nablus, Palestine, (2) Concordia University, Montreal, Quebec, Canada, (3) University of Calgary, Calgary, Alberta, Canada)

TL;DR
This paper presents a machine learning-based method that uses static analysis of package metadata, code, and textual features to effectively detect malicious packages in the PyPI ecosystem, enhancing security and reducing manual review.
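To make the idea of static code features concrete, here is a minimal sketch that counts a few signals in Python source without executing it. The watch-lists `SUSPICIOUS_MODULES` and `SUSPICIOUS_CALLS`, the function name `extract_code_features`, and the feature names are assumptions chosen for illustration; the summary does not state which features the paper extracts.

```python
import ast

# Hypothetical watch-lists for illustration only; not the paper's feature set.
SUSPICIOUS_MODULES = {"subprocess", "socket", "base64", "ctypes"}
SUSPICIOUS_CALLS = {"eval", "exec", "compile", "__import__"}


def extract_code_features(source: str) -> dict:
    """Count statically visible signals in Python source without running it."""
    tree = ast.parse(source)
    features = {"suspicious_imports": 0, "suspicious_calls": 0, "num_functions": 0}
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            # `import a.b, c` -> check the top-level name of each alias.
            for alias in node.names:
                if alias.name.split(".")[0] in SUSPICIOUS_MODULES:
                    features["suspicious_imports"] += 1
        elif isinstance(node, ast.ImportFrom):
            # `from a.b import c` -> check the top-level module name.
            if node.module and node.module.split(".")[0] in SUSPICIOUS_MODULES:
                features["suspicious_imports"] += 1
        elif isinstance(node, ast.Call):
            # Only direct calls by bare name, e.g. exec(...), are counted.
            if isinstance(node.func, ast.Name) and node.func.id in SUSPICIOUS_CALLS:
                features["suspicious_calls"] += 1
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            features["num_functions"] += 1
    return features


# A made-up snippet resembling an obfuscated payload loader.
sample = (
    "import base64, os\n"
    "from subprocess import Popen\n"
    "def run(payload):\n"
    "    exec(base64.b64decode(payload))\n"
)
print(extract_code_features(sample))
```

A vector like this, combined with metadata and textual features, is the kind of input a classifier could be trained on. Because the source is parsed rather than imported, nothing in the inspected package runs.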
Contribution
It introduces a novel data-driven approach employing a stacking ensemble classifier for identifying malicious Python packages, achieving high accuracy in real-world evaluations.
Findings
Achieved an F1-measure of 0.94 in detecting malicious packages
The approach can be integrated into existing vetting pipelines
It can flag entire packages, not just individual malicious functions
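For readers unfamiliar with the F1-measure reported above, it is the harmonic mean of precision and recall. The sketch below computes it from confusion-matrix counts; the counts in the usage line are hypothetical and chosen only so the arithmetic is easy to follow. They are not the paper's evaluation data.

```python
def f1_measure(tp: int, fp: int, fn: int) -> float:
    """F1 = 2 * precision * recall / (precision + recall)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 94 malicious packages caught, 6 benign packages
# wrongly flagged, 6 malicious packages missed.
score = f1_measure(tp=94, fp=6, fn=6)
print(round(score, 2))
```

With these illustrative counts, precision and recall are both 0.94, so their harmonic mean is also 0.94.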
Abstract
Background. In modern software development, the use of external libraries and packages is increasingly prevalent, streamlining the software development process and enabling developers to deploy feature-rich systems with little coding. While this reliance on reusing code offers substantial benefits, it also introduces serious risks for deployed software in the form of malicious packages - harmful and vulnerable code disguised as useful libraries. Aims. Popular ecosystems, such as PyPI, receive thousands of new package contributions every week, and distinguishing safe contributions from harmful ones presents a significant challenge. There is a dire need for reliable methods to detect and address the presence of malicious packages in these environments. Method. To address these challenges, we propose a data-driven approach that uses machine learning and static analysis to examine the…
Taxonomy
Topics: Advanced Malware Detection Techniques · Digital Media Forensic Detection · Digital and Cyber Forensics
