PyRadar: Towards Automatically Retrieving and Validating Source Code Repository Information for PyPI Packages
Kai Gao, Weiwei Xu, Wenhao Yang, Minghui Zhou

TL;DR
PyRadar is a comprehensive framework that improves retrieval and validation of source code repository information for PyPI packages by combining metadata analysis, machine learning validation, and source code matching.
Contribution
This paper introduces PyRadar, a framework that retrieves and validates source code repository information for PyPI packages more accurately than existing metadata-based tools.
Findings
Metadata-based retrieval covers 72.1% of packages.
Machine learning validation achieves an AUC of 0.995.
Source code matching retrieves 90.2% of repositories with 97% accuracy.
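To make the metadata-based retrieval step concrete, here is a minimal sketch, not PyRadar's actual implementation. It assumes the input dictionary mirrors the `info` object of the PyPI JSON API (with `project_urls`, `home_page`, and `download_url` fields). The function name `find_repo_url`, the regular expression, and the list of hosting platforms are hypothetical choices made for illustration.

```python
import re
from typing import Optional

# Assumption: only these three hosting platforms are recognized, and a
# repository URL has the shape https://<host>/<owner>/<name>.
_REPO_RE = re.compile(
    r"https?://(?:www\.)?(github\.com|gitlab\.com|bitbucket\.org)"
    r"/([\w.-]+)/([\w.-]+)",
    re.IGNORECASE,
)


def find_repo_url(info: dict) -> Optional[str]:
    """Return a normalized repository URL found in metadata, or None.

    `info` is assumed to look like the `info` object of the PyPI JSON API.
    """
    # Collect candidate URLs: project_urls values first, then other fields.
    candidates = list((info.get("project_urls") or {}).values())
    candidates.append(info.get("home_page"))
    candidates.append(info.get("download_url"))

    for url in candidates:
        if not url:
            continue
        match = _REPO_RE.search(url)
        if match:
            host, owner, name = match.groups()
            # Drop a trailing ".git" so clone URLs and web URLs normalize alike.
            if name.endswith(".git"):
                name = name[:-4]
            return f"https://{host.lower()}/{owner}/{name}"

    # The metadata contains no recognizable repository link.
    return None
```

A sketch like this returns `None` when the metadata has no repository link, and it cannot tell whether a link it does find is the correct one. Those are the two limitations of metadata-only retrieval that the abstract describes.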
Abstract
A package's source code repository records the development history of the package, providing indispensable information for the use and risk monitoring of the package. However, a package release often misses its source code repository due to the separation of the package's development platform from its distribution platform. Existing tools retrieve the release's repository information from its metadata, which suffers from two limitations: the metadata may omit the repository information or contain wrong information. Our analysis shows that existing tools can only retrieve repository information for up to 70.5% of PyPI releases. To address these limitations, this paper proposes PyRadar, a novel framework that utilizes the metadata and source distribution to retrieve and validate the repository information for PyPI releases. We start with an empirical study to compare four existing tools on 4,227,425 PyPI…
Taxonomy
Topics: Web Data Mining and Analysis · Digital and Cyber Forensics · Scientific Computing and Data Management
