npm-follower: A Complete Dataset Tracking the NPM Ecosystem
Donald Pinckney, Federico Cassano, Arjun Guha, Jonathan Bell

TL;DR
npm-follower is a comprehensive dataset that archives all NPM packages and versions, including deleted ones, enabling researchers to analyze the entire ecosystem's structure, security, and malware concerns.
Contribution
It introduces a novel crawling architecture that captures and retains all package data from NPM, including deleted versions, addressing limitations of prior datasets.
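The idea of retaining deleted versions can be illustrated with a minimal sketch. This is not the authors' crawler: the `Archive` and `VersionRecord` classes, the `apply_snapshot` method, and the snapshot shape (a dict mapping version strings to metadata) are all hypothetical names chosen for illustration. The sketch only shows the retention principle, in which a version that disappears upstream is flagged rather than discarded.

```python
from dataclasses import dataclass, field


@dataclass
class VersionRecord:
    metadata: dict
    deleted: bool = False  # set when the version disappears upstream


@dataclass
class Archive:
    # Maps (package name, version string) to its retained record.
    # Hypothetical in-memory store; a real archive would persist this.
    records: dict = field(default_factory=dict)

    def apply_snapshot(self, package: str, versions: dict) -> None:
        """Merge the currently published versions of one package."""
        for version, metadata in versions.items():
            self.records[(package, version)] = VersionRecord(metadata)
        # Anything archived earlier but missing now is flagged, not dropped.
        for (name, version), record in self.records.items():
            if name == package and version not in versions:
                record.deleted = True

    def deleted_versions(self) -> list:
        """Return the (package, version) keys flagged as deleted."""
        return sorted(key for key, rec in self.records.items() if rec.deleted)


if __name__ == "__main__":
    archive = Archive()
    # First crawl sees two versions of an illustrative package.
    archive.apply_snapshot("example-pkg", {"1.0.0": {}, "1.0.1": {}})
    # Second crawl: 1.0.1 is gone upstream, but its record is kept.
    archive.apply_snapshot("example-pkg", {"1.0.0": {}})
    print(archive.deleted_versions())
```

After the second snapshot, the record for `1.0.1` is still present in `archive.records` with `deleted` set to `True`, which is the property a metadata-only scrape taken after the deletion would lack.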
Findings
Includes over 35 million package versions
Captures data on deleted packages and versions
Grows at about 1 million versions per month
Abstract
Software developers typically rely upon a large network of dependencies to build their applications. For instance, the NPM package repository contains over 3 million packages and serves tens of billions of downloads weekly. Understanding the structure and nature of packages, dependencies, and published code requires datasets that provide researchers with easy access to metadata and code of packages. However, prior work on NPM dataset construction typically has two limitations: 1) only metadata is scraped, and 2) packages or versions that are deleted from NPM cannot be scraped. Over 330,000 versions of packages were deleted from NPM between July 2022 and May 2023. This data is critical for researchers as it often pertains to important questions of security and malware. We present npm-follower, a dataset and crawling architecture which archives metadata and code of all packages and…
Taxonomy
Topics: Scientific Computing and Data Management · Advanced Malware Detection Techniques · Software Engineering Research
