PeaTMOSS: A Dataset and Initial Analysis of Pre-Trained Models in Open-Source Software
Wenxin Jiang, Jerin Yasmin, Jason Jones, Nicholas Synovic, Jiashen Kuo, Nathaniel Bielanski, Yuan Tian, George K. Thiruvathukal, James C. Davis

TL;DR
This paper introduces PeaTMOSS, a comprehensive dataset documenting pre-trained models and their usage in open-source software, enabling analysis of the PTM supply chain and its impact.
Contribution
The paper presents the first large-scale dataset of PTMs, their metadata, and downstream applications, along with automated extraction methods and initial supply chain analysis.
Findings
PTMs are increasingly used in open-source projects.
Common shortcomings exist in PTM documentation.
Inconsistencies in software licenses are prevalent.
Abstract
The development and training of deep learning models have become increasingly costly and complex. Consequently, software engineers are adopting pre-trained models (PTMs) for their downstream applications. The dynamics of the PTM supply chain remain largely unexplored, signaling a clear need for structured datasets that document not only the metadata but also the subsequent applications of these models. Without such data, the MSR community cannot comprehensively understand the impact of PTM adoption and reuse. This paper presents the PeaTMOSS dataset, which comprises metadata for 281,638 PTMs and detailed snapshots for all PTMs with over 50 monthly downloads (14,296 PTMs), along with 28,575 open-source software repositories from GitHub that utilize these models. Additionally, the dataset includes 44,337 mappings from 15,129 downstream GitHub repositories to the 2,530 PTMs they use. To…
Taxonomy
Topics: Scientific Computing and Data Management
