PeaTMOSS: Mining Pre-Trained Models in Open-Source Software
Wenxin Jiang, Jason Jones, Jerin Yasmin, Nicholas Synovic, Rajeev Sashti, Sophie Chen, George K. Thiruvathukal, Yuan Tian, James C. Davis

TL;DR
This paper introduces PeaTMOSS, a comprehensive dataset of pre-trained models and their usage in open-source software, enabling analysis of software engineering practices related to PTMs.
Contribution
The paper presents the first large-scale dataset linking pre-trained models with open-source projects and their usage, facilitating research on software engineering behaviors involving PTMs.
Findings
Dataset includes 281,638 PTMs and 27,270 projects.
Provides a mapping between PTMs and software projects.
Enables study of engineering practices around PTMs.
Abstract
Developing and training deep learning models is expensive, so software engineers have begun to reuse pre-trained deep learning models (PTMs) and fine-tune them for downstream tasks. Despite the widespread use of PTMs, we know little about the corresponding software engineering behaviors and challenges. To enable the study of software engineering with PTMs, we present the PeaTMOSS dataset: Pre-Trained Models in Open-Source Software. PeaTMOSS has three parts: a snapshot of (1) 281,638 PTMs, (2) 27,270 open-source software repositories that use PTMs, and (3) a mapping between PTMs and the projects that use them. We challenge PeaTMOSS miners to discover software engineering practices around PTMs. A demo and link to the full dataset are available at: https://github.com/PurdueDualityLab/PeaTMOSS-Demos.
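The three-part structure described in the abstract (models, repositories, and a mapping between them) can be illustrated with a small relational sketch. The table names, column names, and toy rows below are hypothetical and chosen only for illustration; they are not the actual PeaTMOSS schema or data, which should be taken from the linked demo repository.

```python
import sqlite3

# Hypothetical schema for illustration only; NOT the actual PeaTMOSS schema.
SCHEMA = """
CREATE TABLE model (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE repository (id INTEGER PRIMARY KEY, url TEXT NOT NULL);
CREATE TABLE model_to_repository (
    model_id INTEGER NOT NULL REFERENCES model(id),
    repository_id INTEGER NOT NULL REFERENCES repository(id),
    PRIMARY KEY (model_id, repository_id)
);
"""


def build_toy_db():
    """Create an in-memory database with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO model VALUES (?, ?)",
        [(1, "toy-model-a"), (2, "toy-model-b"), (3, "toy-model-c")],
    )
    conn.executemany(
        "INSERT INTO repository VALUES (?, ?)",
        [(10, "example.org/repo-x"), (11, "example.org/repo-y")],
    )
    # Mapping rows: model a is used by both repositories, model b by one,
    # and model c by none.
    conn.executemany(
        "INSERT INTO model_to_repository VALUES (?, ?)",
        [(1, 10), (1, 11), (2, 11)],
    )
    conn.commit()
    return conn


def reuse_counts(conn):
    """Return (model name, number of repositories using it), most reused first."""
    query = """
        SELECT m.name, COUNT(mr.repository_id) AS n_repos
        FROM model AS m
        LEFT JOIN model_to_repository AS mr ON mr.model_id = m.id
        GROUP BY m.id
        ORDER BY n_repos DESC, m.name ASC
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for name, n_repos in reuse_counts(build_toy_db()):
        print(name, n_repos)
```

The LEFT JOIN keeps models that no repository uses, so a count of zero still appears in the result. A query of this kind is one way a miner might start asking which PTMs are reused most often.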
Taxonomy
Topics: Software Engineering Research · Topic Modeling · Machine Learning and Data Classification
