PeaTMOSS: Mining Pre-Trained Models in Open-Source Software

Wenxin Jiang; Jason Jones; Jerin Yasmin; Nicholas Synovic; Rajeev Sashti; Sophie Chen; George K. Thiruvathukal; Yuan Tian; James C. Davis

arXiv:2310.03620·cs.SE·October 6, 2023

TL;DR

This paper introduces PeaTMOSS, a dataset of pre-trained deep learning models (PTMs), the open-source software repositories that use them, and a mapping between the two, intended to support analysis of software engineering practices around PTMs.

Contribution

The paper presents the first large-scale dataset linking pre-trained models with open-source projects and their usage, facilitating research on software engineering behaviors involving PTMs.

Findings

1. Dataset includes 281,638 PTMs and 27,270 projects.
2. Provides a mapping between PTMs and software projects.
3. Enables study of engineering practices around PTMs.

Abstract

Developing and training deep learning models is expensive, so software engineers have begun to reuse pre-trained deep learning models (PTMs) and fine-tune them for downstream tasks. Despite the widespread use of PTMs, we know little about the corresponding software engineering behaviors and challenges. To enable the study of software engineering with PTMs, we present the PeaTMOSS dataset: Pre-Trained Models in Open-Source Software. PeaTMOSS has three parts: a snapshot of (1) 281,638 PTMs, (2) 27,270 open-source software repositories that use PTMs, and (3) a mapping between PTMs and the projects that use them. We challenge PeaTMOSS miners to discover software engineering practices around PTMs. A demo and link to the full dataset are available at: https://github.com/PurdueDualityLab/PeaTMOSS-Demos.
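
The three-part structure described in the abstract (PTMs, repositories, and a mapping between them) can be pictured as a many-to-many relation. The following is a minimal sketch of that idea using an in-memory SQLite database. The table names (`ptm`, `repo`, `ptm_repo`), the column names, and the toy rows are all hypothetical and chosen only for illustration; they are not the actual PeaTMOSS schema or data, which should be taken from the demo repository linked above.

```python
import sqlite3


def build_toy_db():
    """Create an in-memory database with a hypothetical three-part layout.

    All table names, column names, and rows are illustrative assumptions,
    not the real PeaTMOSS schema or contents.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE ptm (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE repo (id INTEGER PRIMARY KEY, full_name TEXT NOT NULL);
        -- Junction table: one row per (model, repository) usage link.
        CREATE TABLE ptm_repo (
            ptm_id INTEGER NOT NULL REFERENCES ptm(id),
            repo_id INTEGER NOT NULL REFERENCES repo(id),
            PRIMARY KEY (ptm_id, repo_id)
        );
        """
    )
    conn.executemany(
        "INSERT INTO ptm VALUES (?, ?)",
        [(1, "toy-bert"), (2, "toy-resnet"), (3, "toy-unused")],
    )
    conn.executemany(
        "INSERT INTO repo VALUES (?, ?)",
        [(10, "alice/chatbot"), (11, "bob/classifier")],
    )
    conn.executemany(
        "INSERT INTO ptm_repo VALUES (?, ?)",
        [(1, 10), (1, 11), (2, 11)],
    )
    conn.commit()
    return conn


def repos_per_ptm(conn):
    """Return {model name: number of repositories linked to it}.

    The LEFT JOIN keeps models with no linked repository (count 0).
    """
    rows = conn.execute(
        """
        SELECT ptm.name, COUNT(ptm_repo.repo_id)
        FROM ptm
        LEFT JOIN ptm_repo ON ptm.id = ptm_repo.ptm_id
        GROUP BY ptm.id
        ORDER BY ptm.name
        """
    ).fetchall()
    return dict(rows)


if __name__ == "__main__":
    conn = build_toy_db()
    for name, count in repos_per_ptm(conn).items():
        print(f"{name}: used by {count} repositories")
```

A junction table is used here because a single repository may use several PTMs and a single PTM may be used by many repositories; a query over it is one plausible starting point for the kind of mining questions the abstract invites.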

Code & Models

Repositories

purduedualitylab/peatmoss-demos
PyTorch · Official

Taxonomy

Topics: Software Engineering Research · Topic Modeling · Machine Learning and Data Classification