Towards a Classification of Open-Source ML Models and Datasets for Software Engineering
Alexandra González, Xavier Franch, David Lo, Silverio Martínez-Fernández

TL;DR
This paper classifies open-source ML models and datasets for software engineering, revealing trends, dominant tasks like code generation, and the recent surge in SE-specific PTMs since 2023 Q2.
Contribution
It introduces an SE-oriented classification scheme for PTMs and datasets, applied to Hugging Face, and analyzes their evolution and focus areas over time.
Findings
Code generation is the most common SE task among PTMs and datasets.
Most PTMs and datasets target software development over software management.
There has been a significant increase in SE PTMs since 2023 Q2.
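The quarter-based trend in the last finding depends on bucketing creation dates into calendar quarters. The following is a minimal sketch of that bucketing, not the authors' analysis code: the function names and the sample dates are hypothetical and chosen only for illustration.

```python
from collections import Counter
from datetime import date


def quarter_label(d):
    """Map a date to a label such as '2023 Q2' (Q1 = Jan-Mar, Q2 = Apr-Jun, ...)."""
    return f"{d.year} Q{(d.month - 1) // 3 + 1}"


def count_by_quarter(dates):
    """Count how many dates fall into each calendar quarter."""
    return Counter(quarter_label(d) for d in dates)


# Made-up creation dates, for illustration only (not data from the paper).
sample_dates = [
    date(2023, 1, 15),
    date(2023, 4, 1),
    date(2023, 6, 30),
    date(2023, 7, 1),
]
quarter_counts = count_by_quarter(sample_dates)
```

Sorting the resulting labels gives a chronological series, since the year comes first and the quarter digit second.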
Abstract
Background: Open-Source Pre-Trained Models (PTMs) and datasets provide extensive resources for various Machine Learning (ML) tasks, yet these resources lack a classification tailored to Software Engineering (SE) needs. Aims: We apply an SE-oriented classification to PTMs and datasets on a popular open-source ML repository, Hugging Face (HF), and analyze the evolution of PTMs over time. Method: We conducted a repository mining study. We started with a systematically gathered database of PTMs and datasets from the HF API. Our selection was refined by analyzing model and dataset cards and metadata, such as tags, and confirming SE relevance using Gemini 1.5 Pro. All analyses are replicable, with a publicly accessible replication package. Results: The most common SE task among PTMs and datasets is code generation, with a primary focus on software development and limited attention to software…
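The tag-based refinement step in the Method can be sketched as follows. This is an illustrative approximation under stated assumptions, not the authors' pipeline: the keyword-to-task mapping, the record layout, and the sample records are all hypothetical, and the sketch omits the model-card analysis and the Gemini 1.5 Pro confirmation step.

```python
from collections import Counter

# Hypothetical mapping from SE tasks to tag keywords. The paper's actual
# classification scheme is not reproduced here.
SE_TASK_KEYWORDS = {
    "code generation": {"code-generation", "text-to-code"},
    "code summarization": {"code-summarization"},
    "bug detection": {"bug-detection", "defect-prediction"},
}


def classify_by_tags(tags):
    """Return the sorted SE tasks whose keywords appear among the tags."""
    normalized = {t.strip().lower() for t in tags}
    return sorted(
        task for task, keywords in SE_TASK_KEYWORDS.items() if normalized & keywords
    )


def count_tasks(records):
    """Tally SE tasks over records shaped like {'id': ..., 'tags': [...]}."""
    counts = Counter()
    for record in records:
        counts.update(classify_by_tags(record.get("tags", [])))
    return counts


# Made-up records, for illustration only (not data from the paper).
sample = [
    {"id": "org/model-a", "tags": ["code-generation", "pytorch"]},
    {"id": "org/model-b", "tags": ["Text-To-Code"]},
    {"id": "org/model-c", "tags": ["image-classification"]},
    {"id": "org/model-d", "tags": ["bug-detection", "code-generation"]},
]
task_counts = count_tasks(sample)
```

A record with no matching tag contributes nothing to the tally, which mirrors discarding resources that are not SE-relevant. A record may count toward several tasks when its tags match more than one keyword set.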
Taxonomy
Topics: Distributed and Parallel Computing Systems · Scientific Computing and Data Management · Simulation Techniques and Applications
Methods: Softmax · Attention Is All You Need · Focus
