SEMODS: A Validated Dataset of Open-Source Software Engineering Models
Alexandra González, Xavier Franch, Silverio Martínez-Fernández

TL;DR
SEMODS is a curated, validated dataset of over 3,400 open-source models tailored for Software Engineering tasks, enabling better discovery, benchmarking, and analysis of AI models in SE.
Contribution
This paper introduces SEMODS, the first comprehensive, validated dataset of SE models from Hugging Face, linking models to SE tasks with standardized evaluation data.
Findings
Dataset contains 3,427 models with task annotations
Links models to software development lifecycle activities
Supports multiple applications like benchmarking and model discovery
Abstract
Integrating Artificial Intelligence into Software Engineering (SE) requires having a curated collection of models suited to SE tasks. With millions of models hosted on Hugging Face (HF) and new ones continuously being created, it is infeasible to identify SE models without a dedicated catalogue. To address this gap, we present SEMODS: an SE-focused dataset of 3,427 models extracted from HF, combining automated collection with rigorous validation through manual annotation and large language model assistance. Our dataset links models to SE tasks and activities from the software development lifecycle, offering a standardized representation of their evaluation results, and supporting multiple applications such as data analysis, model discovery, benchmarking, and model adaptation.
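To make the idea of linking models to SE tasks, lifecycle activities, and standardized evaluation results more concrete, here is a minimal sketch in Python. It does not reproduce the actual SEMODS schema: the class names, field names, helper functions, and all records and scores below are hypothetical and were made up for illustration. The sketch also assumes that a higher metric value is better.

```python
from dataclasses import dataclass, field


@dataclass
class EvalResult:
    # Hypothetical standardized evaluation entry: benchmark, metric, value.
    benchmark: str
    metric: str
    value: float


@dataclass
class ModelRecord:
    # Hypothetical catalogue entry; field names are illustrative only
    # and are not taken from the SEMODS dataset.
    model_id: str
    se_tasks: list[str]
    sdlc_activities: list[str]
    evaluations: list[EvalResult] = field(default_factory=list)


def find_models(records, task):
    """Return the ids of records annotated with the given SE task."""
    return [r.model_id for r in records if task in r.se_tasks]


def best_on(records, benchmark, metric):
    """Return (model_id, value) with the highest value, or None if no match.

    Assumes higher is better for the chosen metric.
    """
    scored = [
        (r.model_id, e.value)
        for r in records
        for e in r.evaluations
        if e.benchmark == benchmark and e.metric == metric
    ]
    return max(scored, key=lambda pair: pair[1]) if scored else None


# Made-up records and scores, not taken from SEMODS.
records = [
    ModelRecord("org-a/model-x", ["code generation"], ["implementation"],
                [EvalResult("bench-1", "pass@1", 0.31)]),
    ModelRecord("org-b/model-y", ["code generation", "code summarization"],
                ["implementation", "maintenance"],
                [EvalResult("bench-1", "pass@1", 0.42)]),
    ModelRecord("org-c/model-z", ["test generation"], ["testing"]),
]

code_gen_models = find_models(records, "code generation")
top_model = best_on(records, "bench-1", "pass@1")
```

With a representation of this kind, model discovery reduces to filtering on task annotations, and benchmarking reduces to comparing entries that share the same benchmark and metric.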
Taxonomy
Topics: Software Engineering Research · Software Engineering Techniques and Practices · Model-Driven Software Engineering Techniques
