A Methodological Framework for LLM-Based Mining of Software Repositories

Vincenzo De Martino; Joel Castaño; Fabio Palomba; Xavier Franch; Silverio Martínez-Fernández

arXiv:2508.02233·cs.SE·August 12, 2025

Open Access

TL;DR

This paper introduces PRIMES 2.0, a comprehensive framework for integrating Large Language Models into Mining Software Repositories, enhancing methodological rigor, transparency, and reproducibility in this emerging research area.

Contribution

The paper identifies 15 methodological approaches, 9 main threats to empirical rigor, and 25 mitigation strategies, and presents a structured empirical framework for LLM-based MSR research.

Findings

01: 15 methodological approaches identified
02: 9 main threats to empirical rigor
03: 25 mitigation strategies proposed

Abstract

Large Language Models (LLMs) are increasingly used in software engineering research, offering new opportunities for automating repository mining tasks. However, despite their growing popularity, the methodological integration of LLMs into Mining Software Repositories (MSR) remains poorly understood. Existing studies tend to focus on specific capabilities or performance benchmarks, providing limited insight into how researchers utilize LLMs across the full research pipeline. To address this gap, we conduct a mixed-method study that combines a rapid review and questionnaire survey in the field of LLM4MSR. We investigate (1) the approaches and (2) the threats that affect the empirical rigor of researchers involved in this field. Our findings reveal 15 methodological approaches, nine main threats, and 25 mitigation strategies. Building on these findings, we present PRIMES 2.0, a refined…

Taxonomy

Topics: Software Engineering Research · Software Engineering Techniques and Practices · Scientific Computing and Data Management