SnipGen: A Mining Repository Framework for Evaluating LLMs for Code
Daniel Rodriguez-Cardenas, Alejandro Velasco, Denys Poshyvanyk

TL;DR
SnipGen is a framework that mines GitHub data to create robust testbeds for evaluating large language models' code generation capabilities, addressing data contamination issues in software engineering research.
Contribution
It introduces a novel mining framework and dataset, together with prompt engineering techniques, that enable more accurate and nuanced evaluation of LLMs on code tasks.
Findings
Mined approximately 227K data points from GitHub commits.
Developed prompt templates for nuanced LLM assessment.
Provided a dataset and methodology for rigorous evaluation.
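As a rough illustration of how mined commits and prompt templates could fit together, the sketch below filters data points by commit date and fills a template for a code-completion prompt. Everything named here is an assumption made for illustration: the `DataPoint` fields, the date-cutoff filter as a contamination heuristic, the template wording, and the sample records are hypothetical and are not taken from SnipGen itself.

```python
from dataclasses import dataclass
from datetime import date
from string import Template


# Hypothetical record shape; field names are illustrative, not SnipGen's schema.
@dataclass
class DataPoint:
    commit_date: date
    signature: str
    docstring: str


# Hypothetical prompt template for a code-completion task.
COMPLETION_PROMPT = Template(
    "Complete the following Python function.\n"
    "Description: $docstring\n"
    "$signature\n"
)


def after_cutoff(points, cutoff):
    """Keep only data points committed strictly after an assumed training cutoff."""
    return [p for p in points if p.commit_date > cutoff]


def build_prompt(point):
    """Fill the template with one data point's docstring and signature."""
    return COMPLETION_PROMPT.substitute(
        docstring=point.docstring, signature=point.signature
    )


# Made-up sample records, used only to exercise the sketch.
points = [
    DataPoint(date(2021, 5, 1), "def add(a, b):", "Return the sum of a and b."),
    DataPoint(date(2023, 3, 15), "def area(r):", "Return the area of a circle of radius r."),
]

fresh = after_cutoff(points, cutoff=date(2022, 1, 1))
prompts = [build_prompt(p) for p in fresh]
```

Filtering on commit date is only one possible heuristic for reducing overlap with a model's training data; the cutoff would need to be chosen per model under evaluation.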
Abstract
Large Language Models (LLMs), such as transformer-based neural networks with billions of parameters, have become increasingly prevalent in software engineering (SE). These models, trained on extensive datasets that include code repositories, exhibit remarkable capabilities for SE tasks. However, evaluating their effectiveness poses significant challenges, primarily due to the potential overlap between the datasets used for training and those employed for evaluation. To address this issue, we introduce SnipGen, a comprehensive repository mining framework designed to leverage prompt engineering across various downstream tasks for code generation. SnipGen aims to mitigate data contamination by generating robust testbeds and crafting tailored data points to assist researchers and practitioners in evaluating LLMs for code-related tasks. In our exploratory study, SnipGen mined approximately…
Taxonomy
Topics: Digital Rights Management and Security
