On Inter-dataset Code Duplication and Data Leakage in Large Language Models
José Antonio Hernández López, Boqi Chen, Mootez Saad, Tushar Sharma, Dániel Varró

TL;DR
This paper investigates how code duplication across datasets can inflate the perceived performance of large language models in software engineering tasks, highlighting a potential evaluation bias.
Contribution
It empirically studies inter-dataset code duplication, demonstrating its impact on LLM evaluation and revealing vulnerabilities in current fine-tuning practices.
Findings
Inter-dataset code duplication can inflate LLM performance metrics.
Fine-tuning techniques influence the extent of evaluation bias.
Open-source models are also susceptible to data leakage from dataset overlaps.
Abstract
Motivation. Large language models (LLMs) have exhibited remarkable proficiency in diverse software engineering (SE) tasks. Handling such tasks typically involves acquiring foundational coding knowledge on large, general-purpose datasets during a pre-training phase, and subsequently refining on smaller, task-specific datasets as part of a fine-tuning phase.
Problem statement. While intra-dataset code duplication examines the intersection between the training and test splits within a given dataset and has been addressed in prior research, inter-dataset code duplication, which gauges the overlap between different datasets, remains largely unexplored. If this phenomenon exists, it could compromise the integrity of LLM evaluations because of the inclusion of fine-tuning test samples that were already encountered during pre-training, resulting in inflated performance metrics.…
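To make the idea of inter-dataset overlap concrete, the following is a minimal sketch of how one might estimate the share of a fine-tuning test set that also appears, exactly or nearly, in a pre-training corpus. It is not the method used in the paper, which this summary does not describe. The function names (`tokens`, `jaccard`, `leaked_fraction`), the crude regex tokenizer, and the 0.8 similarity threshold are all assumptions chosen for illustration.

```python
import re


def tokens(code: str) -> frozenset:
    # Crude lexer (an assumption, not a real language parser):
    # identifiers, integer literals, and single punctuation characters.
    return frozenset(re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code))


def jaccard(a: frozenset, b: frozenset) -> float:
    # Jaccard similarity of two token sets; two empty sets count as identical.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def leaked_fraction(pretrain, test, threshold=0.8):
    # Fraction of test samples that have at least one near-duplicate
    # (similarity >= threshold) in the pre-training corpus.
    # The threshold value is hypothetical.
    pre = [tokens(sample) for sample in pretrain]
    hits = sum(
        1
        for sample in test
        if any(jaccard(tokens(sample), p) >= threshold for p in pre)
    )
    return hits / len(test) if test else 0.0


pretrain_corpus = ["def add(a, b):\n    return a + b"]
test_split = ["def add(a, b):\n    return a + b", "print('hello')"]
print(leaked_fraction(pretrain_corpus, test_split))  # 0.5
```

The pairwise comparison is quadratic in the number of samples, so on corpora of realistic size it would need to be replaced by an approximate scheme such as hashing-based near-duplicate detection.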
Taxonomy
Topics: Topic Modeling · Software Engineering Research · Natural Language Processing Techniques
