Cracks in The Stack: Hidden Vulnerabilities and Licensing Risks in LLM Pre-Training Datasets
Mahmoud Jahanshahi, Audris Mockus

TL;DR
This paper introduces an automated method for curating high-quality source code datasets for training large language models, aiming to reduce bugs, vulnerabilities, and licensing issues in AI-generated code.
Contribution
The method uses the version history of open-source projects to identify and filter out problematic code samples, improving dataset quality for safer and more compliant AI code generation.
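The filtering idea described above can be illustrated with a minimal sketch. It assumes each commit is available as a record of old blob hash, new blob hash, and commit message. The `Change` record, the keyword-based `FIX_PATTERN`, and the three labels are hypothetical choices made for illustration only; this summary does not describe how the authors actually detect bug fixes.

```python
import re
from dataclasses import dataclass

# Hypothetical keyword heuristic for spotting bug-fix commits.
# This is an assumption for illustration, not the paper's classifier.
FIX_PATTERN = re.compile(r"\b(fix(es|ed)?|bug|cve-\d{4}-\d+)\b", re.IGNORECASE)


@dataclass(frozen=True)
class Change:
    old_blob: str  # hash of the file version before the commit
    new_blob: str  # hash of the file version after the commit
    message: str   # commit message


def classify_blobs(dataset_blobs, changes):
    """Label each blob 'latest', 'superseded', or 'superseded_by_fix'."""
    labels = {blob: "latest" for blob in dataset_blobs}
    for change in changes:
        if change.old_blob not in labels:
            continue  # the change does not touch a blob in the dataset
        if FIX_PATTERN.search(change.message):
            # A fix label takes priority over a plain 'superseded' label.
            labels[change.old_blob] = "superseded_by_fix"
        elif labels[change.old_blob] == "latest":
            labels[change.old_blob] = "superseded"
    return labels


def curate(dataset_blobs, changes):
    """Keep only blobs that no later commit has replaced."""
    labels = classify_blobs(dataset_blobs, changes)
    return [blob for blob in dataset_blobs if labels[blob] == "latest"]


if __name__ == "__main__":
    history = [
        Change("a1", "a2", "Fix off-by-one in parser"),
        Change("b1", "b2", "Rename variable"),
    ]
    print(curate(["a1", "a2", "b1", "c1"], history))
```

A real pipeline would need a far more reliable fix detector and a way to trace blobs across forks and copies; the sketch only shows where version history enters the filtering decision.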
Findings
17% of the code versions in the dataset have a newer version, and 17% of those newer versions are bug fixes.
2.36% of code fixes address known CVEs.
6,947 CVEs are associated with vulnerable code blobs in the dataset.
Abstract
A critical part of creating code suggestion systems is the pre-training of Large Language Models on vast amounts of source code and natural language text, often of questionable origin or quality. This may contribute to the presence of bugs and vulnerabilities in code generated by LLMs. While efforts to identify bugs at or after code generation exist, it is preferable to pre-train or fine-tune LLMs on curated, high-quality, and compliant datasets. The need for vast amounts of training data necessitates that such curation be automated, minimizing human intervention. We propose an automated source code autocuration technique that leverages the complete version history of open-source software projects to improve the quality of training data. This approach leverages the version history of all OSS projects to identify training data samples that have been modified or have undergone changes…
Taxonomy
Topics: Artificial Intelligence in Law · Law, AI, and Intellectual Property · Artificial Intelligence in Healthcare and Education
