TL;DR
This systematic review examines how training data quality issues lead to code generation failures in large language models. It proposes a taxonomy and a causal framework, and discusses detection and mitigation techniques.
Contribution
It introduces a unified taxonomy and causal framework linking training data issues to code quality problems in LLMs, and reviews current detection and mitigation strategies.
Findings
Imperfections in training data significantly degrade the quality of code that LLMs generate.
The review identifies a shift from post-generation filtering toward proactive data governance.
The review identifies open challenges and future research directions.
Abstract
Large language models (LLMs) frequently generate defective outputs in code generation tasks, ranging from logical bugs to security vulnerabilities. While these generation failures are often treated as model-level limitations, empirical evidence increasingly traces their root causes to imperfections within the training corpora. Yet, the specific mechanisms linking training data quality issues to generated code quality issues remain largely unmapped. This paper presents a systematic literature review of 114 primary studies to investigate how training data quality issues propagate into code generation. We establish a unified taxonomy that categorizes generated code quality issues across nine dimensions and training data quality issues into code and non-code attributes. Based on this taxonomy, we formalize a causal framework detailing 18 typical propagation mapping mechanisms. Furthermore,…
