Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code

Kaifeng He; Xiaojun Zhang; Peiliang Cai; Mingwei Liu; Yanlin Wang; Chong Wang; Kaifeng Huang; Bihuan Chen; Xin Peng; and Zibin Zheng

arXiv:2605.05267 · cs.SE · May 8, 2026

TL;DR

This systematic review examines how training data quality issues lead to code generation failures in large language models. It proposes a taxonomy and a causal framework, and discusses detection and mitigation techniques.

Contribution

The paper introduces a unified taxonomy and a causal framework that link training data quality issues to quality problems in LLM-generated code, and it reviews current detection and mitigation strategies.

Findings

01 Training data imperfections significantly affect the quality of code that LLMs generate.

02 The review highlights a shift from post-generation filtering to proactive data governance.

03 The review identifies open challenges and future research directions.

Abstract

Large language models (LLMs) frequently generate defective outputs in code generation tasks, ranging from logical bugs to security vulnerabilities. While these generation failures are often treated as model-level limitations, empirical evidence increasingly traces their root causes to imperfections within the training corpora. Yet, the specific mechanisms linking training data quality issues to generated code quality issues remain largely unmapped. This paper presents a systematic literature review of 114 primary studies to investigate how training data quality issues propagate into code generation. We establish a unified taxonomy that categorizes generated code quality issues across nine dimensions and training data quality issues into code and non-code attributes. Based on this taxonomy, we formalize a causal framework detailing 18 typical propagation mapping mechanisms. Furthermore,…

Code & Models

Repositories

SYSUSELab/From-Data-to-Code (GitHub)
