UnitCoder: Scalable Iterative Code Synthesis with Unit Test Guidance
Yichuan Ma, Yunfan Shao, Peiji Li, Demin Song, Qipeng Guo, Linyang Li, Xipeng Qiu, Kai Chen

TL;DR
UnitCoder introduces a scalable pipeline that uses model-generated unit tests to guide and validate code synthesis, significantly improving the quality and diversity of training data for large language models in code generation tasks.
Contribution
It presents a novel approach leveraging unit tests for guiding and validating code synthesis, enhancing data quality for training large language models.
Findings
Models fine-tuned on synthetic data outperform baselines.
Significant success rate improvements on Python benchmarks.
Generated dataset contains over 500K verifiable programs.
Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet code generation remains a major challenge. Current approaches for obtaining high-quality code data primarily focus on (i) collecting large-scale pre-training data and (ii) synthesizing instruction data through prompt engineering with powerful models. While pre-training data faces quality consistency issues, instruction-based synthesis suffers from limited instruction diversity and inherent biases of LLMs. To address this gap, we introduce UnitCoder, a systematic pipeline leveraging model-generated unit tests to both guide and validate the code generation process. Combined with large-scale package-based retrieval from pre-training corpus, we generate a dataset of 500K+ verifiable programs containing diverse API calls. Evaluations on multiple Python benchmarks (BigCodeBench, HumanEval, MBPP)…
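The idea of using unit tests to both guide and validate generated code can be illustrated with a minimal sketch. This is not the authors' implementation: the names `passes_tests`, `refine`, and `toy_fix` are hypothetical, and `toy_fix` is a stand-in for a model call that would revise a program given test feedback. The sketch assumes tests are written as `unittest.TestCase` classes and that candidate code is trusted enough to run in-process (a real pipeline would sandbox execution).

```python
import unittest
from typing import Callable, Optional, Tuple


def passes_tests(candidate_src: str, test_src: str) -> Tuple[bool, str]:
    """Run the unit tests in test_src against candidate_src.

    Returns (True, "") when every test passes, otherwise (False, feedback).
    Assumption: both sources are safe to exec in this process.
    """
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # define the candidate functions
        exec(test_src, namespace)       # define the TestCase classes
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"

    loader = unittest.TestLoader()
    suite = unittest.TestSuite()
    for obj in list(namespace.values()):
        if (isinstance(obj, type)
                and issubclass(obj, unittest.TestCase)
                and obj is not unittest.TestCase):
            suite.addTests(loader.loadTestsFromTestCase(obj))

    if suite.countTestCases() == 0:
        return False, "no tests found"

    result = unittest.TestResult()
    suite.run(result)
    if result.wasSuccessful():
        return True, ""
    problems = result.failures + result.errors
    return False, problems[0][1]  # traceback text of the first problem


def refine(candidate_src: str,
           test_src: str,
           fix: Callable[[str, str], str],
           max_rounds: int = 3) -> Optional[str]:
    """Revise a candidate until it passes the tests, or give up.

    `fix` is a hypothetical hook (for example, a model call) that takes the
    current source and the test feedback and returns a revised source.
    """
    ok, feedback = passes_tests(candidate_src, test_src)
    rounds = 0
    while not ok and rounds < max_rounds:
        candidate_src = fix(candidate_src, feedback)
        ok, feedback = passes_tests(candidate_src, test_src)
        rounds += 1
    return candidate_src if ok else None


# Toy inputs: a buggy candidate, its tests, and a hard-coded "fixer".
BUGGY = "def add(a, b):\n    return a - b\n"
TESTS = (
    "import unittest\n"
    "class TestAdd(unittest.TestCase):\n"
    "    def test_add(self):\n"
    "        self.assertEqual(add(2, 3), 5)\n"
)


def toy_fix(src: str, feedback: str) -> str:
    # Stand-in for a model revision step; ignores the feedback text.
    return src.replace("a - b", "a + b")


verified = refine(BUGGY, TESTS, toy_fix)
```

Only programs for which `refine` returns a source string would be kept as verified training examples in a setup like this; candidates that never pass are discarded.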
Taxonomy
Topics: Embedded Systems Design Techniques · Real-time simulation and control systems · Parallel Computing and Optimization Techniques
