Increasing LLM Coding Capabilities through Diverse Synthetic Coding Tasks
Amal Abed, Ivan Lukic, Jörg K.H. Franke, Frank Hutter

TL;DR
This paper introduces a synthetic data pipeline that generates diverse instruction-reasoning-code-test quadruplets and shows that fine-tuning LLMs on them improves performance on coding benchmarks.
Contribution
The authors develop a scalable pipeline for creating large, diverse, reasoning-aware coding datasets, yielding nearly 800k samples for LLM training.
Findings
Fine-tuning on the dataset improves coding benchmark performance.
Reasoning-aware data can substitute for model scaling.
Outperforms open-source models with the same data budget.
Abstract
Large language models (LLMs) have shown impressive promise in code generation, yet their progress remains limited by the shortage of large-scale datasets that are both diverse and well-aligned with human reasoning. Most existing resources pair problems with solutions, but omit the intermediate thought process that guides coding. To close this gap, we present a scalable synthetic data generation pipeline that produces nearly 800k instruction-reasoning-code-test quadruplets. Each sample combines a task, a step-by-step reasoning trace, a working solution, and executable tests, enabling models to learn not just the what but also the how of problem solving. Our pipeline combines four key components: curated contest problems, web-mined content filtered by relevance classifiers, data expansion guided by reasoning patterns, and multi-stage execution-based validation. A genetic mutation…
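The quadruplet format and the idea of execution-based validation can be pictured with a minimal sketch. Everything named here (`Quadruplet`, `passes_tests`, the sample task, the timeout value) is a hypothetical illustration, not the authors' implementation; the abstract does not specify how the multi-stage validation is built, so this shows only a single, simplified stage in which a sample is kept if its solution passes its own tests in a fresh interpreter.

```python
import subprocess
import sys
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Quadruplet:
    """One training sample (field names are illustrative assumptions)."""
    instruction: str  # the task statement
    reasoning: str    # step-by-step reasoning trace
    code: str         # candidate solution
    tests: str        # executable tests, here plain assert statements


def passes_tests(sample: Quadruplet, timeout_s: float = 10.0) -> bool:
    """Run solution plus tests in a separate interpreter.

    Returns True only if the process exits with code 0 within the timeout.
    Note: a subprocess is not a sandbox; real pipelines need isolation.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(sample.code + "\n\n" + sample.tests + "\n", encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False
    return result.returncode == 0


# A made-up sample whose solution satisfies its tests.
good = Quadruplet(
    instruction="Return the sum of a list of integers.",
    reasoning="Iterate once and accumulate; the empty list sums to 0.",
    code="def total(xs):\n    return sum(xs)",
    tests="assert total([1, 2, 3]) == 6\nassert total([]) == 0",
)
# Same task with a deliberately wrong solution.
bad = Quadruplet(good.instruction, good.reasoning, "def total(xs):\n    return 0", good.tests)

# Keep only samples whose code passes its own tests.
kept = [s for s in (good, bad) if passes_tests(s)]
```

Filtering on the exit code keeps the check independent of any particular test framework; a fuller version would likely add sandboxing and resource limits before running generated code.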
