Increasing LLM Coding Capabilities through Diverse Synthetic Coding Tasks

Amal Abed; Ivan Lukic; Jörg K.H. Franke; Frank Hutter

arXiv:2510.23208·cs.LG·October 28, 2025

TL;DR

This paper introduces a synthetic data pipeline that generates diverse instruction-reasoning-code-test quadruplets to enhance LLM coding abilities, demonstrating improved performance on benchmarks.

Contribution

The authors develop a scalable pipeline for creating large, diverse, reasoning-aware coding datasets, and use it to produce nearly 800k instruction-reasoning-code-test quadruplets.

Findings

01 Fine-tuning on the dataset improves coding benchmark performance.

02 Reasoning-aware data can substitute for model scaling.

03 Outperforms open-source models with the same data budget.

Abstract

Large language models (LLMs) have shown impressive promise in code generation, yet their progress remains limited by the shortage of large-scale datasets that are both diverse and well-aligned with human reasoning. Most existing resources pair problems with solutions, but omit the intermediate thought process that guides coding. To close this gap, we present a scalable synthetic data generation pipeline that produces nearly 800k instruction-reasoning-code-test quadruplets. Each sample combines a task, a step-by-step reasoning trace, a working solution, and executable tests, enabling models to learn not just the what but also the how of problem solving. Our pipeline combines four key components: curated contest problems, web-mined content filtered by relevance classifiers, data expansion guided by reasoning patterns, and multi-stage execution-based validation. A genetic mutation…
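
To make the quadruplet format and the idea of execution-based validation concrete, here is a minimal Python sketch. The `Quadruplet` class, its field names, and the `passes_tests` helper are hypothetical names chosen for illustration; they are not the authors' schema or code, and the paper's multi-stage validation is likely more involved than this single check. The sketch assumes the tests are plain `assert` statements that can be appended to the solution and run as one script.

```python
import subprocess
import sys
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Quadruplet:
    """One training sample (hypothetical field names, not the dataset's schema)."""
    instruction: str  # the task statement
    reasoning: str    # step-by-step reasoning trace
    code: str         # candidate solution
    tests: str        # executable tests, assumed here to be plain asserts


def passes_tests(sample: Quadruplet, timeout_s: float = 10.0) -> bool:
    """Run the solution followed by its tests in a fresh interpreter.

    Returns True only if the script exits with status 0 within the timeout.
    Note: this runs code unsandboxed; a real pipeline would isolate execution.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(sample.code + "\n\n" + sample.tests + "\n", encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0


# Two toy samples: one correct solution and one with a deliberate bug.
good = Quadruplet(
    instruction="Write add(a, b) that returns the sum of two integers.",
    reasoning="Addition is built in, so return a + b directly.",
    code="def add(a, b):\n    return a + b",
    tests="assert add(2, 3) == 5\nassert add(-1, 1) == 0",
)
bad = Quadruplet(
    instruction=good.instruction,
    reasoning="(flawed) Subtract the second argument from the first.",
    code="def add(a, b):\n    return a - b",
    tests=good.tests,
)

# Keep only samples whose solution passes its own tests.
kept = [s for s in (good, bad) if passes_tests(s)]
```

In this sketch, a sample whose tests fail or time out is simply dropped; whether the actual pipeline repairs, regenerates, or discards such samples is not stated in the text above.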

Code & Models

Datasets

AutoML-org/SyntheticCode-800K
dataset · 13 dl