X-Coder: Advancing Competitive Programming with Fully Synthetic Tasks, Solutions, and Tests
Jie Wu, Haoling Li, Xin Zhang, Jiani Guo, Jane Luo, Steven Liu, Yangyu Huang, Ruihang Chu, Scarlett Li, Yujiu Yang

TL;DR
This paper introduces X-Coder, a synthetic data-driven approach that significantly improves competitive programming performance of code language models by using high-quality, domain-specific synthetic tasks, solutions, and tests.
Contribution
The paper develops a novel feature-based synthetic data generation method with domain-specific evolution and dual-verification, enabling training of competitive models without real-world data.
Findings
X-Coder-7B outperforms larger (14B) models trained on real data.
Synthetic data quality critically impacts model performance.
Domain-specific evolution enhances synthetic task solvability.
Abstract
Competitive programming poses a significant challenge for Code LLMs. While recent models have shown promise, they heavily rely on finite real-world data, raising concerns about scalability and contamination. In this paper, we investigate a critical question: Can we elevate models to expert-level reasoning performance using fully synthetic data? In response, we first observe that off-the-shelf synthesis methods yield suboptimal results in this domain. To address this, we systematically investigate the key factors governing synthetic data quality. Leveraging these findings, we significantly advance the feature-based synthesis paradigm via domain-specific evolution and a dual-verification strategy, promoting task solvability, solution correctness, and test accuracy. Using this high-quality synthetic data, we train the X-Coder model series under an SFT-then-RL paradigm. X-Coder-7B shows…
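The dual-verification idea mentioned in the abstract can be illustrated with a small sketch. This is one plausible reading, not the authors' implementation: it assumes that expected test outputs are labeled by majority vote across candidate solutions, and that the "golden" solution is then the candidate with the highest pass rate on those voted tests. The names `run_safely`, `label_tests_by_vote`, `select_golden`, and the `min_agreement` threshold are hypothetical and do not come from the paper.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical stand-in for a candidate program: maps raw test input to output text.
Solution = Callable[[str], str]


def run_safely(solution: Solution, test_input: str) -> Optional[str]:
    """Run one candidate; any exception counts as 'no answer'."""
    try:
        return solution(test_input).strip()
    except Exception:
        return None


def label_tests_by_vote(
    solutions: List[Solution],
    test_inputs: List[str],
    min_agreement: float = 0.5,  # assumed threshold, not taken from the paper
) -> Dict[str, str]:
    """Step 1 (assumed): keep a test only if a strict majority of candidates agree."""
    labeled: Dict[str, str] = {}
    for test_input in test_inputs:
        outputs = [run_safely(s, test_input) for s in solutions]
        answers = [o for o in outputs if o is not None]
        if not answers:
            continue  # every candidate crashed, so the input is discarded
        answer, votes = Counter(answers).most_common(1)[0]
        if votes / len(solutions) > min_agreement:
            labeled[test_input] = answer
    return labeled


def select_golden(
    solutions: List[Solution], labeled_tests: Dict[str, str]
) -> Tuple[Optional[int], float]:
    """Step 2 (assumed): pick the candidate with the best pass rate on voted tests."""
    if not labeled_tests:
        return None, 0.0
    best_index: Optional[int] = None
    best_rate = -1.0
    for index, solution in enumerate(solutions):
        passed = sum(
            run_safely(solution, test_input) == expected
            for test_input, expected in labeled_tests.items()
        )
        rate = passed / len(labeled_tests)
        if rate > best_rate:  # strict comparison keeps the earliest best candidate
            best_index, best_rate = index, rate
    return best_index, best_rate
```

In this sketch the tests are checked against the solutions and the solutions are then checked against the tests, which is why a task with no majority-agreed test yields no golden solution at all. A real pipeline would also need sandboxed execution and time limits, which are omitted here.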
Peer Reviews
Decision: Submitted to ICLR 2026
- The paper presents a complete pipeline covering all aspects of synthetic data generation, from tasks to solutions to test cases, with thoughtful verification strategies.
- X-Coder achieves impressive performance, outperforming larger 14B models despite having only 7B parameters, demonstrating the effectiveness of the synthetic approach.
- Extensive ablations examining verification, CoT length, task styles, data selection, and test generation methods. Detailed analysis of scaling laws showing w
- Unclear how many tasks have incorrect "golden" solutions despite verification. The dual-verification strategy's actual error rate is not quantified.
- Strong reliance on EpiCoder's feature-based framework. Significant performance gains may come from using stronger teacher models (GPT-o3-mini, DeepSeek-R1-0528) rather than from methodological improvements.
- Table 4 shows that SFT or RL increases the number of no-code solutions relative to the base model. Is there a reason why this happens?
- The authors describe the stages (feature-tree evolution, problem formulation, solution/test synthesis, golden-selection) in a clear and detailed manner.
- A comprehensive set of experiments was conducted on model training and performance.
- Insightful experiment design around problem style: For example, the authors examine how the style in which a problem is specified (e.g., “competitive” vs. “educational” style) affects model learning and how picking problems whose solutions require longer re
**Overall**
- The idea of using concepts from seed problems and evolving them into a larger bank of problems is not entirely novel in the domain of synthetic code-data generation. The authors do not sufficiently situate their work in relation to recent pipelines such as SelfCodeAlign (Wei et al., 2024a) or CodeEvo (Sun et al., 2025). A clear performance comparison (or a comparison of final model accuracy and generation cost) with those pipelines is missing.
**Programming-Problem Generation
The work is well executed (in particular the various prompt pipelines), and the results look strong. The main novelty of the paper is the synthetic code pipeline: the authors detail the prompts used in the Appendix to show how the features (algorithms and data structures) are extracted, evolved, and used to generate a problem statement. Additional experiments show the scaling effect of the dataset. Some extra ablations are also conducted, such as the tool-based generation vs prompt-based generati
The manuscript does not claim algorithmic novelty (which is not inherently an issue); its major contribution is the synthetic data pipeline and the resulting artifact. I do not recommend acceptance because I am not fully convinced of the resulting dataset's superiority or of the clarity of the data construction process. 1. **Details of the pipeline stats**: Could the authors give details about the stats of the dataset in each step (not just the final token stats for the final d
Taxonomy
Topics: Software Engineering Research · Software Testing and Debugging Techniques · Evolutionary Algorithms and Applications
