Case2Code: Scalable Synthetic Data for Code Generation
Yunfan Shao, Linyang Li, Yichuan Ma, Peiji Li, Demin Song, Qinyuan Cheng, Shimin Li, Xiaonan Li, Pengyu Wang, Qipeng Guo, Hang Yan, Xipeng Qiu, Xuanjing Huang, Dahua Lin

TL;DR
This paper introduces Case2Code, a scalable method for generating synthetic code data through inductive inference, which enhances code LLM training and evaluation by leveraging input-output examples and program execution.
Contribution
Proposes the Case2Code task for scalable synthetic data generation, enabling improved training and evaluation of code LLMs through inductive inference and program execution.
Findings
Models trained with Case2Code data outperform baselines on code tasks.
Synthetic data improves model generalization to diverse coding scenarios.
Case2Code demonstrates the potential of large-scale synthetic data for code LLMs.
Abstract
Large Language Models (LLMs) have achieved notable breakthroughs in code generation. Recent work improves code LLMs by training on synthetic data generated by powerful LLMs, an approach that is hard to scale because it depends on a teacher model and incurs high generation costs. In this paper, we focus on synthesizing code data at scale and propose the Case2Code task, which exploits the expressiveness and correctness of programs. Case2Code is an inductive inference task that aims to infer underlying code implementations by observing input-output examples or program behaviors. By using LLMs to generate program inputs, and executing the program with these inputs to obtain the program outputs, we can synthesize diverse and high-quality Case2Code data at scale for training and evaluating code LLMs. Experimental results show that case-to-code induction is…
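The synthesis loop described in the abstract (generate inputs, execute the program, keep the resulting input-output pairs) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the helper names `collect_cases`, `build_prompt`, and `check_candidate` are hypothetical, the candidate inputs are written by hand where the paper uses an LLM to propose them, and skipping inputs that raise an exception is an assumption made for simplicity.

```python
from typing import Any, Callable, Dict, List, Tuple

Case = Tuple[Dict[str, Any], Any]


def collect_cases(func: Callable[..., Any],
                  candidate_inputs: List[Dict[str, Any]]) -> List[Case]:
    """Execute func on each candidate input and record (input, output) pairs.

    Assumption: inputs that make the function raise are dropped.
    """
    cases: List[Case] = []
    for kwargs in candidate_inputs:
        try:
            output = func(**kwargs)
        except Exception:
            continue  # invalid input for this function, skip it
        cases.append((kwargs, output))
    return cases


def build_prompt(cases: List[Case]) -> str:
    """Format the observed cases as an induction prompt (hypothetical layout)."""
    lines = ["Write a function that satisfies these examples:"]
    for kwargs, output in cases:
        args = ", ".join(f"{name}={value!r}" for name, value in kwargs.items())
        lines.append(f"{args} -> {output!r}")
    return "\n".join(lines)


def check_candidate(candidate: Callable[..., Any], cases: List[Case]) -> bool:
    """Return True if the candidate reproduces every recorded output."""
    for kwargs, expected in cases:
        try:
            if candidate(**kwargs) != expected:
                return False
        except Exception:
            return False
    return True


# Ground-truth program whose behavior the model should induce.
def clamp(x, lo, hi):
    return max(lo, min(x, hi))


# Hand-written stand-in for LLM-generated inputs; the last one is invalid.
candidate_inputs = [
    {"x": 5, "lo": 0, "hi": 10},
    {"x": -3, "lo": 0, "hi": 10},
    {"x": 42, "lo": 0, "hi": 10},
    {"x": "a", "lo": 0, "hi": 10},
]

cases = collect_cases(clamp, candidate_inputs)
prompt = build_prompt(cases)
print(prompt)
```

In this sketch the prompt would be paired with the source of `clamp` as a training sample, and `check_candidate` shows how the same cases could score a model-written implementation by execution rather than by text match.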
Taxonomy
Topics: Statistical and Computational Modeling
Methods: Sparse Evolutionary Training · Focus
