Case2Code: Scalable Synthetic Data for Code Generation

Yunfan Shao; Linyang Li; Yichuan Ma; Peiji Li; Demin Song; Qinyuan Cheng; Shimin Li; Xiaonan Li; Pengyu Wang; Qipeng Guo; Hang Yan; Xipeng Qiu; Xuanjing Huang; Dahua Lin

arXiv:2407.12504·cs.CL·February 11, 2025

TL;DR

This paper introduces Case2Code, a scalable method for generating synthetic code data through inductive inference, which enhances code LLM training and evaluation by leveraging input-output examples and program execution.

Contribution

Proposes the Case2Code task for scalable synthetic data generation, enabling improved training and evaluation of code LLMs through inductive inference and program execution.

Findings

01. Models trained with Case2Code data outperform baselines on code tasks.

02. Synthetic data improves model generalization to diverse coding scenarios.

03. Case2Code demonstrates the potential of large-scale synthetic data for code LLMs.

Abstract

Large Language Models (LLMs) have shown outstanding breakthroughs in code generation. Recent work improves code LLMs by training on synthetic data generated by powerful LLMs, which can be challenging to scale due to the dependence on a teacher model and high generation costs. In this paper, we focus on synthesizing code data at scale and propose a Case2Code task by exploiting the expressiveness and correctness of programs. Case2Code is an inductive inference task that aims to infer underlying code implementations by observing input-output examples or program behaviors. By incorporating LLMs to generate program inputs, and executing the program with these inputs to obtain the program outputs, we can synthesize diverse and high-quality Case2Code data at scale for training and evaluating code LLMs. Experimental results show that case-to-code induction is…
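The synthesis loop described in the abstract (take a program, propose inputs, execute to get outputs, pair the observed cases with the code) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the helper names (`load_function`, `collect_cases`, `build_sample`), the prompt wording, and the rule of discarding inputs that raise are all hypothetical, and the hard-coded `EXAMPLE_INPUTS` stand in for inputs that the paper obtains from an LLM.

```python
from typing import Any, Callable

# Hypothetical example program; in the paper, programs come from real code corpora.
EXAMPLE_SOURCE = "def floor_div(a, b):\n    return a // b\n"

# Hypothetical stand-in for LLM-proposed inputs (one tuple of arguments per call).
EXAMPLE_INPUTS = [(6, 3), (7, 2), (1, 0)]


def load_function(source: str, name: str) -> Callable[..., Any]:
    """Execute the source text and return the named function.

    Note: exec runs arbitrary code; a real pipeline would need sandboxing.
    """
    namespace: dict = {}
    exec(source, namespace)
    return namespace[name]


def collect_cases(func: Callable[..., Any], candidate_inputs):
    """Run func on each candidate input and keep (args, output) pairs.

    Inputs that raise are skipped (an assumed filtering rule).
    """
    cases = []
    for args in candidate_inputs:
        try:
            output = func(*args)
        except Exception:
            continue
        cases.append((args, output))
    return cases


def build_sample(source: str, name: str, candidate_inputs) -> dict:
    """Pair observed input-output cases (prompt) with the program (target)."""
    func = load_function(source, name)
    cases = collect_cases(func, candidate_inputs)
    lines = [f"Input: {args!r} -> Output: {out!r}" for args, out in cases]
    prompt = "Write a function that matches these examples:\n" + "\n".join(lines)
    return {"prompt": prompt, "target": source, "cases": cases}


if __name__ == "__main__":
    sample = build_sample(EXAMPLE_SOURCE, "floor_div", EXAMPLE_INPUTS)
    print(sample["prompt"])
    print(sample["target"])
```

Because the outputs come from actually running the program, the input-output pairs are correct by construction; the model under training only ever sees the cases and must induce the implementation.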

Code & Models

Repositories

choosewhatulike/case2code (Official)

Taxonomy

Topics: Statistical and Computational Modeling

Methods: Sparse Evolutionary Training · Focus