Mastering the Craft of Data Synthesis for CodeLLMs

Meng Chen; Philip Arthur; Qianyu Feng; Cong Duy Vu Hoang; Yu-Heng Hong; Mahdi Kazemi Moghaddam; Omid Nezami; Thien Nguyen; Gioacchino Tangari; Duy Vu; Thanh Vu; Mark Johnson; Krishnaram Kenthapadi; Don Dharmasiri; Long Duong; Yuan-Fang Li

arXiv:2411.00005·cs.SE·February 10, 2025

TL;DR

This paper surveys recent data synthesis and filtering techniques for improving code-focused large language models, providing a taxonomy, discussing challenges, and guiding new researchers in the field.

Contribution

It offers a comprehensive taxonomy and analysis of recent advancements in data synthesis techniques for code LLMs, along with practical research guidance.

Findings

01 Highlights key challenges in data synthesis for code LLMs

02 Provides a taxonomy of synthesis and filtering techniques

03 Discusses future research directions

Abstract

Large language models (LLMs) have shown impressive performance in code understanding and generation, making coding tasks a key focus for researchers due to their practical applications and value as a testbed for LLM evaluation. Data synthesis and filtering techniques have been widely adopted and shown to be highly effective in this context. In this paper, we present a focused survey and taxonomy of these techniques, emphasizing recent advancements. We highlight key challenges, explore future research directions, and offer practical guidance for new researchers entering the field.

Code & Models

Repositories

chenmengdx/awesome-data-synthesis-for-code-llm
Official

Videos

Mastering the Craft of Data Synthesis for CodeLLMs · underline

Taxonomy

Topics: Semantic Web and Ontologies · Engineering and Information Technology · Natural Language Processing Techniques

Methods: Focus