Mastering the Craft of Data Synthesis for CodeLLMs
Meng Chen, Philip Arthur, Qianyu Feng, Cong Duy Vu Hoang, Yu-Heng Hong, Mahdi Kazemi Moghaddam, Omid Nezami, Thien Nguyen, Gioacchino Tangari, Duy Vu, Thanh Vu, Mark Johnson, Krishnaram Kenthapadi, Don Dharmasiri, Long Duong, Yuan-Fang Li

TL;DR
This paper surveys recent data synthesis and filtering techniques for improving code-focused large language models, providing a taxonomy, discussing challenges, and guiding new researchers in the field.
Contribution
It offers a comprehensive taxonomy and analysis of recent advancements in data synthesis techniques for code LLMs, along with practical research guidance.
Findings
Highlights key challenges in data synthesis for code LLMs
Provides a taxonomy of synthesis and filtering techniques
Discusses future research directions
Abstract
Large language models (LLMs) have shown impressive performance in code understanding and generation, making coding tasks a key focus for researchers due to their practical applications and value as a testbed for LLM evaluation. Data synthesis and filtering techniques have been widely adopted and shown to be highly effective in this context. In this paper, we present a focused survey and taxonomy of these techniques, emphasizing recent advancements. We highlight key challenges, explore future research directions, and offer practical guidance for new researchers entering the field.
Taxonomy
Topics: Semantic Web and Ontologies · Engineering and Information Technology · Natural Language Processing Techniques
Methods: Focus
