SynthCoder: A Synthetical Strategy to Tune LLMs for Code Completion
Dongjun Yu, Xiao Yan, Zhenrui Li, Jipeng Xiao, Haochuan He, Yongda Yu, Hao Zhang, Guoping Rong, Xiaobo Huang

TL;DR
SynthCoder is a novel approach that combines diverse data augmentation, contextual enrichment, and a two-stage training process to significantly improve code completion performance of LLMs, especially in repository-level scenarios.
Contribution
The paper introduces SynthCoder, which integrates industry practices with a two-stage training method to achieve state-of-the-art code completion results and to address common model issues such as code repetition.
Findings
Outperforms existing models on multiple repository-level benchmarks.
Effectively mitigates code repetition issues in LLMs.
Enhances code completion accuracy with enriched contextual data.
Abstract
Code completion is a prominent application of Large Language Models (LLMs) in software engineering. Because this task requires near real-time responses, base models with small to medium parameter counts are typically employed, supplemented by various optimization and post-training techniques. However, these optimization methods often involve trade-offs, leading to a seesaw effect where performance improvements on certain datasets or metrics are accompanied by degradations on others, sometimes even falling below the baseline model's performance. This paper proposes SynthCoder, a model that integrates leading industry practices to achieve state-of-the-art performance on the Fill-in-the-Middle (FIM) code completion task. Specifically, we first construct a diverse dataset by combining Abstract Syntax Tree (AST) node extraction with heuristics that simulate developer behavior. Then we…
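The abstract names AST node extraction as one source of FIM training samples but does not give the procedure. The following is only a minimal sketch of the general idea, not the authors' pipeline: it uses Python's standard `ast` module to pick nodes and cut the file into prefix, middle, and suffix. The sentinel strings, the prefix-suffix-middle prompt layout, the default node types, and the function names `split_at_node` and `fim_samples` are all assumptions made for illustration.

```python
import ast

# Hypothetical sentinel strings; each real model defines its own FIM tokens.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def split_at_node(source: str, node: ast.AST) -> tuple[str, str, str]:
    """Cut source into (prefix, middle, suffix) around one AST node."""
    # ast column offsets are UTF-8 byte offsets, so slice the encoded bytes.
    data = source.encode("utf-8")
    line_starts = [0]
    for line in data.splitlines(keepends=True):
        line_starts.append(line_starts[-1] + len(line))
    begin = line_starts[node.lineno - 1] + node.col_offset
    end = line_starts[node.end_lineno - 1] + node.end_col_offset
    return (
        data[:begin].decode("utf-8"),
        data[begin:end].decode("utf-8"),
        data[end:].decode("utf-8"),
    )


def fim_samples(source: str, node_types=(ast.Return, ast.Call)) -> list[dict]:
    """Build one FIM sample per AST node of the requested types (assumed choice)."""
    samples = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, node_types):
            prefix, middle, suffix = split_at_node(source, node)
            # Assumed prefix-suffix-middle layout: the model sees both
            # sides of the hole and is trained to emit the middle.
            prompt = f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"
            samples.append({"prompt": prompt, "target": middle})
    return samples


if __name__ == "__main__":
    demo = "def add(a, b):\n    return a + b\n"
    for sample in fim_samples(demo, (ast.Return,)):
        print(repr(sample["prompt"]), "->", repr(sample["target"]))
```

Masking whole syntactic units in this way keeps each target a well-formed code fragment. The heuristics that simulate developer behavior, which the abstract also mentions, are not modeled here.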
Taxonomy
Topics: Software Engineering Research · Software System Performance and Reliability · Software Testing and Debugging Techniques
