SynthCoder: A Synthetical Strategy to Tune LLMs for Code Completion

Dongjun Yu; Xiao Yan; Zhenrui Li; Jipeng Xiao; Haochuan He; Yongda Yu; Hao Zhang; Guoping Rong; Xiaobo Huang

arXiv:2508.15495·cs.SE·September 18, 2025

Open Access

TL;DR

SynthCoder combines diverse data augmentation, contextual enrichment, and a two-stage training process to improve the code completion performance of LLMs, especially in repository-level scenarios.

Contribution

The paper introduces SynthCoder, which integrates industry practices with a two-stage training method to achieve state-of-the-art code completion results and to mitigate common model issues such as code repetition.

Findings

01 Outperforms existing models on multiple repository-level benchmarks.

02 Effectively mitigates code repetition issues in LLMs.

03 Enhances code completion accuracy with enriched contextual data.

Abstract

Code completion is a prominent application of Large Language Models (LLMs) in software engineering. Because the task requires near real-time responses, small to medium-sized base models are typically employed, supplemented by various optimization and post-training techniques. However, these optimization methods often involve trade-offs, leading to a seesaw effect in which performance improvements on certain datasets or metrics are accompanied by degradations on others, sometimes even falling below the baseline model's performance. This paper proposes SynthCoder, a model that integrates leading industry practices to achieve state-of-the-art performance on the Fill-in-the-Middle (FIM) code completion task. Specifically, we first construct a diverse dataset by combining Abstract Syntax Tree (AST) node extraction with heuristics that simulate developer behavior. Then we…
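To make the AST-based construction of FIM samples concrete, here is a minimal sketch, not the paper's pipeline. It assumes Python source, uses the standard `ast` module, and masks whole `Assign` and `Return` nodes as the "middle" span. The helper names (`fim_samples`, `to_fim_string`), the choice of node types, and the sentinel strings `<PRE>`, `<SUF>`, `<MID>` are hypothetical; real models define their own FIM tokens, and the developer-behavior heuristics mentioned in the abstract are not modeled here.

```python
import ast

# Hypothetical sentinel strings for illustration only.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"


def fim_samples(source, node_types=(ast.Assign, ast.Return)):
    """Yield (prefix, middle, suffix) triples, one per matching AST node."""
    # ast reports column offsets in UTF-8 bytes, so slice the encoded source.
    data = source.encode("utf-8")
    line_starts, total = [], 0
    for line in data.splitlines(keepends=True):
        line_starts.append(total)
        total += len(line)
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, node_types):
            continue
        begin = line_starts[node.lineno - 1] + node.col_offset
        end = line_starts[node.end_lineno - 1] + node.end_col_offset
        yield (
            data[:begin].decode("utf-8"),
            data[begin:end].decode("utf-8"),
            data[end:].decode("utf-8"),
        )


def to_fim_string(prefix, middle, suffix):
    """Serialize one sample in an assumed prefix-suffix-middle layout."""
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"


if __name__ == "__main__":
    code = "def add(a, b):\n    total = a + b\n    return total\n"
    for prefix, middle, suffix in fim_samples(code):
        print(repr(middle))
```

Each triple concatenates back to the original source, so the model's target (the middle) always corresponds to a syntactically complete node rather than an arbitrary character span.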

Taxonomy

Topics: Software Engineering Research · Software System Performance and Reliability · Software Testing and Debugging Techniques