Structure-Aware Fill-in-the-Middle Pretraining for Code

Linyuan Gong; Alvin Cheung; Mostafa Elhoushi; Sida Wang

arXiv:2506.00204·cs.CL·June 3, 2025

TL;DR

This paper introduces AST-FIM, a novel pretraining method for code LLMs that uses ASTs to mask syntactic structures, improving fill-in-the-middle tasks and aligning better with real-world code editing patterns.

Contribution

AST-FIM leverages Abstract Syntax Trees for more coherent code pretraining, and the paper presents a new benchmark, Real-FIM-Eval, for evaluating fill-in-the-middle performance on real-world code edits.

Findings

01. In experiments on 1B and 8B parameter models, AST-FIM outperforms standard random-character FIM by up to 5 points on infilling benchmarks.

02. Models trained with AST-FIM are particularly strong on real-world code editing tasks.

03. The approach benefits large-scale code models across multiple programming languages.

Abstract

Fill-in-the-Middle (FIM) is a common pretraining method for code LLMs, where models complete code segments given surrounding context. However, existing LLMs treat code as plain text and mask random character spans. We propose and evaluate AST-FIM, a pretraining strategy that leverages Abstract Syntax Trees (ASTs) to mask complete syntactic structures at scale, ensuring coherent training examples better aligned with universal code structures and common code editing patterns such as blocks, expressions, or functions. To evaluate real-world fill-in-the-middle (FIM) programming tasks, we introduce Real-FIM-Eval, a benchmark derived from 30,000+ GitHub commits across 12 languages. On infilling tasks, experiments on 1B and 8B parameter models show that AST-FIM is particularly beneficial for real-world code editing as it outperforms standard random-character FIM by up to 5 pts on standard FIM…
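The contrast the abstract draws between random-character masking and AST-based masking can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes Python's built-in `ast` module as the parser, uniform sampling over statement and expression nodes, and hypothetical sentinel strings (`<PRE>`, `<SUF>`, `<MID>`) in a prefix-suffix-middle layout. None of those details are taken from the paper.

```python
import ast
import random

# Hypothetical sentinel strings; real tokenizers define their own FIM tokens.
FIM_PRE, FIM_SUF, FIM_MID = "<PRE>", "<SUF>", "<MID>"


def candidate_spans(source: str):
    """Return sorted (start, end) byte spans of every statement and expression."""
    data = source.encode("utf-8")
    # ast reports column offsets in UTF-8 bytes, so line starts are in bytes too.
    line_starts = [0]
    for line in data.splitlines(keepends=True):
        line_starts.append(line_starts[-1] + len(line))
    spans = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.stmt, ast.expr)) and node.end_lineno is not None:
            start = line_starts[node.lineno - 1] + node.col_offset
            end = line_starts[node.end_lineno - 1] + node.end_col_offset
            spans.add((start, end))
    return sorted(spans)


def ast_fim_example(source: str, rng: random.Random):
    """Mask one complete syntactic unit (assumption: uniform choice over nodes)."""
    data = source.encode("utf-8")
    start, end = rng.choice(candidate_spans(source))
    return data[:start].decode(), data[start:end].decode(), data[end:].decode()


def random_char_fim_example(source: str, rng: random.Random):
    """Baseline: mask an arbitrary character span, ignoring syntax."""
    i, j = sorted(rng.sample(range(len(source) + 1), 2))
    return source[:i], source[i:j], source[j:]


def to_psm(prefix: str, middle: str, suffix: str) -> str:
    """Lay out one training example in prefix-suffix-middle order."""
    return f"{FIM_PRE}{prefix}{FIM_SUF}{suffix}{FIM_MID}{middle}"


if __name__ == "__main__":
    code = "def add(a, b):\n    return a + b\n"
    rng = random.Random(0)
    print(to_psm(*ast_fim_example(code, rng)))
    print(to_psm(*random_char_fim_example(code, rng)))
```

In this sketch the AST variant always masks a whole unit such as `a + b` or `return a + b`, whereas the random-character baseline can cut through the middle of an identifier or keyword. A multi-language pipeline would need a parser that covers every target language, which the `ast` module does not.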

Code & Models

Datasets

gonglinyuan/real_fim_eval
dataset · 80 downloads

Taxonomy

Topics: Embedded Systems Design Techniques · Parallel Computing and Optimization Techniques · Software Testing and Debugging Techniques