From Output to Evaluation: Does Raw Instruction-Tuned Code LLMs Output Suffice for Fill-in-the-Middle Code Generation?

Wasi Uddin Ahmad; Somshubra Majumdar; Boris Ginsburg

arXiv:2505.18789·cs.SE·June 11, 2025

TL;DR

This paper examines whether the raw outputs of instruction-tuned code LLMs are sufficient for evaluating fill-in-the-middle (FIM) code generation. It highlights why post-processing is usually needed and reports that supervised fine-tuning improves results on the HumanEval Infilling and SAFIM benchmarks.

Contribution

The study shows that supervised fine-tuning improves LLM performance on FIM tasks and reduces the need for post-processing across multiple programming languages.

Findings

01. Fine-tuning improves code integration in FIM tasks.

02. Post-processing remains necessary for random span generation.

03. Models perform better without post-processing when middle spans are complete lines.

Abstract

Post-processing is crucial for the automatic evaluation of LLMs in fill-in-the-middle (FIM) code generation because raw outputs frequently contain extraneous code. This extraneous generation suggests a lack of awareness of output boundaries, so outputs must be truncated before they can be evaluated effectively. Determining an optimal truncation strategy, however, is often intricate, particularly when the scope includes several programming languages. This study investigates the necessity of post-processing instruction-tuned LLM outputs. Our findings reveal that supervised fine-tuning significantly enhances FIM code generation, enabling LLMs to generate code that seamlessly integrates with the surrounding context. Evaluating our fine-tuned Qwen2.5-Coder (base and instruct) models on the HumanEval Infilling and SAFIM benchmarks demonstrates improved performance without…
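To make the truncation problem concrete, here is a minimal sketch of one possible post-processing heuristic: trimming context that the model echoed at either end of its completion. This is an illustrative assumption, not the truncation strategy used in the paper; the function names `longest_overlap` and `trim_fim_output` and the `min_overlap` threshold are hypothetical.

```python
def longest_overlap(left: str, right: str, min_len: int = 1) -> int:
    """Length of the longest suffix of `left` that is also a prefix of `right`.

    Overlaps shorter than `min_len` are ignored and reported as 0.
    """
    max_len = min(len(left), len(right))
    for k in range(max_len, min_len - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def trim_fim_output(raw: str, prefix: str, suffix: str, min_overlap: int = 4) -> str:
    """Heuristically strip echoed context from a raw FIM completion.

    Hypothetical helper for illustration only. `min_overlap` guards against
    stripping short coincidental matches such as a single space or newline.
    """
    middle = raw

    # If the completion starts by repeating the tail of the prefix, drop it.
    k = longest_overlap(prefix, middle, min_overlap)
    middle = middle[k:]

    # If the completion ends by repeating the head of the suffix, drop it.
    k = longest_overlap(middle, suffix, min_overlap)
    if k:
        middle = middle[:-k]

    return middle


prefix = "def add(a, b):\n    total = "
suffix = "\n    return total\n"

# The model produced the right expression but kept going into the suffix.
overrun = trim_fim_output("a + b\n    return total\n", prefix, suffix)

# The model repeated the end of the prefix before the expression.
echoed = trim_fim_output("    total = a + b", prefix, suffix)

# A clean completion passes through unchanged.
clean = trim_fim_output("a + b", prefix, suffix)
```

In this toy case all three calls yield `a + b`. A heuristic of this kind only handles verbatim echoes; it does not address completions that continue with new, unrelated code, which is part of why a single truncation rule is hard to get right across languages.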

Taxonomy

Topics: Experimental Learning in Engineering · Digital Rights Management and Security · Advanced Data Storage Technologies