Empirical Evaluation of Large Language Models in Automated Program Repair

Jiajun Sun; Fengjie Li; Xinzhu Qi; Hongyu Zhang; Jiajun Jiang

arXiv:2506.13186·cs.SE·June 17, 2025

Open Access

TL;DR

This study empirically evaluates large language models for automated program repair across multiple languages and scenarios, revealing that model specialization and prompt design significantly influence repair effectiveness.

Contribution

The paper provides a comprehensive analysis of modern large-scale LLMs in APR, highlighting how model size, specialization, and prompting strategy affect repair performance.

Findings

01 Model specialization can outperform larger general models.

02 Repair performance does not increase linearly with model size.

03 Correct patches often appear early in generation.
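One way to make the third finding concrete is to measure, for each bug, how many patches had to be generated before the first correct one appeared. The sketch below is illustrative only and is not the paper's evaluation code: the function names, the per-bug list-of-booleans representation, and the toy data are all assumptions made for this example.

```python
from typing import Dict, List, Optional


def first_correct_rank(patch_results: List[bool]) -> Optional[int]:
    """Return the 1-based position of the first correct patch, or None if no patch is correct."""
    for rank, is_correct in enumerate(patch_results, start=1):
        if is_correct:
            return rank
    return None


def fixed_within_k(results: Dict[str, List[bool]], k: int) -> float:
    """Fraction of bugs that have at least one correct patch among the first k generated."""
    if not results:
        return 0.0
    fixed = sum(1 for patches in results.values() if any(patches[:k]))
    return fixed / len(results)


# Hypothetical toy data: one list per bug, in generation order,
# where True marks a patch judged correct.
toy_results = {
    "bug-a": [False, True, False, False],
    "bug-b": [True, False, False, False],
    "bug-c": [False, False, False, False],
    "bug-d": [False, False, False, True],
}

# Cumulative share of bugs fixed as the patch budget grows.
curve = {k: fixed_within_k(toy_results, k) for k in (1, 2, 4)}
```

If correct patches tend to appear early, a curve like this rises quickly at small k and flattens afterwards, which suggests that a smaller generation budget may recover most of the fixes.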

Abstract

The increasing prevalence of software bugs has made automated program repair (APR) a key research focus. Large language models (LLMs) offer new opportunities for APR, but existing studies mostly rely on smaller, earlier-generation models and Java benchmarks. The repair capabilities of modern, large-scale LLMs across diverse languages and scenarios remain underexplored. To address this, we conduct a comprehensive empirical study of four open-source LLMs (CodeLlama, LLaMA, StarCoder, and DeepSeek-Coder) spanning 7B to 33B parameters, diverse architectures, and purposes. We evaluate them across two bug scenarios (enterprise-grade and algorithmic), three languages (Java, C/C++, Python), and four prompting strategies, analyzing over 600K generated patches on six benchmarks. Key findings include: (1) model specialization (e.g., CodeLlama) can outperform larger general-purpose models (e.g.,…

Taxonomy

Topics: Scientific Computing and Data Management · Software System Performance and Reliability · Advanced Data Storage Technologies