Self-Bootstrapping Automated Program Repair: Using LLMs to Generate and Evaluate Synthetic Training Data for Bug Repair

David de-Fitero-Dominguez; Antonio Garcia-Cabot; Eva Garcia-Lopez

arXiv:2505.07372 · cs.SE · March 31, 2026

TL;DR

This paper introduces a self-bootstrapping approach using LLMs to generate and evaluate synthetic training data, significantly improving automated program repair across multiple languages and bug types.

Contribution

It presents a novel two-phase methodology for synthetic data generation and quality assessment, enhancing APR performance with less computational effort.

Findings

01 Synthetic dataset improved Top@1 prediction accuracy by 47%

02 Achieved statistically significant improvements over baseline systems

03 Validated approach across 12 programming languages and 13 bug categories

Abstract

This paper presents a novel methodology for enhancing Automated Program Repair (APR) through synthetic data generation using Large Language Models (LLMs). Current APR systems are constrained by the limited availability of high-quality training data covering diverse bug types across multiple programming languages. The proposed approach addresses this limitation through a two-phase process: synthetic sample generation followed by rigorous quality assessment. Multiple state-of-the-art LLMs were employed to generate approximately 30,000 paired examples of buggy and fixed code across 12 programming languages and 13 bug categories. These samples then underwent cross-model evaluation against five criteria: correctness, code quality, security, performance, and completeness. Experimental evaluation on the VulRepair test set showed statistically significant…
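The quality-assessment phase described in the abstract can be sketched as a filter over generated samples. This is a minimal illustration, not the authors' implementation: the `Sample` fields, the 0 to 10 score scale, the threshold of 7.0, the rule that a model never scores its own output, and the requirement that every criterion clear the threshold are all assumptions made for the example. Only the five criterion names come from the abstract.

```python
from dataclasses import dataclass
from statistics import mean

# The five criteria named in the abstract.
CRITERIA = ("correctness", "code_quality", "security", "performance", "completeness")


@dataclass
class Sample:
    """One synthetic buggy/fixed pair (field names are hypothetical)."""
    language: str
    bug_category: str
    buggy_code: str
    fixed_code: str
    generator: str  # name of the model that produced the sample


def cross_model_scores(sample, evaluators):
    """Average each criterion over every evaluator except the generator.

    `evaluators` maps a model name to a callable that returns a dict of
    criterion -> score. Excluding the generating model is an assumption
    about what "cross-model" means here.
    """
    scores = [fn(sample) for name, fn in evaluators.items() if name != sample.generator]
    if not scores:
        raise ValueError("no independent evaluator available for this sample")
    return {c: mean(s[c] for s in scores) for c in CRITERIA}


def is_kept(sample, evaluators, threshold=7.0):
    """Keep a sample only if every averaged criterion meets the threshold (assumed rule)."""
    averaged = cross_model_scores(sample, evaluators)
    return all(averaged[c] >= threshold for c in CRITERIA)


# Stub evaluators standing in for real LLM calls.
def model_a(sample):
    return {c: 8.0 for c in CRITERIA}


def model_b(sample):
    scores = {c: 9.0 for c in CRITERIA}
    scores["security"] = 5.0  # this stub always flags a security concern
    return scores


evaluators = {"model_a": model_a, "model_b": model_b}

s1 = Sample("python", "off_by_one", "for i in range(n + 1): ...", "for i in range(n): ...", "model_b")
s2 = Sample("java", "null_dereference", "return s.length();", "return s == null ? 0 : s.length();", "model_a")

kept = [s for s in (s1, s2) if is_kept(s, evaluators)]
```

In this toy run, `s1` is scored only by `model_a` and passes, while `s2` is scored only by `model_b` and is dropped because its security score falls below the assumed threshold. A real pipeline would replace the stubs with prompts to the evaluating models and might weight the criteria differently.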
