How Many Tries Does It Take? Iterative Self-Repair in LLM Code Generation Across Model Scales and Benchmarks
Johin Johny Arimbur

TL;DR
This study demonstrates that iterative self-repair significantly improves code generation accuracy across various large language models and benchmarks, highlighting the effectiveness of prompting strategies and model architectures.
Contribution
It provides the first comprehensive comparison of self-repair across multiple models, architectures, and benchmarks, showing modern instruction-tuned models succeed with prompting alone.
Findings
Self-repair universally improves pass rates by up to 30 percentage points.
Gemini 2.5 Flash achieves the highest final pass rates of 96.3% on HumanEval.
Most repair gains occur within the first two attempts.
Abstract
Large language models frequently fail to produce correct code on their first attempt, yet most benchmarks evaluate them in a single-shot setting. We investigate iterative self-repair (feeding execution errors back to the model for correction) across seven models spanning three families and both open-weight and proprietary providers: Llama 3.1 8B, Llama 3.3 70B, Llama 4 Scout (MoE, 16 experts), Llama 4 Maverick (MoE, 128 experts), Qwen3 32B, Gemini 2.5 Flash, and Gemini 2.5 Pro. On HumanEval (164 problems) and MBPP Sanitized (257 problems) with up to five attempts, self-repair universally improves pass rates: +4.9 to +17.1 pp on HumanEval and +16.0 to +30.0 pp on MBPP. Gemini 2.5 Flash achieves the highest final pass rates (96.3% HumanEval, 93.8% MBPP). Most gains concentrate in the first two rounds.Error-type analysis shows assertion errors (logical mistakes) are the hardest to repair…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
