HEJ-Robust: A Robustness Benchmark for LLM-Based Automated Program Repair

Fazle Rabbi; Jinqiu Yang

arXiv:2605.02215 · cs.SE · May 11, 2026

TL;DR

This paper introduces HEJ-Robust, a benchmark that tests the robustness of LLM-based program repair models to semantics-preserving syntactic variations. Performance drops by over 50% under several of the transformations.

Contribution

HEJ-Robust, a benchmark that applies eight semantics-preserving code transformations to HumanEval-Java-Bug, yielding 1,450 transformed instances for evaluating the robustness of LLM-based repair models.

Findings

01

Performance of the five fine-tuned LLMs evaluated drops by over 50% under several transformations.

02

Current models lack robustness to minor syntactic variations.

03

Benchmark exposes robustness gaps in LLM-based program repair.

Abstract

Recent Large Language Models (LLMs) have shown strong performance on automated program repair across standard benchmarks. However, these benchmarks evaluate models on a single canonical form of buggy code and do not reflect the syntactic variations commonly observed in real-world software, leaving robustness largely unexamined. In this work, we construct HEJ-Robust, a robustness benchmark built from HumanEval-Java-Bug using eight semantics-preserving code transformations, resulting in 1,450 transformed instances. We evaluate five fine-tuned LLMs on this benchmark and show that model performance drops by over 50% under several transformations, indicating that current LLM-based repair models lack robustness to minor syntactic variations.
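To make "semantics-preserving code transformation" concrete, here is a minimal sketch in Java. The abstract does not list the eight transformations, so the two shown here (for-to-while conversion and identifier renaming) are assumptions chosen for illustration, and the class and method names are hypothetical rather than taken from HumanEval-Java-Bug. The point is that the transformed variant keeps the same behavior, including the bug, so a robust repair model should be able to fix both forms.

```java
public class TransformationSketch {

    // Hypothetical buggy method in its canonical form: the loop bound
    // skips the last element (the fix would be i < nums.length).
    static int sumOriginal(int[] nums) {
        int total = 0;
        for (int i = 0; i < nums.length - 1; i++) {
            total += nums[i];
        }
        return total;
    }

    // The same buggy method after two assumed transformations:
    // the for loop is rewritten as a while loop and identifiers are renamed.
    // Behavior is unchanged, so the bug is still present.
    static int sumTransformed(int[] values) {
        int acc = 0;
        int idx = 0;
        while (idx < values.length - 1) {
            acc += values[idx];
            idx++;
        }
        return acc;
    }

    public static void main(String[] args) {
        int[][] inputs = { {}, {5}, {1, 2, 3}, {-4, 10, 7, 2} };
        for (int[] input : inputs) {
            // Both variants must agree on every input, bug included.
            if (sumOriginal(input) != sumTransformed(input)) {
                throw new AssertionError("variants diverge");
            }
        }
        System.out.println("Both variants agree on all sample inputs");
    }
}
```

A benchmark of this kind would presumably present both variants to a repair model and check whether the generated patch passes the same test suite in each case. Agreement on a handful of sample inputs, as checked above, only illustrates equivalence and does not prove it.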
