A Metamorphic Testing Approach to Diagnosing Memorization in LLM-Based Program Repair
Milan De Koning, Ali Asgari, Pouria Derakhshanfar, Annibale Panichella

TL;DR
This paper combines metamorphic testing with negative log-likelihood (NLL) to detect data leakage in LLM-based program repair. Models repair fewer bugs on benchmarks rewritten with semantics-preserving transformations than on the originals, which points to memorization.
Contribution
The paper introduces an approach that combines metamorphic testing with NLL to diagnose memorization and data leakage in evaluations of LLM-based program repair.
Findings
State-of-the-art LLMs show significant success rate drops on transformed benchmarks.
Performance degradation correlates strongly with NLL, indicating memorization.
Metamorphic testing helps mitigate effects of data leakage in evaluations.
Abstract
LLM-based automated program repair (APR) techniques have shown promising results in reducing debugging costs. However, prior results can be affected by data leakage: large language models (LLMs) may memorize bug fixes when evaluation benchmarks overlap with their pretraining data, leading to inflated performance estimates. In this paper, we investigate whether we can better reveal data leakage by combining metamorphic testing (MT) with negative log-likelihood (NLL), which has been used in prior work as a proxy for memorization. We construct variant benchmarks by applying semantics-preserving transformations to two widely used datasets, Defects4J and GitBug-Java. Using these benchmarks, we evaluate the repair success rates of seven LLMs on both original and transformed versions, and analyze the relationship between performance degradation and NLL. Our results show that all evaluated…
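To make the two ingredients concrete, here is a minimal sketch in Python. It is not the authors' tooling. The abstract does not list the transformations used, so identifier renaming stands in as one common semantics-preserving example, and the helper names `rename_identifier` and `mean_nll` are hypothetical. The renaming is a naive regex pass that would also touch strings and comments; a real implementation would work on a parsed syntax tree. The NLL helper assumes per-token natural-log probabilities have already been obtained from the model.

```python
import math
import re
from typing import Sequence


def rename_identifier(source: str, old: str, new: str) -> str:
    """Rename whole-word occurrences of `old` to `new` (hypothetical helper).

    Illustrative only: a regex cannot tell code from strings or comments,
    so this is a stand-in for an AST-based transformation.
    """
    pattern = re.compile(rf"\b{re.escape(old)}\b")
    # A function replacement avoids backslash escapes being interpreted in `new`.
    return pattern.sub(lambda _match: new, source)


def mean_nll(token_logprobs: Sequence[float]) -> float:
    """Average negative log-likelihood per token (hypothetical helper).

    Lower values mean the model finds the text more predictable, which
    prior work uses as a proxy for memorization.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return -sum(token_logprobs) / len(token_logprobs)


# Assumed example input: a small Java fragment, not taken from either benchmark.
original = "int sum = a + b; return sum;"
variant = rename_identifier(original, "sum", "total")

# Assumed log-probabilities: two tokens, each predicted with probability 0.5.
example_nll = mean_nll([math.log(0.5), math.log(0.5)])
```

Under this setup, a model that has memorized a benchmark fix would be expected to assign a low NLL to the original code and to repair the renamed variant less often, which is the relationship the paper analyzes.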
