ReCatcher: Towards LLMs Regression Testing for Code Generation

Altaf Allah Abbassi, Leuson Da Silva, Amin Nikanjam, and Foutse Khomh

arXiv:2507.19390 · cs.SE · July 28, 2025

Open Access

TL;DR

ReCatcher is a framework that systematically detects regressions in LLM code generation across correctness, code quality, and performance after a model update, helping practitioners decide whether to adopt the update.

Contribution

The paper introduces ReCatcher, a regression testing framework for evaluating LLM code generation across three update scenarios: fine-tuning, merging, and model release.

Findings

01 Fine-tuning with cross-language datasets increases syntax errors by up to 12%.

02 Merging with general-purpose models such as Llama2 causes correctness regressions of up to 18%.

03 GPT-4o shows regressions of up to 50% in handling missing imports.

Abstract

Large Language Models (LLMs) for code generation evolve rapidly through fine-tuning, merging, or new model releases. However, such updates can introduce regressions, not only in correctness but also in code quality and performance. To address this, we present ReCatcher, a regression testing framework for Python code generation. ReCatcher systematically compares two LLMs, typically a current model and a candidate update, across three dimensions: logical correctness, static code quality, and execution performance. We apply ReCatcher to assess regressions across three update scenarios (fine-tuning, merging, and model release) using CodeLlama, DeepSeek-Coder, and GPT-4o. Our evaluation shows that fine-tuning with cross-language datasets increases syntax errors by up to 12%. Merging with general-purpose models like Llama2 leads to regressions in correctness by up to 18%. GPT-4o introduces…
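The pairwise comparison described in the abstract can be illustrated with a minimal sketch for one static check, the syntax-error rate. This is not ReCatcher's implementation: the function names (has_syntax_error, syntax_error_rate, regression_delta) and the sample outputs are hypothetical, and the only assumption is that each model's generated solutions are available as Python source strings.

```python
import ast


def has_syntax_error(source: str) -> bool:
    """Return True if the snippet does not parse as Python."""
    try:
        ast.parse(source)
    except SyntaxError:
        return True
    return False


def syntax_error_rate(snippets: list[str]) -> float:
    """Fraction of snippets that fail to parse (0.0 for an empty list)."""
    if not snippets:
        return 0.0
    return sum(has_syntax_error(s) for s in snippets) / len(snippets)


def regression_delta(current: list[str], candidate: list[str]) -> float:
    """Difference in syntax-error rate, candidate minus current.

    A positive value means the candidate model produces more
    unparsable code than the current model, i.e. a regression.
    """
    return syntax_error_rate(candidate) - syntax_error_rate(current)


# Hypothetical outputs from a current model and a candidate update
# for the same two prompts.
current_outputs = [
    "def add(a, b):\n    return a + b\n",
    "print('ok')\n",
]
candidate_outputs = [
    "def add(a, b):\n    return a + b\n",
    "def broken(:\n    pass\n",  # deliberately invalid syntax
]

delta = regression_delta(current_outputs, candidate_outputs)
print(f"syntax-error rate change: {delta:+.0%}")
```

The same shape (a per-snippet check, a rate per model, and a signed difference) would extend to the other dimensions the abstract names, with test-suite pass rates for correctness and measured run time for performance.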

Taxonomy

Topics: Software Engineering Research · Software Testing and Debugging Techniques · Topic Modeling