ReCatcher: Towards LLMs Regression Testing for Code Generation
Altaf Allah Abbassi, Leuson Da Silva, Amin Nikanjam, and Foutse Khomh

TL;DR
ReCatcher is a regression testing framework that systematically detects regressions in LLM code generation across correctness, code quality, and performance after a model update, to support informed decisions about adopting that update.
Contribution
The paper introduces ReCatcher, a regression testing framework designed specifically for evaluating LLM code generation across multiple update scenarios.
Findings
Fine-tuning with cross-language datasets increases syntax errors by up to 12%.
Merging with general-purpose models such as Llama2 causes correctness regressions of up to 18%.
GPT-4o shows regressions of up to 50% in handling missing imports.
Abstract
Large Language Models (LLMs) for code generation evolve rapidly through fine-tuning, merging, or new model releases. However, such updates can introduce regressions, not only in correctness but also in code quality and performance. To address this, we present ReCatcher, a regression testing framework for Python code generation. ReCatcher systematically compares two LLMs, typically a current model and a candidate update, across three dimensions: logical correctness, static code quality, and execution performance. We apply ReCatcher to assess regressions across three update scenarios (fine-tuning, merging, and model release) using CodeLlama, DeepSeek-Coder, and GPT-4o. Our evaluation shows that fine-tuning with cross-language datasets increases syntax errors by up to 12%. Merging with general-purpose models like Llama2 leads to regressions in correctness by up to 18%. GPT-4o introduces…
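The two-model comparison described in the abstract can be illustrated with a minimal Python sketch. This is not ReCatcher's implementation: the `TaskResult` schema, the `compare_models` function, the 10% slowdown tolerance, and the sample data are all hypothetical, chosen only to show what a per-dimension regression rate between a current model and a candidate update might look like.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Outcome of one generated solution for one task (hypothetical schema)."""
    passed_tests: bool      # logical correctness
    quality_issues: int     # count of static-analysis findings
    runtime_seconds: float  # measured execution time


def compare_models(current, candidate, slowdown_tolerance=1.10):
    """Return, per dimension, the percentage of shared tasks on which
    the candidate model regresses relative to the current model.

    `current` and `candidate` map task ids to TaskResult objects.
    The tolerance is an assumed threshold, not a value from the paper.
    """
    shared = sorted(set(current) & set(candidate))
    if not shared:
        raise ValueError("the two models have no tasks in common")

    counts = {"correctness": 0, "quality": 0, "performance": 0}
    for task_id in shared:
        old, new = current[task_id], candidate[task_id]
        # Correctness regression: a task that used to pass now fails.
        if old.passed_tests and not new.passed_tests:
            counts["correctness"] += 1
        # Quality regression: more static-analysis findings than before.
        if new.quality_issues > old.quality_issues:
            counts["quality"] += 1
        # Performance regression: only meaningful when both solutions are correct.
        both_pass = old.passed_tests and new.passed_tests
        if both_pass and new.runtime_seconds > old.runtime_seconds * slowdown_tolerance:
            counts["performance"] += 1

    return {dim: 100.0 * n / len(shared) for dim, n in counts.items()}


if __name__ == "__main__":
    # Made-up results for four tasks, for illustration only.
    current = {
        "t1": TaskResult(True, 0, 1.0),
        "t2": TaskResult(True, 1, 2.0),
        "t3": TaskResult(False, 2, 0.5),
        "t4": TaskResult(True, 0, 1.0),
    }
    candidate = {
        "t1": TaskResult(False, 0, 1.0),
        "t2": TaskResult(True, 3, 2.0),
        "t3": TaskResult(False, 2, 0.5),
        "t4": TaskResult(True, 0, 1.5),
    }
    print(compare_models(current, candidate))
```

Reporting each dimension separately reflects the abstract's point that an update can regress in code quality or performance even when correctness is unchanged.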
Taxonomy
Topics: Software Engineering Research · Software Testing and Debugging Techniques · Topic Modeling
