A Differential Fuzzing-Based Evaluation of Functional Equivalence in LLM-Generated Code Refactorings

Simantika Bhattacharjee Dristi; Matthew B. Dwyer

arXiv:2602.15761·cs.SE·February 18, 2026

TL;DR

This paper uses differential fuzzing to evaluate the functional equivalence of LLM-generated code refactorings. It finds that 19-35% of refactorings are functionally non-equivalent to the original code, and that about 21% of these non-equivalent refactorings go undetected by existing test suites.

Contribution

The paper proposes a novel differential fuzzing-based method that assesses functional equivalence without predefined test cases, enabling large-scale evaluation of LLM-generated code refactorings.

Findings

01. 19-35% of refactorings are functionally non-equivalent.

02. Approximately 21% of non-equivalent refactorings are undetected by existing tests.

03. LLMs can produce refactorings that alter program semantics.

Abstract

With the rapid adoption of large language models (LLMs) in automated code refactoring, assessing and ensuring functional equivalence between an LLM-generated refactoring and the original implementation becomes critical. While prior work typically relies on predefined test cases to evaluate correctness, in this work, we leverage differential fuzzing to check functional equivalence in LLM-generated code refactorings. Unlike test-based evaluation, a differential fuzzing-based equivalence checker needs no predefined test cases and can explore a much larger input space by executing and comparing thousands of automatically generated test inputs. In a large-scale evaluation of six LLMs (CodeLlama, Codestral, StarChat2, Qwen-2.5, Olmo-3, and GPT-4o) across three datasets and two refactoring types, we find that LLMs show a non-trivial tendency to alter program semantics, producing 19-35%…
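To make the idea of differential fuzzing concrete, the following is a minimal sketch and not the authors' tool. It assumes a toy `original` function, a hypothetical `refactored` version that silently drops an edge-case guard, and a simple random input generator; all names and the input distribution are illustrative assumptions. The checker runs both versions on the same generated inputs and reports the first input on which their observable behavior (return value or exception type) differs.

```python
import random


def original(xs):
    """Reference implementation: mean of a list, 0.0 for an empty list."""
    if len(xs) == 0:
        return 0.0
    total = 0
    for x in xs:
        total += x
    return total / len(xs)


def refactored(xs):
    """Hypothetical 'refactoring' that drops the empty-list guard."""
    return sum(xs) / len(xs)


def run(fn, arg):
    """Execute fn and capture either its return value or its exception type."""
    try:
        return ("ok", fn(arg))
    except Exception as exc:
        return ("error", type(exc).__name__)


def differential_fuzz(f, g, trials=2000, seed=0):
    """Return the first generated input on which f and g disagree, else None."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    for _ in range(trials):
        size = rng.randint(0, 5)
        xs = [rng.randint(-100, 100) for _ in range(size)]
        if run(f, xs) != run(g, xs):
            return xs
    return None


# A small hand-written test suite (illustrative) that misses the edge case.
existing_tests = [[1, 2, 3], [10], [-2, 2]]
tests_pass = all(original(t) == refactored(t) for t in existing_tests)

counterexample = differential_fuzz(original, refactored)
print("existing tests pass:", tests_pass)
print("diverging input:", counterexample)
```

In this toy setting the hand-written tests all pass, yet the fuzzer finds the empty list as a diverging input, which mirrors the kind of non-equivalence that a predefined test suite can miss. A real checker would need type-aware input generation and a more careful notion of observable behavior than this sketch uses.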

Taxonomy

Topics: Software Testing and Debugging Techniques · Software Engineering Research · Model-Driven Software Engineering Techniques