SWE-Refactor: A Repository-Level Benchmark for Real-World LLM-Based Code Refactoring
Yisen Xu, Jinqiu Yang, Tse-Hsun (Peter) Chen

TL;DR
SWE-Refactor introduces a comprehensive repository-level benchmark with 1,099 real-world, behavior-preserving Java refactorings to evaluate LLMs' effectiveness in semantic code editing.
Contribution
SWE-Refactor provides the first large-scale, validated benchmark for LLM-based code refactoring at the repository level, addressing the limited scenario coverage, mixed-in unrelated changes, and missing repository context of earlier benchmarks.
Findings
LLMs struggle with complex and compound refactorings.
OpenAI Codex achieves a 39.4% success rate on compound instances.
Benchmark and evaluation results are publicly released.
Abstract
Large Language Models (LLMs) have recently attracted wide interest for tackling software engineering tasks. In contrast to code generation, refactoring demands precise, semantics-preserving edits that improve program structure, which also makes automated evaluation challenging. However, existing refactoring benchmarks commonly suffer from three shortcomings: limited coverage of refactoring scenarios, the inclusion of instances that mix refactoring with unrelated changes, and insufficient repository-level context for realistic assessment. To mitigate these issues, we introduce SWE-Refactor, a new benchmark for LLM-based code refactoring. SWE-Refactor comprises 1,099 developer-written, behavior-preserving refactorings mined from 18 Java projects, including 922 atomic and 177 compound instances. Each instance is validated via compilation, test execution, and automated refactoring detection…
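The three validation checks named in the abstract (compilation, test execution, automated refactoring detection) can be read as a simple filter over mined candidates. The following is a minimal sketch under that reading, not the authors' pipeline: the class `ValidationSketch`, the nested `Candidate` type, and its field names are hypothetical, and the boolean flags stand in for the results of real build, test, and detection tools.

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch only: all names here are assumptions, not part of SWE-Refactor.
final class ValidationSketch {

    // One mined refactoring candidate with the outcome of each check.
    static final class Candidate {
        final String id;
        final boolean compiles;            // project builds after the change
        final boolean testsPass;           // existing tests still pass
        final boolean refactoringDetected; // a detection tool confirms a refactoring

        Candidate(String id, boolean compiles, boolean testsPass, boolean refactoringDetected) {
            this.id = id;
            this.compiles = compiles;
            this.testsPass = testsPass;
            this.refactoringDetected = refactoringDetected;
        }

        // A candidate is kept only if every check succeeds.
        boolean isValidated() {
            return compiles && testsPass && refactoringDetected;
        }
    }

    // Returns the candidates that pass all three checks, in input order.
    static List<Candidate> keepValidated(List<Candidate> candidates) {
        return candidates.stream()
                .filter(Candidate::isValidated)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Candidate> mined = List.of(
                new Candidate("c1", true, true, true),
                new Candidate("c2", true, false, true),   // tests fail
                new Candidate("c3", true, true, false));  // no refactoring detected
        for (Candidate c : keepValidated(mined)) {
            System.out.println(c.id);
        }
    }
}
```

Requiring all three checks together is the conservative choice: a change that compiles and passes tests but is not recognized as a refactoring may be an unrelated edit, which is one of the shortcomings the abstract attributes to earlier benchmarks.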
Taxonomy
Topics: Software Engineering Research · Software Testing and Debugging Techniques · Model-Driven Software Engineering Techniques
