Automated Extract Method Refactoring with Open-Source LLMs: A Comparative Study
Sivajeet Chand, Melih Kilic, Roland Würsching, Sushant Kumar Pandey, Alexander Pretschner

TL;DR
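This study evaluates five open-source large language models (3B to 8B parameters) on automated Extract Method refactoring (EMR) for Python. It finds that Recursive Criticism and Improvement (RCI) prompting yields higher test pass rates and refactoring quality than one-shot prompting, with over 70% developer acceptance of RCI-generated refactorings.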
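As a minimal illustration of what an Extract Method refactoring looks like in Python, the sketch below moves an inlined loop into its own named helper without changing behavior. The function names and data are hypothetical and are not taken from the paper.

```python
# Hypothetical example; not taken from the paper's dataset.

def report_before(orders):
    # Original: the total is computed inline inside the reporting function.
    total = 0.0
    for price, quantity in orders:
        total += price * quantity
    return f"Total: {total:.2f}"


def compute_total(orders):
    # Extracted method: the loop now has a name and can be tested on its own.
    total = 0.0
    for price, quantity in orders:
        total += price * quantity
    return total


def report_after(orders):
    # Refactored: same observable behavior, delegating to the extracted helper.
    return f"Total: {compute_total(orders):.2f}"
```

A refactoring of this kind is functionally correct only if both versions return the same result for the same input, which is what a test suite run before and after the change would check.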
Contribution
The paper provides a systematic comparison of open-source LLMs on automated Extract Method refactoring under one-shot and RCI prompting, highlighting the effectiveness of RCI strategies and establishing a benchmark for future research.
Findings
RCI prompting outperforms one-shot prompting in test pass rates.
Deepseek-Coder-RCI and Qwen2.5-Coder-RCI achieve high test pass percentages.
RCI-generated refactorings receive over 70% developer acceptance.
Abstract
Automating the Extract Method refactoring (EMR) remains challenging and largely manual despite its importance in improving code readability and maintainability. Recent advances in open-source, resource-efficient Large Language Models (LLMs) offer promising new approaches for automating such high-level tasks. In this work, we critically evaluate five state-of-the-art open-source LLMs, spanning 3B to 8B parameter sizes, on the EMR task for Python code. We systematically assess functional correctness and code quality using automated metrics and investigate the impact of prompting strategies by comparing one-shot prompting to a Recursive Criticism and Improvement (RCI) approach. RCI-based prompting consistently outperforms one-shot prompting in test pass rates and refactoring quality. The best-performing models, Deepseek-Coder-RCI and Qwen2.5-Coder-RCI, achieve test pass percentage (TPP)…
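The contrast between one-shot and RCI prompting can be sketched roughly as follows. This is an illustrative outline only, not the authors' implementation: the prompt wording, the `model` callable, the `rounds` parameter, and the exact definition of the pass percentage are all assumptions made for the example.

```python
from typing import Callable, List


def rci_refactor(code: str, model: Callable[[str], str], rounds: int = 2) -> str:
    """Sketch of a critique-and-improve loop (prompts are hypothetical)."""
    # The first call alone corresponds to one-shot prompting.
    candidate = model(f"Refactor this Python code using Extract Method:\n{code}")
    for _ in range(rounds):
        # Ask the model to criticize its own output...
        critique = model(f"Critique this refactoring:\n{candidate}")
        # ...then ask it to improve the output based on that critique.
        candidate = model(
            "Improve the refactoring given this critique:\n"
            f"{critique}\n\nCode:\n{candidate}"
        )
    return candidate


def pass_percentage(results: List[bool]) -> float:
    """Assumed metric: share of refactored samples whose tests pass, in percent."""
    if not results:
        return 0.0
    return 100.0 * sum(results) / len(results)
```

Here `model` stands in for any function that sends a prompt to an LLM and returns its text response. With `rounds=0` the loop degenerates to one-shot prompting, which makes the two strategies directly comparable on the same inputs.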
