Refactoring Codebases through Library Design

Ziga Kovacic; Justin T. Chiu; Celine Lee; Wenting Zhao; Kevin Ellis

arXiv:2506.11058 · cs.SE · October 7, 2025

TL;DR

This paper explores how code agents can effectively refactor code into reusable libraries, introducing a benchmark and a method that outperform existing approaches in promoting maintainability and growth.

Contribution

It presents MiniCode, a new benchmark for refactoring into shared libraries, and Librarian, a novel method that improves library generation quality.

Findings

01 Minimum Description Length correlates with good refactorings

02 Librarian outperforms state-of-the-art library generation methods

03 Librarian effectively refactors real-world codebases

Abstract

Maintainable and general software allows developers to build robust applications efficiently, yet achieving these qualities often requires refactoring specialized solutions into reusable components. This challenge becomes particularly relevant as code agents are increasingly used to solve isolated one-off programming problems. We investigate code agents' capacity to refactor code in ways that support growth and reusability. We first investigate what makes a good refactoring, finding via simulation results and a human study that Minimum Description Length best correlates with preferable refactorings. We then present both a benchmark and a method for refactoring: MiniCode, a benchmark where multiple files must be refactored into a shared library, and Librarian, a sample-and-rerank method for generating reusable libraries. We compare Librarian to state-of-the-art library generation methods, and…
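The sample-and-rerank idea named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Candidate` type, the `passes_tests` callback, and the use of zlib-compressed size as a stand-in for description length are all assumptions made for illustration, since the excerpt does not say how Librarian scores candidates. The sketch keeps only the sampled refactorings that pass their tests and returns the one with the smallest description-length proxy.

```python
import zlib
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """One sampled refactoring (hypothetical type): a shared library plus rewritten programs."""
    library: str
    programs: tuple[str, ...]


def description_length(candidate: Candidate) -> int:
    """Proxy score: size in bytes of the zlib-compressed library and programs.

    Assumption: a general-purpose compressor stands in for whatever
    description-length measure the paper actually uses.
    """
    text = "\n".join((candidate.library, *candidate.programs))
    return len(zlib.compress(text.encode("utf-8"), 9))


def rerank(
    candidates: Sequence[Candidate],
    passes_tests: Callable[[Candidate], bool],
) -> Optional[Candidate]:
    """Return the shortest candidate that passes its tests, or None if none pass."""
    valid = [c for c in candidates if passes_tests(c)]
    return min(valid, key=description_length, default=None)
```

In use, `candidates` would hold several refactorings sampled from a code model, and `passes_tests` would run the original unit tests against each one. Filtering on correctness before ranking keeps a short but broken refactoring from being selected.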

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

- This paper is well written and achieves impressive real-world validation by refactoring HuggingFace production code with 67% MDL reduction while maintaining correctness.
- This work provides systematic comparison of multiple metrics through asymptotic analysis and human studies, finding MDL superior to traditional software engineering metrics.

Weaknesses

- This evaluation covers only 10 Transformers files and 2 Diffusers tasks, which seems insufficient to support claims about general applicability to real software projects.
- This human study with only 12 participants lacks statistical power to distinguish between MDL and tokens metrics, yet the authors make strong claims about MDL superiority.

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- Code refactoring is an important software engineering activity. This paper demonstrates progress on this problem using a pipeline that clusters code by natural language summary, extracts a cluster-specific library, and then rewrites the complete code corpus.
- It assembles a benchmark taking code contest solutions, previous refactoring benchmarks and small sets of related files from transformers and diffusers libraries. The resulting refactorings are ranked using MDL and evaluated for corre

Weaknesses

- While the problem of refactoring is important, the proposed method is evaluated in a limited setting. It does not present results at large scale, where refactorings are most important and useful. Though the paper states that the proposed method is evaluated on "real-world code bases", the scope is restricted to a total of 3 tasks with 10 files each from 2 repositories.
- The paper's novelty over past work, Regal, is limited, as both apply clustering-based refactoring.
- The study of diffe

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- MINICODE offers a practical and effective benchmark for evaluating code refactoring.
- The proposed metric, MDL, is reasonable and interesting.

Weaknesses

- The proposed sample-and-rerank approach is relatively simple, and the methodological insights it provides are limited.
- The main risk of using MDL is that it can be heavily influenced by a single model. The paper only briefly discusses cross-model agreement for MDL in Section 6; a more detailed analysis would make the claim more convincing.
- Even if the refactored code passes all unit tests, there is still a risk of semantic inequivalence with the original code. The paper lacks an analysis o

Taxonomy

Topics: Algorithms and Data Compression · Web Data Mining and Analysis · Software Engineering Research