Refactoring Codebases through Library Design
Ziga Kovacic, Justin T. Chiu, Celine Lee, Wenting Zhao, Kevin Ellis

TL;DR
This paper explores how code agents can refactor code into reusable libraries. It introduces a benchmark, MiniCode, and a method, Librarian, which outperforms existing library generation approaches.
Contribution
It presents MiniCode, a new benchmark for refactoring into shared libraries, and Librarian, a novel method that improves library generation quality.
Findings
Minimum Description Length correlates with good refactorings
Librarian outperforms state-of-the-art library generation methods
Librarian effectively refactors real-world codebases
Abstract
Maintainable and general software allows developers to build robust applications efficiently, yet achieving these qualities often requires refactoring specialized solutions into reusable components. This challenge becomes particularly relevant as code agents are increasingly used to solve isolated one-off programming problems. We investigate code agents' capacity to refactor code in ways that support growth and reusability. We first investigate what makes a good refactoring, finding via simulation results and a human study that Minimum Description Length best correlates with preferable refactorings. We then present both a benchmark and a method for refactoring: MiniCode, a benchmark where multiple files must be refactored into a shared library, and Librarian, a sample-and-rerank method for generating reusable libraries. We compare Librarian to state-of-the-art library generation methods, and…
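The sample-and-rerank idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes candidates are lists of file contents, uses zlib-compressed size as a swapped-in stand-in for description length (this summary does not specify how the paper measures MDL), and treats `passes_tests` as a hypothetical correctness check supplied by the caller. The names `Candidate`, `compressed_size`, and `rerank` are illustrative only.

```python
import zlib
from typing import Callable, List, Optional, Sequence

Candidate = List[str]  # hypothetical representation: one string per file


def compressed_size(files: Candidate) -> int:
    """Stand-in description length: bytes after zlib compression (an assumption)."""
    blob = "\n".join(files).encode("utf-8")
    return len(zlib.compress(blob, 9))


def rerank(
    candidates: Sequence[Candidate],
    passes_tests: Callable[[Candidate], bool],
    cost: Callable[[Candidate], int] = compressed_size,
) -> Optional[Candidate]:
    """Return the cheapest candidate that passes its tests, or None if none pass."""
    valid = [c for c in candidates if passes_tests(c)]
    return min(valid, key=cost) if valid else None


# Toy candidates: a compact refactoring and a much more verbose one.
compact = ["def add_one(x):\n    return x + 1\n"]
verbose = compact + [f"def scale_{i}(x):\n    return x * {i}\n" for i in range(200)]

best = rerank([verbose, compact], passes_tests=lambda c: True)
```

Filtering on correctness before ranking keeps a short but broken candidate from winning. The `cost` parameter is left pluggable so that a model-based description length could replace the compression proxy.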
Peer Reviews
Decision: Submitted to ICLR 2026
- This paper is well written and achieves impressive real-world validation by refactoring HuggingFace production code with 67% MDL reduction while maintaining correctness.
- This work provides systematic comparison of multiple metrics through asymptotic analysis and human studies, finding MDL superior to traditional software engineering metrics.

- This evaluation covers only 10 Transformers files and 2 Diffusers tasks, which seems insufficient to support claims about general applicability to real software projects.
- This human study with only 12 participants lacks statistical power to distinguish between MDL and tokens metrics, yet the authors make strong claims about MDL superiority.

- Code refactoring is an important software engineering activity. This paper demonstrates progress on this problem using a pipeline of clustering of code by natural language summary, cluster-specific library extraction and then rewriting the complete code corpus.
- It assembles a benchmark taking code contest solutions, previous refactoring benchmarks and small sets of related files from transformers and diffusers libraries. The resulting refactorings are ranked using MDL and evaluated for corre

- While the problem of refactoring is important, the proposed method is evaluated in a limited setting. It does not present results at large scale where refactorings are most important and useful. Though the paper states that the proposed method is evaluated on "real-world code bases", the scope is restricted to a total of 3 tasks with 10 files each from 2 repositories.
- The paper's novelty over past work, Regal, is limited as both of them apply clustering-based refactoring.
- The study of diffe

- MINICODE offers a practical and effective benchmark for evaluating code refactoring.
- The proposed metric MDL is reasonable and interesting.

- The proposed sample-and-rerank approach is relatively simple, and the methodological insights it provides are limited.
- The main risk of using MDL is that it can be heavily influenced by a single model. The paper only briefly discusses cross-model agreement for MDL in Section 6; a more detailed analysis would make the claim more convincing.
- Even if the refactored code passes all unit tests, there is still a risk of semantic inequivalence with the original code. The paper lacks an analysis o
Taxonomy
Topics: Algorithms and Data Compression · Web Data Mining and Analysis · Software Engineering Research
