TL;DR
This paper evaluates large language model agents on their ability to repair architectural code smells. It introduces SmellBench, a framework for systematic assessment, and reveals current limitations in cross-module refactoring.
Contribution
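It presents SmellBench, a task orchestration framework for evaluating LLM agents on architectural smell repair. The framework combines smell-type-specific optimized prompts, iterative multi-step execution, and a scoring methodology that separately measures repair effectiveness, false positive identification, and net codebase impact.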
Findings
63.1% of detected smells are false positives
Best agent resolves 47.7% of true smells
Most aggressive agent introduces 140 new smells
Abstract
Architectural code smells erode software maintainability and are costly to repair manually, yet unlike localized bugs, they require cross-module reasoning about design intent that challenges both developers and automated tools. While large language model agents excel at bug fixing and code-level refactoring, their ability to repair architectural code smells remains unexplored. We present the first empirical evaluation of LLM agents on architectural code smell repair. We contribute SmellBench, a task orchestration framework that incorporates smell-type-specific optimized prompts and supports iterative multi-step execution, together with a scoring methodology that separately evaluates repair effectiveness, false positive identification, and net codebase impact. We evaluate 11 agent configurations from four model families (GPT, Claude, Gemini, Mistral) on 65 hard-severity architectural…
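The abstract names three separately reported scoring axes but does not give their formulas here. The sketch below is one plausible way such axes could be computed, not the paper's definition: the `RunResult` fields, the `score` function, and every number in the example are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """Hypothetical per-agent tallies; field names are assumptions, not SmellBench's schema."""
    true_smells: int      # detected smells confirmed as genuine problems
    false_positives: int  # detected smells that are not genuine problems
    resolved_true: int    # genuine smells the agent repaired
    flagged_false: int    # false positives the agent correctly declined to "fix"
    introduced: int       # new smells that appear after the agent's changes


def score(r: RunResult) -> dict:
    """Report the three axes separately instead of folding them into one number."""
    repair_rate = r.resolved_true / r.true_smells if r.true_smells else 0.0
    fp_identification = r.flagged_false / r.false_positives if r.false_positives else 0.0
    # Net impact: a negative value means the codebase ends with fewer smells.
    net_change = r.introduced - r.resolved_true
    return {
        "repair_rate": repair_rate,
        "fp_identification": fp_identification,
        "net_change": net_change,
    }


# Made-up numbers for illustration only; they are not results from the paper.
example = RunResult(true_smells=20, false_positives=30,
                    resolved_true=10, flagged_false=15, introduced=4)
scores = score(example)
print(scores)
```

Keeping the axes apart matters because an aggressive agent can score well on repair rate while still leaving the codebase worse off on net change, which is the kind of trade-off the findings above point to.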
