Clean Code, Better Models: Enhancing LLM Performance with Smell-Cleaned Dataset

Zhipeng Xue; Xiaoting Zhang; Zhipeng Gao; Xing Hu; Shan Gao; Xin Xia; Shanping Li

arXiv:2508.11958·cs.SE·August 19, 2025

Open Access

TL;DR

This paper systematically investigates the presence of code smells in LLM training datasets and outputs, proposing an automatic cleaning tool to improve dataset quality and enhance LLM performance in code-related tasks.

Contribution

It introduces SmellCC, an automatic code smell cleaning tool, and demonstrates its effectiveness in improving LLM training data and downstream task performance.

Findings

01. Code smells are prevalent in benchmark datasets and LLM outputs.

02. Cleaning code smells improves LLM fine-tuning and downstream task results.

03. The curated dataset leads to better code quality in generated outputs.

Abstract

Large Language Models (LLMs) have demonstrated great potential in code-related tasks. However, most research focuses on improving the output quality of LLMs (e.g., correctness), and less attention has been paid to the LLM input (e.g., the quality of the training code). Given that code smells are widespread in practice and can negatively impact software maintainability and readability, this study presents the first systematic research to assess and improve dataset quality in terms of code smells. In this work, we first conduct a preliminary study to explore the presence of code smells in a popular benchmark dataset (i.e., CodeSearchNet-Python) and evaluate the output of several popular LLMs (i.e., DeepSeek-Coder, CodeLlama, and MagiCoder), revealing that code smell issues are widespread in both LLM input (e.g., the benchmark dataset) and output (e.g., generated code). We then conduct our…

Taxonomy

Topics: Scientific Computing and Data Management · Digital Rights Management and Security · Software Engineering Research