The Prevalence of Code Smells in Machine Learning projects

Bart van Oort; Luís Cruz; Maurício Aniche; Arie van Deursen

arXiv:2103.04146·cs.SE·March 9, 2021

TL;DR

This study investigates common code smells in machine learning projects, revealing widespread issues like code duplication and dependency management problems that hinder maintainability and reproducibility.

Contribution

The paper provides an empirical analysis of the most prevalent code smells in open-source ML projects and highlights specific challenges in dependency management and in static analysis tooling.

Findings

01. Code duplication is widespread in ML projects.

02. Dependency management issues obstruct maintainability.

03. Pylint struggles to verify correct usage of ML libraries.

Abstract

Artificial Intelligence (AI) and Machine Learning (ML) are pervasive in the current computer science landscape. Yet, there still exists a lack of software engineering experience and best practices in this field. One such best practice, static code analysis, can be used to find code smells, i.e., (potential) defects in the source code, refactoring opportunities, and violations of common coding standards. Our research set out to discover the most prevalent code smells in ML projects. We gathered a dataset of 74 open-source ML projects, installed their dependencies and ran Pylint on them. This resulted in a top 20 of all detected code smells, per category. Manual analysis of these smells mainly showed that code duplication is widespread and that the PEP8 convention for identifier naming style may not always be applicable to ML code due to its resemblance with mathematical notation. More…
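The tallying step the abstract describes (run Pylint, then rank detected smells per category) can be sketched roughly as follows. This is an illustration only, not the authors' tooling: it assumes Pylint was run with `--output-format=json`, whose records carry `type` and `symbol` fields, and the function name `top_smells` and the sample records are hypothetical.

```python
import json
from collections import Counter, defaultdict


def top_smells(messages, n=20):
    """Rank Pylint message symbols by frequency within each category.

    `messages` is a list of dicts as produced by Pylint's JSON output;
    only the "type" (category) and "symbol" (message name) keys are used.
    """
    per_category = defaultdict(Counter)
    for msg in messages:
        per_category[msg["type"]][msg["symbol"]] += 1
    # Keep only the n most frequent symbols in each category.
    return {cat: counts.most_common(n) for cat, counts in per_category.items()}


# Hypothetical sample records, shaped like Pylint's JSON output
# (made-up data, not results from the paper).
sample = json.loads("""
[
  {"type": "convention", "symbol": "invalid-name",   "path": "train.py"},
  {"type": "convention", "symbol": "invalid-name",   "path": "model.py"},
  {"type": "convention", "symbol": "line-too-long",  "path": "model.py"},
  {"type": "refactor",   "symbol": "duplicate-code", "path": "train.py"},
  {"type": "warning",    "symbol": "unused-import",  "path": "utils.py"}
]
""")

result = top_smells(sample, n=1)
```

In practice the records would come from a report file, for example one written by `pylint --output-format=json <package> > report.json` and read back with `json.load`. Counting per category rather than overall keeps frequent convention messages from drowning out rarer warnings and errors.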
