On the experiences of adopting automated data validation in an industrial machine learning project
Lucy Ellen Lwakatare, Ellinor Rånge, Ivica Crnkovic, Jan Bosch

TL;DR
This paper explores the practical adoption of data validation processes in industrial machine learning projects, highlighting best practices, benefits, and barriers, and proposing a framework to systematize implementation.
Contribution
It introduces a data validation framework (DVF) and provides empirical insights into best practices and challenges in adopting data validation in industrial ML settings.
Findings
Identified three best practices for data validation adoption.
Highlighted three benefits of implementing data validation.
Outlined two barriers faced during adoption.
Abstract
Background: Data errors are a common challenge in machine learning (ML) projects and generally cause significant performance degradation in ML-enabled software systems. To ensure early detection of erroneous data and avoid training ML models on bad data, research and industrial practice suggest incorporating a data validation process and tool into the ML system development process. Aim: The study investigates the adoption of a data validation process and tool in industrial ML projects. The data validation process demands significant engineering resources for tool development and maintenance. Thus, it is important to identify the best practices for their adoption, especially by development teams that are in the early phases of deploying ML-enabled software systems. Method: Action research was conducted at a large software-intensive organization in telecommunications, specifically within…
