# Automated data validation: an industrial experience report

**Authors:** Lei Zhang, Sean Howard, Tom Montpool, Jessica Moore, Krittika Mahajan, Andriy Miranskyy

arXiv: 1903.03676 · 2022-12-06

## TL;DR

This paper presents an industrial case study of RESTORE, an open-source R package for automated data validation that improves error detection and reduces testing costs in data science workflows.

## Contribution

It introduces RESTORE, a novel automated data validation tool applying software engineering best practices to data science, demonstrated through a real-world industrial case study.

## Key findings

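- In a case study on geodemographic data, RESTORE detected errors injected during the data preparation phase efficiently and effectively.
- RESTORE significantly reduced the cost of testing in that case study.
- Automated data validation brings software engineering best practices into data science, targeting the quality degradation, release schedule slippage, and budget overruns that their absence can cause.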
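
To make the idea of automated data validation during data preparation concrete, here is a minimal sketch that compares a newly prepared dataset against a trusted reference version and reports discrepancies. It is not RESTORE's API (RESTORE is an R package, and this summary does not describe its interface). The function name `validate`, the dict-of-dicts data layout, and the 5% relative tolerance are all assumptions chosen for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical layout: row key (e.g. a region code) -> {column name: value}.
Row = Dict[str, Optional[float]]


def validate(reference: Dict[str, Row],
             candidate: Dict[str, Row],
             rel_tol: float = 0.05) -> List[str]:
    """Return human-readable problems found in `candidate` relative to `reference`.

    Illustrative only: the checks and the tolerance are assumptions,
    not the tests implemented by RESTORE.
    """
    problems: List[str] = []

    # Rows that disappeared from, or newly appeared in, the candidate.
    for key in sorted(reference.keys() - candidate.keys()):
        problems.append(f"missing row {key}")
    for key in sorted(candidate.keys() - reference.keys()):
        problems.append(f"unexpected row {key}")

    # For rows present in both versions, compare column by column.
    for key in sorted(reference.keys() & candidate.keys()):
        ref_row, new_row = reference[key], candidate[key]
        for col in sorted(ref_row.keys() - new_row.keys()):
            problems.append(f"{key}: missing column {col}")
        for col in sorted(ref_row.keys() & new_row.keys()):
            old, new = ref_row[col], new_row[col]
            if old is None or new is None:
                # Flag a value that became missing or stopped being missing.
                if old != new:
                    problems.append(f"{key}: {col} changed {old} -> {new}")
                continue
            # Flag changes larger than the relative tolerance.
            if abs(new - old) > rel_tol * max(abs(old), 1e-12):
                problems.append(f"{key}: {col} changed {old} -> {new}")
    return problems


if __name__ == "__main__":
    reference = {"A": {"pop": 100, "hh": 40}, "B": {"pop": 200, "hh": 80}}
    candidate = {"A": {"pop": 101, "hh": 40}, "B": {"pop": 2000}}
    for problem in validate(reference, candidate):
        print(problem)
```

Run in a test harness after every data preparation step, a check like this turns silent data errors into an explicit report that can fail a build, which is the kind of automation the findings above refer to.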

## Abstract

There has been a massive explosion of data generated by customers and retained by companies in the last decade. However, there is a significant mismatch between the increasing volume of data and the lack of automation methods and tools. The lack of best practices in data science programming may lead to software quality degradation, release schedule slippage, and budget overruns. To mitigate these concerns, we would like to bring software engineering best practices into data science. Specifically, we focus on automated data validation in the data preparation phase of the software development life cycle. This paper studies a real-world industrial case and applies software engineering best practices to develop an automated test harness called RESTORE. We release RESTORE as an open-source R package. Our experience report, done on the geodemographic data, shows that RESTORE enables efficient and effective detection of errors injected during the data preparation phase. RESTORE also significantly reduced the cost of testing. We hope that the community benefits from the open-source project and the practical advice based on our experience.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1903.03676/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/1903.03676/full.md

## References

64 references — full list in the complete paper: https://tomesphere.com/paper/1903.03676/full.md

---
Source: https://tomesphere.com/paper/1903.03676