Data Validation Infrastructure for R
Mark P.J. van der Loo, Edwin de Jonge

TL;DR
The paper introduces the 'validate' R package, which provides a flexible infrastructure for defining, applying, and managing data validation rules to ensure data quality in statistical analysis.
Contribution
It presents a comprehensive system for capturing, manipulating, and applying validation rules in R, enabling systematic data quality checks and reuse across data set versions.
Findings
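Lets domain experts define validation rules and attach metadata and documentation to them
Confronts data with those rules and makes the results available for inspection, summarization, or visualization
Stores rules in, and retrieves them from, external sources such as text files or tabular formats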
Abstract
Checking data quality against domain knowledge is a common activity that pervades statistical analysis from raw data to output. The R package 'validate' facilitates this task by capturing and applying expert knowledge in the form of validation rules: logical restrictions on variables, records, or data sets that should be satisfied before they are considered valid input for further analysis. In the validate package, validation rules are objects of computation that can be manipulated, investigated, and confronted with data or versions of a data set. The results of a confrontation are then available for further investigation, summarization or visualization. Validation rules can also be endowed with metadata and documentation and they may be stored or retrieved from external sources such as text files or tabular formats. This data validation infrastructure thus allows for systematic,…
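The workflow the abstract describes (define rules, confront data with them, inspect the results) can be sketched roughly as follows. This is a minimal illustration, assuming the validate package is installed; the data frame `dat` and the two rules are hypothetical and do not come from the paper.

```r
library(validate)

# Hypothetical toy data (an assumption for illustration, not from the paper).
dat <- data.frame(
  age    = c(34, -2, 51),
  income = c(2500, 1800, NA)
)

# Validation rules are ordinary R expressions that evaluate to logical values.
rules <- validator(
  age >= 0,
  !is.na(income)
)

# Confront the data with the rules; the result is an object that can be
# summarized or inspected further.
out <- confront(dat, rules)

summary(out)        # one row per rule, with counts of passes and fails
v <- values(out)    # record-by-rule logical results
```

Because the rules live in their own object, the same `rules` could be confronted with a later version of the data set without being redefined.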
