Automated Data Quality Validation in an End-to-End GNN Framework
Sijie Dong, Soror Sahri, Themis Palpanas, Qitong Wang

TL;DR
This paper introduces DQuag, an end-to-end framework that uses an improved GNN and multi-task learning to automatically validate and repair data quality issues in tabular datasets. The authors report that it outperforms existing validation methods.
Contribution
The paper presents a novel GNN-based framework that automatically detects complex hidden data errors and recommends repairs without manual constraint specification.
Findings
Effective detection of explicit and hidden data errors.
Automatic data repair recommendations.
Outperforms existing validation methods.
Abstract
Ensuring data quality is crucial in modern data ecosystems, especially for training or testing datasets in machine learning. Existing validation approaches rely on computing data quality metrics and/or using expert-defined constraints. Although there are automated constraint generation methods, they are often incomplete and may be too strict or too soft, causing false positives or missed errors, thus requiring expert adjustment. These methods may also fail to detect subtle data inconsistencies hidden by complex interdependencies within the data. In this paper, we propose DQuag, an end-to-end data quality validation and repair framework based on an improved Graph Neural Network (GNN) and multi-task learning. The proposed method incorporates a dual-decoder design: one for data quality validation and the other for data repair. Our approach captures complex feature relationships within…
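The dual-decoder idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract is truncated here and does not specify the layers, the graph construction, or the losses. Everything below is an assumption made for illustration, including the names `init_params`, `forward`, `multitask_loss`, and `error_scores`, the single feature-mixing step that stands in for GNN message passing, the use of reconstruction error as the validation signal, and the weighting parameter `alpha`.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_params(n_features, hidden):
    # Hypothetical parameter layout: one shared encoder, two decoder heads.
    return {
        "W_enc": rng.normal(0.0, 0.1, (n_features, hidden)),
        "W_val": rng.normal(0.0, 0.1, (hidden, n_features)),  # validation head
        "W_rep": rng.normal(0.0, 0.1, (hidden, n_features)),  # repair head
    }

def forward(params, x, adj):
    # x: (n_rows, n_features) table; adj: (n_features, n_features) feature graph.
    # Mixing each feature with its graph neighbors is a crude stand-in for
    # GNN message passing (assumption, not the paper's layer).
    mixed = x @ adj.T
    h = np.tanh(mixed @ params["W_enc"])   # shared representation
    recon = h @ params["W_val"]            # validation decoder output
    repair = h @ params["W_rep"]           # repair decoder output
    return recon, repair

def multitask_loss(params, x_observed, x_clean, adj, alpha=0.5):
    # Assumed multi-task objective: a weighted sum of the two heads' losses.
    recon, repair = forward(params, x_observed, adj)
    val_loss = np.mean((recon - x_observed) ** 2)
    rep_loss = np.mean((repair - x_clean) ** 2)
    return alpha * val_loss + (1.0 - alpha) * rep_loss

def error_scores(params, x, adj):
    # Assumed validation signal: per-cell squared reconstruction error.
    recon, _ = forward(params, x, adj)
    return (recon - x) ** 2

params = init_params(n_features=4, hidden=8)
x = np.ones((3, 4))
adj = np.full((4, 4), 0.25)  # fully connected, row-normalized feature graph
print(multitask_loss(params, x, x, adj))
print(error_scores(params, x, adj).shape)
```

In a sketch like this, cells with high scores from `error_scores` would be flagged as suspect and the repair head's output at those cells would serve as the suggested replacement. How DQuag actually thresholds and repairs is not stated in the text available here.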
Taxonomy
Topics: Digital Radiography and Breast Imaging · Data Quality and Management
