Toward a view-based data cleaning architecture

Toshiyuki Shimizu; Hiroki Omori; Masatoshi Yoshikawa

arXiv:1910.11040·cs.DB·October 25, 2019

Open Access

TL;DR

This paper proposes a view-based data cleaning architecture that enables data managers to interactively browse and correct data, addressing challenges in cleaning complex, heterogeneous datasets requiring expert knowledge.

Contribution

The paper introduces an architecture in which data managers clean data interactively through views, with the aim of making cleaning more efficient for complex data sources.

Findings

01. Supports interactive data browsing and correction

02. Addresses cleaning of heterogeneous and expert-dependent data

03. Discusses remaining challenges in view-based cleaning

Abstract

Big data analysis has become an active area of study with the growth of machine learning techniques. To properly analyze data, it is important to maintain high-quality data. Thus, research on data cleaning is also important. It is difficult to automatically detect and correct inconsistent values for data requiring expert knowledge or data created by many contributors, such as integrated data from heterogeneous data sources. An example of such data is metadata for scientific datasets, which should be confirmed by data managers while handling the data. To support the efficient cleaning of data by data managers, we propose a data cleaning architecture in which data managers interactively browse and correct portions of data through views. In this paper, we explain our view-based data cleaning architecture and discuss some remaining issues.
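
The idea of browsing and correcting a portion of the data through a view can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how views are defined or how corrections propagate, so the `View` class, its `predicate` and `columns` fields, and the sample metadata records below are all hypothetical. The sketch only assumes that a view selects some rows and columns of the base data and that a correction made through the view is written back to the underlying record.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]


@dataclass
class View:
    """Hypothetical view: a row filter plus a column projection over base data."""
    base: List[Record]
    predicate: Callable[[Record], bool]
    columns: List[str]

    def rows(self) -> List[Tuple[int, Record]]:
        # Return (base_index, projected copy) pairs so that each displayed
        # row can be traced back to the base record it came from.
        return [
            (i, {c: r[c] for c in self.columns})
            for i, r in enumerate(self.base)
            if self.predicate(r)
        ]

    def correct(self, base_index: int, column: str, value: str) -> None:
        # Only allow corrections to data that is actually visible in the view.
        if column not in self.columns:
            raise KeyError(f"column {column!r} is not exposed by this view")
        if not self.predicate(self.base[base_index]):
            raise ValueError(f"row {base_index} is not part of this view")
        # Write the correction back to the base data (assumed propagation rule).
        self.base[base_index][column] = value


def demo():
    # Made-up dataset metadata with one inconsistent unit label.
    base = [
        {"id": "d1", "unit": "degC", "region": "Kyoto"},
        {"id": "d2", "unit": "celsius", "region": "Kyoto"},
        {"id": "d3", "unit": "degC", "region": "Osaka"},
    ]
    # The data manager browses only the Kyoto records, and only id and unit.
    view = View(base, lambda r: r["region"] == "Kyoto", ["id", "unit"])
    before = view.rows()
    view.correct(1, "unit", "degC")  # fix the inconsistent label via the view
    return before, view.rows(), base
```

In this sketch the view keeps the index of each base record, which is one simple way to make corrections unambiguous; a real system would likely need a more robust notion of provenance, particularly for views built from joins or aggregations.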

Taxonomy

Topics: Data Quality and Management · Privacy-Preserving Technologies in Data · Big Data Technologies and Applications