Improving Unstructured Data Quality via Updatable Extracted Views

Besat Kassaie; Frank Wm. Tompa

arXiv:2502.18221 · cs.DB · February 26, 2025

TL;DR

This paper presents a framework that uses rule-based information extraction to improve data quality in unstructured textual documents, verified through experiments on medical records.

Contribution

For a simple document update model, it identifies and verifies a set of sufficient conditions under which rule-based extraction programs qualify for inclusion in a data cleaning framework for unstructured documents.

Findings

01 Effective identification of data quality issues in medical records

02 Successful correction of data errors using the proposed framework

03 Practical applicability demonstrated in real-world scenarios

Abstract

Improving data quality in unstructured documents is a long-standing challenge. Unstructured data, especially in textual form, inherently lacks defined semantics, which poses significant challenges for effective processing and for ensuring data quality. We propose leveraging information extraction algorithms to design, apply, and explain data cleaning processes for documents. Specifically, for a simple document update model, we identify and verify a set of sufficient conditions for rule-based extraction programs to qualify for inclusion in our document cleaning framework. Through experiments conducted on medical records, we demonstrate that our approach provides an effective framework for identifying and correcting data quality problems, thereby highlighting its practical value in real-world applications.
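
To make the idea of an updatable extracted view concrete, here is a minimal sketch and not the authors' implementation. It assumes a single hypothetical regex rule (`DOSE_RULE`), an invented two-sentence sample document, and an update model in which a corrected value is written back at the character span recorded during extraction. All names in it are illustrative assumptions.

```python
import re
from dataclasses import dataclass


@dataclass
class Extracted:
    """One row of the extracted view, with the span it came from."""
    start: int   # offset of the value in the document
    end: int     # offset one past the end of the value
    value: str


# Hypothetical extraction rule: a dosage written as "Dose: <number> mg".
DOSE_RULE = re.compile(r"Dose:\s*(\d+)\s*mg")


def extract(doc: str) -> list[Extracted]:
    """Apply the rule and keep each value together with its source span."""
    return [
        Extracted(m.start(1), m.end(1), m.group(1))
        for m in DOSE_RULE.finditer(doc)
    ]


def update(doc: str, row: Extracted, new_value: str) -> str:
    """Write a corrected value back into the document at the recorded span."""
    if doc[row.start:row.end] != row.value:
        # The document changed since extraction; the span is no longer valid.
        raise ValueError("stale span: re-run extraction before updating")
    return doc[:row.start] + new_value + doc[row.end:]


# Invented sample text, for illustration only.
doc = "Patient A. Dose: 5000 mg daily. Patient B. Dose: 50 mg daily."
view = extract(doc)

# Correct the suspicious first value through the view, then re-extract.
cleaned = update(doc, view[0], "50")
cleaned_view = extract(cleaned)
```

In this sketch the correction is made against the view, and re-running the same rule on the cleaned document yields a view that reflects the change. Spans recorded before an update can shift afterwards, which is why the view is re-extracted rather than reused.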

Taxonomy

Topics: Data Quality and Management · Data Management and Algorithms · Advanced Database Systems and Queries