QUIP: Query-driven Missing Value Imputation

Yiming Lin; Sharad Mehrotra

arXiv:2204.00108 · cs.DB · April 6, 2022

TL;DR

QUIP imputes missing values at query time, filling in only the values needed to answer the query and limiting query processing overhead, which makes it significantly more efficient than traditional offline imputation.

Contribution

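The paper presents QUIP, a query-driven approach that imputes only the minimal set of missing values needed to answer a query. It combines a new outer join implementation that preserves missing values, a Bloom filter based index structure, and a cost-based decision function that tells each operator whether to impute now or delay.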

Findings

01. Outperforms ImputeDB by 2 to 10 times on various datasets.

02. Achieves order-of-magnitude improvements over offline imputation.

03. Effectively reduces missing data handling costs during query processing.

Abstract

Missing values are widespread in real-world data sets, and failing to clean them can degrade the quality of query answers. Traditionally, missing value imputation has been studied as an offline process that is part of preparing data for analysis. This paper studies query-time missing value imputation and proposes QUIP, which imputes only the minimal set of missing values needed to answer the query. Specifically, taking a reasonably good query plan as input, QUIP tries to minimize both the missing value imputation cost and the query processing overhead. QUIP proposes a new implementation of outer join that preserves missing values during query processing, and a Bloom filter based index structure that reduces space and runtime overhead. QUIP also designs a cost-based decision function that automatically guides each operator to either impute missing values immediately or delay imputation. Efficient…
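The intuition behind delaying imputation can be illustrated with a small sketch. This is not QUIP's algorithm: the row format, the function names (`mean_fill`, `eager_query`, `lazy_query`), and the use of mean imputation are all assumptions made for illustration, and the cost-based decision function described in the abstract is not modeled. The sketch only shows that evaluating a predicate on a complete attribute first can reduce how many missing values have to be imputed, while returning the same answer.

```python
# Hypothetical toy table: `age` may be missing (None), `city` is always present.
ROWS = [
    {"id": 1, "city": "Irvine", "age": 42},
    {"id": 2, "city": "Irvine", "age": None},
    {"id": 3, "city": "Tustin", "age": None},
    {"id": 4, "city": "Tustin", "age": 25},
    {"id": 5, "city": "Irvine", "age": 28},
    {"id": 6, "city": "Tustin", "age": None},
]


def mean_fill(rows, column):
    """Stand-in imputation method (an assumption): mean of the known values."""
    known = [r[column] for r in rows if r[column] is not None]
    return sum(known) / len(known)


def eager_query(rows, city, min_age):
    """Offline-style baseline: impute every missing age, then filter."""
    fill = mean_fill(rows, "age")
    imputed = 0
    result = []
    for r in rows:
        age = r["age"]
        if age is None:
            age = fill
            imputed += 1
        if r["city"] == city and age > min_age:
            result.append(r["id"])
    return result, imputed


def lazy_query(rows, city, min_age):
    """Query-time variant: filter on the complete attribute first and
    impute age only for rows that survive that filter."""
    fill = mean_fill(rows, "age")
    imputed = 0
    result = []
    for r in rows:
        if r["city"] != city:
            continue  # row is pruned, so its missing age never needs imputing
        age = r["age"]
        if age is None:
            age = fill
            imputed += 1
        if age > min_age:
            result.append(r["id"])
    return result, imputed


eager_ids, eager_cost = eager_query(ROWS, "Irvine", 30)
lazy_ids, lazy_cost = lazy_query(ROWS, "Irvine", 30)
print(eager_ids, eager_cost)
print(lazy_ids, lazy_cost)
```

On this toy table both variants return the same row ids, but the eager variant imputes three values and the lazy one imputes one. A real system would also have to weigh the cost of carrying unimputed tuples through later operators, which is the trade-off the abstract attributes to QUIP's decision function.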

Taxonomy

Topics: Data Management and Algorithms · Bayesian Modeling and Causal Inference · Traffic Prediction and Management Techniques