Improving Data Cleaning Using Discrete Optimization
Kenneth Smith, Sharlee Climer

TL;DR
This paper introduces optimized algorithms for data cleaning that efficiently remove missing data, outperforming traditional deletion methods by balancing data retention and runtime, especially in complex biological datasets.
Contribution
The paper reformulates existing integer programming models into more efficient linear and reduced-variable forms, enabling faster and more scalable partial deletion of missing data.
Findings
The proposed algorithms outperform existing deletion techniques across multiple missingness levels.
The greedy algorithm retains the most valid data in most scenarios.
Reformulated models reduce runtime and increase data retention in large datasets.
Abstract
One of the most important processing steps in any analysis pipeline is handling missing data. Traditional approaches simply delete any sample or feature with missing elements. Recent imputation methods replace missing data based on assumed relationships between observed data and the missing elements. However, there is a largely under-explored alternative between these extremes. Partial deletion approaches remove excessive amounts of missing data, as defined by the user. They can be used in place of traditional deletion or as a precursor to imputation. In this manuscript, we expand upon the Mr. Clean suite of algorithms, focusing on the scenario where all missing data is removed. We show that the RowCol Integer Program can be recast as a Linear Program, thereby reducing runtime. Additionally, the Element Integer Program can be reformulated to reduce the number of variables and allow for…
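To make the contrast between traditional deletion and row-and-column deletion concrete, here is a minimal Python sketch. It is not the Mr. Clean greedy algorithm or either integer program from the paper; it is a simple illustrative heuristic that repeatedly drops whichever remaining row or column has the highest fraction of missing entries until none remain. The function names (`listwise_rows`, `greedy_clean`) and the toy matrix are assumptions made for this example, with `None` standing in for a missing value.

```python
from typing import List, Optional, Set, Tuple

# Hypothetical representation: a rectangular list of lists, None = missing.
Matrix = List[List[Optional[float]]]


def listwise_rows(data: Matrix) -> Tuple[Set[int], Set[int]]:
    """Traditional deletion: keep only rows with no missing value."""
    rows = {i for i, row in enumerate(data) if all(v is not None for v in row)}
    cols = set(range(len(data[0]))) if data else set()
    return rows, cols


def greedy_clean(data: Matrix) -> Tuple[Set[int], Set[int]]:
    """Illustrative heuristic (an assumption, not the paper's method):
    drop the row or column with the largest fraction of missing entries
    until the retained submatrix has no missing values."""
    rows = set(range(len(data)))
    cols = set(range(len(data[0]))) if data else set()
    while rows and cols:
        # Count missing entries within the currently retained submatrix.
        row_miss = {i: sum(data[i][j] is None for j in cols) for i in rows}
        col_miss = {j: sum(data[i][j] is None for i in rows) for j in cols}
        if max(row_miss.values()) == 0:
            break  # nothing missing remains
        # Ties are broken toward the lowest index for determinism.
        worst_row = max(rows, key=lambda i: (row_miss[i] / len(cols), -i))
        worst_col = max(cols, key=lambda j: (col_miss[j] / len(rows), -j))
        row_frac = row_miss[worst_row] / len(cols)
        col_frac = col_miss[worst_col] / len(rows)
        if row_frac >= col_frac:
            rows.remove(worst_row)
        else:
            cols.remove(worst_col)
    return rows, cols


# Toy data: column 1 is half missing, row 3 has one missing entry.
toy: Matrix = [
    [1.0, None, 3.0],
    [4.0, None, 6.0],
    [7.0, 8.0, 9.0],
    [None, 11.0, 12.0],
]

lw_rows, lw_cols = listwise_rows(toy)
g_rows, g_cols = greedy_clean(toy)
lw_cells = len(lw_rows) * len(lw_cols)  # valid cells kept by row deletion
g_cells = len(g_rows) * len(g_cols)     # valid cells kept by the heuristic
```

On this toy matrix, deleting every incomplete row keeps a single row, while the heuristic drops one column and one row and keeps a larger complete block. A greedy choice like this carries no optimality guarantee, which is the gap that exact formulations such as the integer and linear programs mentioned in the abstract are meant to address.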
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Data Quality and Management · Big Data and Business Intelligence
