Missing Data Infill with Automunge
Nicholas J. Teague

TL;DR
This paper introduces Automunge's open source library for missing data imputation in tabular data, demonstrating that ML-based infill often outperforms traditional methods and recommending its default use with support columns for improved downstream model performance.
Contribution
The paper presents an auto ML-based missing data imputation library integrated into Automunge, offering a production-friendly, consistent, and easy-to-use solution for tabular data preprocessing.
Findings
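On the paper's benchmark sets, ML infill often outperformed traditional imputation methods for both numeric and categoric target features, and was otherwise within the noise of the alternatives.
Adding support columns of boolean integer markers that signal where infill was applied usually improved downstream model performance.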
Automunge provides a comprehensive, integrated platform for data preprocessing.
Abstract
Missing data is a fundamental obstacle in the practice of data science. This paper surveys a few conventions for imputation as available in the Automunge open source Python library platform for tabular data preprocessing, including "ML infill" in which auto ML models are trained for target features from partitioned extracts of a training set. A series of validation experiments were performed to benchmark imputation scenarios towards downstream model performance, in which it was found for the given benchmark sets that in many cases ML infill outperformed for both numeric and categoric target features, and was otherwise at minimum within noise distributions of the other imputation scenarios. Evidence also suggested supplementing ML infill with the addition of support columns with boolean integer markers signaling presence of infill was usually beneficial to downstream model performance.…
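The two ideas in the abstract can be illustrated with a minimal sketch. This is not Automunge's API or implementation: it assumes a single numeric predictor, uses a closed-form least-squares line as a stand-in for the auto ML models, and the names `ml_infill`, `fit_line`, `is_missing`, and the `_infilled` column suffix are hypothetical. The model is fit only on the partition of rows where the target is present, then predicts the missing entries, and a support column of 0/1 integers records which rows received infill.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing entries."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def fit_line(xs, ys):
    """Closed-form least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x if var_x else 0.0  # constant feature: predict the mean
    return slope, mean_y - slope * mean_x


def ml_infill(rows, target, feature):
    """Return copies of rows with `target` imputed from `feature`.

    Hypothetical helper: the model is trained on the rows where the target
    is present, and a `<target>_infilled` column of 0/1 integers marks
    which entries were imputed.
    """
    known = [r for r in rows if not is_missing(r[target])]
    if not known:
        raise ValueError("no rows with a known target to train on")
    slope, intercept = fit_line(
        [r[feature] for r in known], [r[target] for r in known]
    )
    marker = target + "_infilled"  # assumed naming, for illustration only
    filled = []
    for r in rows:
        out = dict(r)  # copy so the input rows are left unchanged
        missing = is_missing(r[target])
        out[marker] = int(missing)
        if missing:
            out[target] = slope * r[feature] + intercept
        filled.append(out)
    return filled


# Toy data, invented for illustration: one missing price.
train = [
    {"sqft": 1.0, "price": 2.0},
    {"sqft": 2.0, "price": 4.0},
    {"sqft": 3.0, "price": None},
    {"sqft": 4.0, "price": 8.0},
]
result = ml_infill(train, target="price", feature="sqft")
for row in result:
    print(row)
```

Keeping the marker column alongside the imputed value lets a downstream model distinguish observed entries from predicted ones, which is the role the abstract attributes to the support columns.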
Taxonomy
Topics: Machine Learning and Data Classification · Data Mining Algorithms and Applications · Explainable Artificial Intelligence (XAI)
