bigMICE: Multiple Imputation of Big Data
Hugo Morvan, Jonas Agholme, Bjorn Eliasson, Katarina Olofsson, Ludger Grote, Fredrik Iredahl, Oleg Sysoev

TL;DR
bigMICE adapts the MICE imputation method for large datasets using Apache Spark, enabling efficient, memory-conscious multiple imputation on big data, demonstrated on Swedish medical registries.
Contribution
Introduces bigMICE, a scalable implementation of MICE for big data using Spark, with memory control and improved speed over traditional methods.
Findings
More memory efficient and faster than standard MICE implementations.
High-quality imputations achievable with large datasets despite high missingness.
Effective on large Swedish medical registry data.
Abstract
Missing data is a prevalent issue in many applications, including large medical registries such as the Swedish Healthcare Quality Registries, and can lead to biased or inefficient analyses if not handled properly. Multiple Imputation by Chained Equations (MICE) is a popular and versatile method for handling multivariate missing data, but traditional implementations face significant challenges when applied to big data sets because of computational time and memory limitations. To address this, the bigMICE package was developed, adapting the MICE framework to big data using Apache Spark MLlib and Spark ML. Our implementation allows the maximum memory usage to be controlled during execution, enabling processing of very large data sets on hardware with limited memory, such as ordinary laptops. The developed package was tested on a large Swedish medical registry to measure memory…
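To make the chained-equations idea concrete, the following is a minimal sketch in plain Python. It is not bigMICE's Spark-based implementation: the function names (`solve`, `fit_ols`, `impute_once`, `mice`) are hypothetical, every column is modelled with ordinary linear regression plus Gaussian noise, a tiny ridge term is added only to keep the toy solver stable, and the parameter-uncertainty draws of a full MICE procedure are omitted. Each variable with missing values is regressed in turn on all the others, its missing cells are redrawn from that model, and the whole cycle is repeated to produce several imputed copies of the data.

```python
import random

def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def fit_ols(X, y):
    """Least-squares coefficients from the normal equations (tiny ridge: an assumption for stability)."""
    p = len(X[0])
    XtX = [[sum(r[a] * r[b] for r in X) + (1e-8 if a == b else 0.0)
            for b in range(p)] for a in range(p)]
    Xty = [sum(r[a] * t for r, t in zip(X, y)) for a in range(p)]
    return solve(XtX, Xty)

def impute_once(data, n_iter, rng):
    """Return one completed copy of data, where None marks a missing value."""
    n, p = len(data), len(data[0])
    missing = [[v is None for v in row] for row in data]
    filled = [row[:] for row in data]
    # Step 1: start from a simple guess, the observed mean of each column.
    for j in range(p):
        observed = [row[j] for row in data if row[j] is not None]
        mean_j = sum(observed) / len(observed)
        for i in range(n):
            if missing[i][j]:
                filled[i][j] = mean_j
    # Step 2: cycle through the columns, re-imputing each from all the others.
    for _ in range(n_iter):
        for j in range(p):
            if not any(missing[i][j] for i in range(n)):
                continue
            design = [[1.0] + row[:j] + row[j + 1:] for row in filled]
            obs = [i for i in range(n) if not missing[i][j]]
            beta = fit_ols([design[i] for i in obs], [filled[i][j] for i in obs])

            def predict(x):
                return sum(b * v for b, v in zip(beta, x))

            rss = sum((filled[i][j] - predict(design[i])) ** 2 for i in obs)
            sigma = (rss / max(len(obs) - len(beta), 1)) ** 0.5
            for i in range(n):
                if missing[i][j]:
                    # Stochastic draw: prediction plus residual-scale noise.
                    filled[i][j] = predict(design[i]) + rng.gauss(0.0, sigma)
    return filled

def mice(data, m=5, n_iter=10, seed=0):
    """Produce m imputed data sets; analyses are then run on each and pooled."""
    rng = random.Random(seed)
    return [impute_once(data, n_iter, rng) for _ in range(m)]
```

A call such as `mice(rows, m=5)` returns five completed copies of `rows`. This toy version holds the whole table in memory as Python lists, which is exactly the limitation that a distributed, memory-bounded implementation is meant to avoid.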
Taxonomy
Topics: Statistical Methods and Bayesian Inference · Machine Learning in Healthcare · Data Analysis with R
