bigMICE: Multiple Imputation of Big Data

Hugo Morvan; Jonas Agholme; Bjorn Eliasson; Katarina Olofsson; Ludger Grote; Fredrik Iredahl; Oleg Sysoev

arXiv:2601.21613 · stat.CO · January 30, 2026

Open Access

TL;DR

bigMICE adapts the MICE imputation method for large datasets using Apache Spark, enabling efficient, memory-conscious multiple imputation on big data, demonstrated on Swedish medical registries.

Contribution

Introduces bigMICE, a scalable implementation of MICE for big data using Spark, with memory control and improved speed over traditional methods.

Findings

01 More memory efficient and faster than standard MICE implementations.

02 High-quality imputations achievable with large datasets despite high missingness.

03 Effective on large Swedish medical registry data.

Abstract

Missing data is a prevalent issue in many applications, including large medical registries such as the Swedish Healthcare Quality Registries, and can lead to biased or inefficient analyses if not handled properly. Multiple Imputation by Chained Equations (MICE) is a popular and versatile method for handling multivariate missing data, but traditional implementations face significant challenges when applied to big data sets because of computational time and memory limitations. To address this, the bigMICE package was developed, adapting the MICE framework to big data using Apache Spark MLlib and Spark ML. Our implementation allows the maximum memory usage to be controlled during execution, enabling very large data sets to be processed on hardware with limited memory, such as ordinary laptops. The developed package was tested on a large Swedish medical registry to measure memory…
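For readers unfamiliar with the chained-equations idea the abstract refers to, the following is a minimal pure-Python sketch. It is not the bigMICE implementation and uses no Spark; the function names (`fit_line`, `mice_once`, `mice`), the two-column toy data, and the choice of simple linear regression are all assumptions made for illustration. A full MICE procedure would also draw the regression parameters from their posterior, which this sketch omits for brevity: it only adds residual noise to each prediction.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b, residual_sd)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    a = my - b * mx
    resid = [y - (a + b * x) for x, y in zip(xs, ys)]
    dof = max(len(xs) - 2, 1)
    sd = (sum(r * r for r in resid) / dof) ** 0.5
    return a, b, sd


def mice_once(rows, n_iter=10, rng=None):
    """Produce one completed copy of two-column data; None marks a missing value."""
    rng = rng or random.Random()
    n_cols = 2
    missing = [[r[j] is None for r in rows] for j in range(n_cols)]
    data = [list(r) for r in rows]

    # Step 1: initialise every missing cell with the observed column mean.
    for j in range(n_cols):
        mean_j = statistics.fmean(r[j] for r in rows if r[j] is not None)
        for i, row in enumerate(data):
            if missing[j][i]:
                row[j] = mean_j

    # Step 2: cycle through the columns, regressing each one on the other
    # and redrawing its missing cells as prediction + residual noise.
    for _ in range(n_iter):
        for j in range(n_cols):
            k = 1 - j  # index of the predictor column
            obs = [i for i in range(len(data)) if not missing[j][i]]
            a, b, sd = fit_line([data[i][k] for i in obs],
                                [data[i][j] for i in obs])
            for i in range(len(data)):
                if missing[j][i]:
                    data[i][j] = a + b * data[i][k] + rng.gauss(0.0, sd)
    return data


def mice(rows, m=5, n_iter=10, seed=0):
    """Return m completed data sets (the 'multiple' in multiple imputation)."""
    rng = random.Random(seed)
    return [mice_once(rows, n_iter, rng) for _ in range(m)]


# Toy data (hypothetical): two correlated columns with some cells missing.
rows = [(1.0, 2.1), (2.0, None), (3.0, 6.2), (None, 8.1), (5.0, 9.8), (6.0, None)]
completed = mice(rows, m=5)

# Each completed data set is analysed separately; here the per-data-set
# mean of the second column is averaged into a single pooled estimate.
pooled_mean_y = statistics.fmean(
    statistics.fmean(row[1] for row in ds) for ds in completed
)
```

Every step here holds the whole data set in local memory, which is the limitation the abstract attributes to traditional implementations on big data.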

Taxonomy

Topics: Statistical Methods and Bayesian Inference · Machine Learning in Healthcare · Data Analysis with R