Missing Data Imputation using Optimal Transport

Boris Muzellec; Julie Josse; Claire Boyer; Marco Cuturi

arXiv:2002.03860 · stat.ML · July 2, 2020 · 41 cites

TL;DR

This paper introduces a data imputation method that uses optimal transport (OT) distances as a loss function. On UCI datasets it matches or outperforms state-of-the-art imputation methods in MCAR, MAR and MNAR settings, even when a high percentage of values is missing.

Contribution

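It proposes an optimal transport-based loss function for missing data imputation, built on the assumption that two batches drawn at random from the same dataset should share the same distribution. It also proposes practical end-to-end learning methods to minimize that loss, with or without parametric assumptions on the underlying distributions of values.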

Findings

01. OT-based methods match or outperform state-of-the-art imputation techniques on UCI datasets

02. Effective in MCAR (missing completely at random), MAR (missing at random), and MNAR (missing not at random) settings

03. Results hold even with high percentages of missing values

Abstract

Missing data is a crucial issue when applying machine learning algorithms to real-world datasets. Starting from the simple assumption that two batches extracted randomly from the same dataset should share the same distribution, we leverage optimal transport distances to quantify that criterion and turn it into a loss function to impute missing data values. We propose practical methods to minimize these losses using end-to-end learning, which may or may not exploit parametric assumptions on the underlying distributions of values. We evaluate our methods on datasets from the UCI repository, in MCAR, MAR and MNAR settings. These experiments show that OT-based methods match or outperform state-of-the-art imputation methods, even for high percentages of missing values.
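
The two-batch criterion described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes a Sinkhorn divergence with a squared Euclidean cost as the OT distance (the abstract does not say which distance is used), and the names `entropic_ot`, `sinkhorn_divergence`, `two_batch_loss`, `eps` and `n_iter` are hypothetical. The sketch only evaluates the loss on an already filled-in data matrix. Minimizing it with respect to the imputed entries, as the end-to-end learning in the abstract suggests, would additionally need an automatic differentiation framework.

```python
import numpy as np


def _logsumexp(z, axis):
    # Numerically stable log-sum-exp along one axis.
    m = z.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(z - m).sum(axis=axis, keepdims=True))).squeeze(axis)


def entropic_ot(x, y, eps=0.5, n_iter=200):
    """Entropy-regularised OT cost between two uniformly weighted point clouds.

    Assumption: cost is half the squared Euclidean distance; the value
    returned is the dual objective after `n_iter` log-domain Sinkhorn updates.
    """
    n, m = len(x), len(y)
    cost = 0.5 * ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    log_a = np.full(n, -np.log(n))  # uniform weights on x
    log_b = np.full(m, -np.log(m))  # uniform weights on y
    f, g = np.zeros(n), np.zeros(m)  # dual potentials
    for _ in range(n_iter):
        f = -eps * _logsumexp(log_b[None, :] + (g[None, :] - cost) / eps, axis=1)
        g = -eps * _logsumexp(log_a[:, None] + (f[:, None] - cost) / eps, axis=0)
    return f.mean() + g.mean()


def sinkhorn_divergence(x, y, eps=0.5, n_iter=200):
    """Debiased entropic OT: OT(x, y) - OT(x, x) / 2 - OT(y, y) / 2."""
    return (entropic_ot(x, y, eps, n_iter)
            - 0.5 * entropic_ot(x, x, eps, n_iter)
            - 0.5 * entropic_ot(y, y, eps, n_iter))


def two_batch_loss(data, batch_size, rng, eps=0.5):
    """Draw two disjoint random batches of rows from `data` and compare them."""
    idx = rng.permutation(len(data))
    first, second = idx[:batch_size], idx[batch_size:2 * batch_size]
    return sinkhorn_divergence(data[first], data[second], eps)


# Toy usage on synthetic data (hypothetical, not from the paper):
# hide 30% of one column, fill the gaps with the observed column mean,
# and evaluate the two-batch loss on the filled-in matrix.
rng = np.random.default_rng(0)
complete = rng.normal(size=(200, 2))
mask = rng.random(200) < 0.3                 # True where column 1 is missing
filled = complete.copy()
filled[mask, 1] = complete[~mask, 1].mean()  # naive mean imputation
loss = two_batch_loss(filled, batch_size=64, rng=rng)
```

In a gradient-based version, the entries of `filled` selected by `mask` would be the trainable parameters, and `two_batch_loss` would be re-evaluated on fresh random batches at each step.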

Code & Models

Repositories

BorisMuzellec/MissingDataOT
PyTorch · Official

Videos

Missing Data Imputation using Optimal Transport · SlidesLive

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Privacy-Preserving Technologies in Data · Stochastic Gradient Optimization Techniques