Classification of datasets with imputed missing values: does imputation quality matter?

Tolou Shadbahr; Michael Roberts; Jan Stanczuk; Julian Gilbey; Philip Teare; Sören Dittmer; Matthew Thorpe; Ramon Vinas Torne; Evis Sala; Pietro Lio; Mishal Patel; AIX-COVNET Collaboration; James H.F. Rudd; Tuomas Mirtti; Antti Rannikko; John A.D. Aston; Jing Tang; Carola-Bibiane Schönlieb

arXiv:2206.08478·cs.LG·December 20, 2023

TL;DR

This paper investigates the impact of imputation quality on classification performance in incomplete datasets, revealing flaws in current assessment methods and proposing new discrepancy scores to better evaluate imputation methods.

Contribution

It introduces a novel class of discrepancy scores for assessing imputation quality based on data distribution recreation, emphasizing the importance of imputation quality for classifier interpretability.

Findings

01 Current quality measures are flawed

02 Proposed discrepancy scores better evaluate imputation

03 Poor imputation impairs classifier interpretability

Abstract

Classifying samples in incomplete datasets is a common aim for machine learning practitioners, but is non-trivial. Missing data is found in most real-world datasets and these missing values are typically imputed using established methods, followed by classification of the now complete, imputed, samples. The focus of the machine learning researcher is then to optimise the downstream classification performance. In this study, we highlight that it is imperative to consider the quality of the imputation. We demonstrate how the commonly used measures for assessing quality are flawed and propose a new class of discrepancy scores which focus on how well the method recreates the overall distribution of the data. To conclude, we highlight the compromised interpretability of classifier models trained using poorly imputed data.
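The abstract's point that pointwise quality measures can mislead can be illustrated with a toy example. The sketch below is not the discrepancy score proposed in the paper. It assumes a single synthetic Gaussian feature, uses RMSE as the stand-in for a commonly used measure, and uses a one-dimensional Wasserstein distance as a stand-in for a distribution-level score. The two imputers (fill with the observed mean, or draw a random observed value) and all function names are hypothetical and chosen only for illustration.

```python
import random
import statistics


def rmse(truth, imputed):
    """Root-mean-square error between paired true and imputed values."""
    return (sum((t - i) ** 2 for t, i in zip(truth, imputed)) / len(truth)) ** 0.5


def wasserstein_1d(a, b):
    """Wasserstein-1 distance between two equal-sized 1-D samples.

    For equal-sized samples this is the mean absolute difference
    between the sorted values.
    """
    if len(a) != len(b):
        raise ValueError("samples must have the same size")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


rng = random.Random(0)

# Assumption: one standard-normal feature, values missing completely at random.
truth = [rng.gauss(0.0, 1.0) for _ in range(5000)]     # hidden true values
observed = [rng.gauss(0.0, 1.0) for _ in range(5000)]  # values that were not masked

# Hypothetical imputer A: fill every gap with the observed mean.
mean_imputed = [statistics.fmean(observed)] * len(truth)

# Hypothetical imputer B: fill every gap with a random observed value.
draw_imputed = [rng.choice(observed) for _ in truth]

scores = {
    "mean": (rmse(truth, mean_imputed), wasserstein_1d(truth, mean_imputed)),
    "draw": (rmse(truth, draw_imputed), wasserstein_1d(truth, draw_imputed)),
}

for name, (pointwise, distributional) in scores.items():
    print(f"{name}: RMSE={pointwise:.3f}  Wasserstein={distributional:.3f}")
```

In this setup, mean imputation gets the lower RMSE even though it collapses the feature to a single value, while random draws get the higher RMSE but reproduce the distribution far more closely. A measure that rewards the first imputer says little about whether the imputed data resemble the real data.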
