Chains of Autoreplicative Random Forests for missing value imputation in high-dimensional datasets

Ekaterina Antonenko; Jesse Read

arXiv:2301.00595·cs.LG·January 3, 2023

Open Access

TL;DR

This paper introduces Chains of Autoreplicative Random Forests, a novel method for imputing missing values in high-dimensional datasets like SNP data, outperforming existing algorithms without needing extra information.

Contribution

The paper proposes a new multi-label Random Forest-based approach for missing value imputation, effective in low-sample, high-dimensional scenarios, especially for SNP datasets.
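To make the idea of Random Forest-based imputation concrete, the following is a minimal sketch of a simpler, per-feature variant: each column with gaps is predicted from all the other columns. It is not the paper's chained multi-label method. The `MISSING` marker, the function name `impute_per_feature`, the toy 0/1/2 genotype coding, and the use of scikit-learn are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

MISSING = -1  # hypothetical marker for a missing categorical value


def impute_per_feature(X, n_estimators=50, random_state=0):
    """Fill MISSING entries column by column with a Random Forest.

    Simplified illustration only; not the chained multi-label method
    proposed in the paper.
    """
    X = np.asarray(X)
    X_imputed = X.copy()
    for j in range(X.shape[1]):
        missing_rows = X[:, j] == MISSING
        if not missing_rows.any() or missing_rows.all():
            continue  # nothing to impute, or nothing to learn from
        # Predictors are all other columns; MISSING is kept as its own category.
        others = np.delete(X, j, axis=1)
        forest = RandomForestClassifier(
            n_estimators=n_estimators, random_state=random_state
        )
        forest.fit(others[~missing_rows], X[~missing_rows, j])
        X_imputed[missing_rows, j] = forest.predict(others[missing_rows])
    return X_imputed


# Toy SNP-like matrix (assumed coding: genotypes 0, 1, 2) with ~10% of entries hidden.
rng = np.random.default_rng(0)
X_true = rng.integers(0, 3, size=(30, 8))
mask = rng.random(X_true.shape) < 0.1
X_obs = X_true.copy()
X_obs[mask] = MISSING

X_filled = impute_per_feature(X_obs)
```

Because every column is fitted only on the rows where it is observed, this sketch needs no complete cases, which is the constraint the paper highlights for low-sample, high-dimensional data. Observed entries are never overwritten.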

Findings

01 Effective imputation of missing SNP data

02 Outperforms standard algorithms in accuracy

03 Requires no additional dataset information

Abstract

Missing values are a common problem in data science and machine learning. Removing instances with missing values can adversely affect the quality of further data analysis. This is exacerbated when there are relatively many more features than instances, and thus the proportion of affected instances is high. Such a scenario is common in many important domains; for example, single nucleotide polymorphism (SNP) datasets provide a large number of features over a genome for a relatively small number of individuals. To preserve as much information as possible prior to modeling, a rigorous imputation scheme is acutely needed. While Denoising Autoencoders are a state-of-the-art method for imputation in high-dimensional data, they still require enough complete cases to train on, and these are often not available in real-world problems. In this paper, we consider missing value imputation as a…

Taxonomy

Topics: Cancer-related molecular mechanisms research · Gene expression and cancer classification · MicroRNA in disease regulation