Chains of Autoreplicative Random Forests for missing value imputation in high-dimensional datasets
Ekaterina Antonenko, Jesse Read

TL;DR
This paper introduces Chains of Autoreplicative Random Forests, a method for imputing missing values in high-dimensional datasets such as SNP data. It outperforms standard imputation algorithms in accuracy without needing any additional information about the dataset.
Contribution
The paper proposes a new multi-label Random Forest-based approach for missing value imputation, effective in low-sample, high-dimensional scenarios, especially for SNP datasets.
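To make "multi-label Random Forest-based imputation" concrete, here is a minimal sketch of one way such a scheme could look. It is not the authors' algorithm, whose details are not given in this summary. The assumptions are ours: the `-1` missing marker, the mode-based initial fill, the contiguous column blocks, and the function name `chained_rf_impute`. The only library feature relied on is that scikit-learn's `RandomForestClassifier` accepts a 2-D target and predicts all target columns jointly.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

MISSING = -1  # hypothetical marker for a missing entry (an assumption, not from the paper)


def chained_rf_impute(X, n_blocks=3, n_estimators=50, seed=0):
    """Impute MISSING entries of an integer-coded matrix, one column block at a time."""
    X = np.asarray(X)
    if n_blocks < 2:
        raise ValueError("need at least two blocks so each block has input columns")
    mask = X == MISSING
    filled = X.copy()

    # Step 1: crude initial fill with each column's most frequent observed value.
    for j in range(X.shape[1]):
        observed = X[~mask[:, j], j]
        if observed.size == 0:
            raise ValueError(f"column {j} has no observed values")
        values, counts = np.unique(observed, return_counts=True)
        filled[mask[:, j], j] = values[np.argmax(counts)]

    # Step 2: visit the blocks in order; later blocks see earlier blocks' updates.
    all_columns = np.arange(X.shape[1])
    for block in np.array_split(all_columns, n_blocks):
        if not mask[:, block].any():
            continue  # nothing to impute in this block (also skips empty blocks)
        others = np.setdiff1d(all_columns, block)
        forest = RandomForestClassifier(n_estimators=n_estimators, random_state=seed)
        # Multi-output fit: every column of the block is a label.
        forest.fit(filled[:, others], filled[:, block])
        pred = forest.predict(filled[:, others]).reshape(X.shape[0], -1)
        # Overwrite only the entries that were originally missing.
        updated = filled[:, block]
        updated[mask[:, block]] = pred[mask[:, block]]
        filled[:, block] = updated
    return filled
```

Observed entries are never modified, so the function only changes cells that carried the missing marker. Because it trains on the partially filled matrix itself, it needs no complete cases and no external reference data.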
Findings
Effective imputation of missing SNP data
Outperforms standard algorithms in accuracy
Requires no additional dataset information
Abstract
Missing values are a common problem in data science and machine learning. Removing instances with missing values can adversely affect the quality of further data analysis. This is exacerbated when there are many more features than instances, so that the proportion of affected instances is high. Such a scenario is common in many important domains; for example, single nucleotide polymorphism (SNP) datasets provide a large number of features over a genome for a relatively small number of individuals. To preserve as much information as possible prior to modeling, a rigorous imputation scheme is acutely needed. While Denoising Autoencoders are a state-of-the-art method for imputation in high-dimensional data, they still require enough complete cases to be trained on, which are often not available in real-world problems. In this paper, we consider missing value imputation as a…
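The abstract's point that many features make the proportion of affected instances high can be checked with a short calculation. The sketch below assumes, purely for illustration, that each entry is missing independently with the same probability. That assumption and the function name `complete_row_fraction` are ours, not the paper's.

```python
def complete_row_fraction(n_features, missing_rate):
    """Expected fraction of rows with no missing entry,
    assuming every entry is missing independently with probability missing_rate."""
    return (1.0 - missing_rate) ** n_features


# With 1% of entries missing, complete rows vanish as the feature count grows.
for n_features in (10, 100, 1000, 10000):
    print(n_features, complete_row_fraction(n_features, 0.01))
```

Under this assumption, roughly 37% of rows remain complete at 100 features, and fewer than one row in ten thousand at 1,000 features. This is why dropping incomplete rows is not workable for SNP-scale data.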
Taxonomy
Topics: Cancer-related molecular mechanisms research · Gene expression and cancer classification · MicroRNA in disease regulation
