A Robust Missing Value Imputation Method MifImpute For Incomplete Molecular Descriptor Data And Comparative Analysis With Other Missing Value Imputation Methods
Doreswamy, Chanabasayya M. Vastrad

TL;DR
This paper introduces MiFoImpute, a robust random forest-based method for imputing missing values in molecular descriptor data, demonstrating superior accuracy and efficiency over existing methods in pharmaceutical datasets.
Contribution
The paper presents MiFoImpute, a novel iterative imputation technique based on random forests that effectively handles high-dimensional molecular data with missing values.
Findings
MiFoImpute outperforms ten benchmark imputation methods in accuracy.
It maintains robustness across missing data rates from 10% to 30%.
The method is computationally efficient for high-dimensional datasets.
Abstract
Missing data imputation is an important research topic in data mining. Large-scale molecular descriptor data may contain missing values (MVs). However, some methods for downstream analyses, including some prediction tools, require a complete descriptor data matrix. We propose and evaluate an iterative imputation method, MiFoImpute, based on a random forest. By averaging over many unpruned regression trees, random forest intrinsically constitutes a multiple imputation scheme. Using the NRMSE and NMAE estimates of random forest, we are able to estimate the imputation error. Evaluation is performed on two molecular descriptor datasets generated from a diverse selection of pharmaceutical fields with artificially introduced missing values ranging from 10% to 30%. The experimental results demonstrate that missing values have a great impact on the effectiveness of imputation techniques and our…
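The evaluation protocol described in the abstract (hide known values at a 10% to 30% rate, impute them, then score with NRMSE and NMAE) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: the helper names are hypothetical, the imputer is a plain column-mean stand-in rather than MiFoImpute, the data are synthetic, and the normalisations (variance of the true values for NRMSE, range of the true values for NMAE) are common conventions that the paper may define differently.

```python
import math
import random


def mask_at_random(matrix, rate, seed=0):
    """Copy `matrix` and hide a fraction `rate` of its cells (set to None)."""
    rng = random.Random(seed)
    cells = [(i, j) for i, row in enumerate(matrix) for j in range(len(row))]
    masked = rng.sample(cells, round(rate * len(cells)))
    incomplete = [list(row) for row in matrix]
    for i, j in masked:
        incomplete[i][j] = None
    return incomplete, masked


def column_mean_impute(incomplete):
    """Stand-in imputer (NOT MiFoImpute): fill gaps with the column's observed mean."""
    filled = [list(row) for row in incomplete]
    for j in range(len(incomplete[0])):
        observed = [row[j] for row in incomplete if row[j] is not None]
        mean = sum(observed) / len(observed) if observed else 0.0
        for row in filled:
            if row[j] is None:
                row[j] = mean
    return filled


def nrmse(truth, imputed, masked):
    """RMSE over the hidden cells, normalised by the variance of their true values (assumed convention)."""
    true_vals = [truth[i][j] for i, j in masked]
    mse = sum((truth[i][j] - imputed[i][j]) ** 2 for i, j in masked) / len(masked)
    mean_true = sum(true_vals) / len(true_vals)
    var_true = sum((v - mean_true) ** 2 for v in true_vals) / len(true_vals)
    return math.sqrt(mse / var_true)


def nmae(truth, imputed, masked):
    """MAE over the hidden cells, normalised by the range of their true values (assumed convention)."""
    true_vals = [truth[i][j] for i, j in masked]
    mae = sum(abs(truth[i][j] - imputed[i][j]) for i, j in masked) / len(masked)
    return mae / (max(true_vals) - min(true_vals))


# Synthetic stand-in for a descriptor matrix: 50 molecules x 8 descriptors.
data_rng = random.Random(42)
data = [[data_rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(50)]

for rate in (0.1, 0.2, 0.3):
    incomplete, masked = mask_at_random(data, rate, seed=1)
    filled = column_mean_impute(incomplete)
    print(f"rate={rate:.0%}  NRMSE={nrmse(data, filled, masked):.3f}  "
          f"NMAE={nmae(data, filled, masked):.3f}")
```

Replacing `column_mean_impute` with any other imputer that takes the incomplete matrix and returns a filled one lets the same loop compare methods across missing rates.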
Taxonomy
Topics: Statistical Methods and Inference · Gene expression and cancer classification · SARS-CoV-2 detection and testing
