Missing Data Imputation for Galaxy Redshift Estimation
Kieran J. Luken, Rabina Padhy, X. Rosalind Wang

TL;DR
This paper evaluates simple imputation methods (Mean, Median, Minimum, Maximum, kNN) and more complex ones (MICE, GAIN) for handling randomly missing data in galaxy redshift estimation, and finds that MICE gives the lowest prediction error.
Contribution
It compares multiple imputation techniques for astronomical data and demonstrates that MICE yields the lowest prediction error in galaxy redshift estimation.
Findings
MICE achieves the lowest Root Mean Square Error.
GAIN performs better than simple methods but worse than MICE.
Imputation improves redshift prediction accuracy.
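For reference, the Root Mean Square Error mentioned in these findings has the standard definition below. The symbols are assumptions chosen for illustration and are not taken from the paper: $x_i$ is a true value (a masked feature value or a known redshift), $\hat{x}_i$ is the corresponding imputed or predicted value, and $N$ is the number of values compared.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{x}_i - x_i\right)^{2}}
```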
Abstract
Astronomical data is full of holes. While there are many reasons for this missing data, the data can be randomly missing, caused by things like data corruption or unfavourable observing conditions. We test some simple data imputation methods (Mean, Median, Minimum, Maximum and k-Nearest Neighbours (kNN)), as well as two more complex methods (Multivariate Imputation by Chained Equations (MICE) and Generative Adversarial Imputation Network (GAIN)), against data where increasing amounts are randomly set to missing. We then use the imputed datasets to estimate the redshift of the galaxies, using the kNN and Random Forest ML techniques. We find that the MICE algorithm provides the lowest Root Mean Square Error and consequently the lowest prediction error, with the GAIN algorithm the next best.
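The experiment described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' code. It assumes scikit-learn is available, uses a synthetic dataset in place of a real galaxy catalogue, and uses scikit-learn's `IterativeImputer` as a MICE-style stand-in. The Minimum, Maximum and GAIN imputers are left out for brevity, and the helper names `mask_at_random` and `rmse`, the 20% missing fraction, and all hyperparameters are hypothetical choices.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for a catalogue: 5 correlated features and a toy
# "redshift" target. This is NOT the paper's data.
n = 600
base = rng.normal(size=(n, 1))
X_true = base + 0.3 * rng.normal(size=(n, 5))
z = 0.5 + 0.2 * X_true.mean(axis=1) + 0.02 * rng.normal(size=n)


def mask_at_random(X, frac, rng):
    """Return a copy of X with roughly `frac` of its entries set to NaN."""
    X_miss = X.copy()
    mask = rng.random(X.shape) < frac
    X_miss[mask] = np.nan
    return X_miss, mask


def rmse(a, b):
    """Root mean square error between two equally shaped arrays."""
    return float(np.sqrt(np.mean((np.asarray(a) - np.asarray(b)) ** 2)))


# Randomly remove 20% of the feature values (assumed fraction).
X_miss, mask = mask_at_random(X_true, frac=0.2, rng=rng)

# Split once so every imputer sees the same train/test rows.
idx_tr, idx_te = train_test_split(np.arange(n), test_size=0.3, random_state=0)

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "knn": KNNImputer(n_neighbors=5),
    "mice_like": IterativeImputer(max_iter=10, random_state=0),
}

results = {}
for name, imputer in imputers.items():
    # Fit the imputer on training rows only, then fill in both splits.
    X_tr = imputer.fit_transform(X_miss[idx_tr])
    X_te = imputer.transform(X_miss[idx_te])

    # Imputation error: compare filled-in values with the hidden truth.
    te_mask = mask[idx_te]
    imputation_rmse = rmse(X_te[te_mask], X_true[idx_te][te_mask])

    # Downstream error: estimate the toy redshift from the imputed features.
    rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, z[idx_tr])
    knn = KNeighborsRegressor(n_neighbors=5).fit(X_tr, z[idx_tr])

    results[name] = {
        "imputation_rmse": imputation_rmse,
        "rf_redshift_rmse": rmse(rf.predict(X_te), z[idx_te]),
        "knn_redshift_rmse": rmse(knn.predict(X_te), z[idx_te]),
    }

for name, scores in results.items():
    print(name, {k: round(v, 4) for k, v in scores.items()})
```

Repeating the loop over several missing fractions would mimic the "increasing amounts" of missing data described above. Numbers from this synthetic setup say nothing about the paper's reported results.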
Taxonomy
Topics: Advanced Statistical Methods and Models · Galaxies: Formation, Evolution, Phenomena · Gaussian Processes and Bayesian Inference
