Predicting the Neutral Hydrogen Content of Galaxies From Optical Data Using Machine Learning
Mika Rafieferantsoa, Sambatra Andrianomena, Romeel Davé

TL;DR
This paper presents a machine learning framework trained on simulations to predict galaxy HI content from optical and environmental data, achieving high accuracy at low redshift and useful predictions for upcoming surveys.
Contribution
The study introduces a novel machine learning approach trained on cosmological simulations to accurately predict galaxy HI content using optical and environmental features.
Findings
Random forest achieves the highest correlation coefficient (>0.9) at z=0.
Prediction accuracy declines with increasing redshift, limiting utility beyond z~1.
The method performs well on real survey data, with RMSE around 0.28 when trained on RESOLVE data.
Abstract
We develop a machine learning-based framework to predict the HI content of galaxies using more straightforwardly observable quantities such as optical photometry and environmental parameters. We train the algorithm on z=0-2 outputs from the Mufasa cosmological hydrodynamic simulation, which includes star formation, feedback, and a heuristic model to quench massive galaxies that yields a reasonable match to a range of survey data including HI. We employ a variety of machine learning methods (regressors), and quantify their performance using the root mean square error (RMSE) and the Pearson correlation coefficient (r). Considering SDSS photometry, 3 nearest neighbor environment and line of sight peculiar velocities as features, we obtain r accuracy of the HI-richness prediction, corresponding to RMSE. Adding near-IR photometry to the features yields some…
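The two performance metrics named in the abstract, RMSE and the Pearson correlation coefficient r, can be illustrated with a minimal sketch. This is not the authors' code: the function names and the sample values are invented for illustration, and the predicted quantity is simply assumed to be a list of HI-richness values.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between true and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between true and predicted values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


# Hypothetical HI-richness values (invented, not taken from the paper).
true_hi = [-1.0, -0.5, 0.0, 0.5, 1.0]
pred_hi = [-0.8, -0.6, 0.1, 0.4, 0.9]

print("RMSE:", rmse(true_hi, pred_hi))
print("r:   ", pearson_r(true_hi, pred_hi))
```

A lower RMSE means predictions lie closer to the true values, while r near 1 means the predictions track the trend of the true values, so the two metrics are complementary rather than interchangeable.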
