TL;DR
This paper develops a machine learning classifier trained on SDSS data to accurately categorize 111 million celestial sources into galaxies, quasars, and stars, significantly expanding the catalog of classified objects without requiring spectra.
Contribution
The study introduces an optimized random forest model and a catalog of classifications, with per-source probabilities, for unlabelled SDSS sources, using SDSS and WISE photometry and transfer learning techniques.
Findings
Classified 111 million sources with a probability for each; 70% of galaxies, 34% of quasars, and 93% of stars have classification probabilities greater than 0.9.
Achieved strong agreement between UMAP visualizations and classifier labels.
Analyzed the impact of class imbalance and magnitude errors on classification performance.
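To make the class-imbalance point concrete, here is a minimal sketch of how per-class precision, recall, and F1 are computed from a confusion matrix. The counts are hypothetical toy values chosen for illustration, not results from the paper, and the helper name `per_class_metrics` is an assumption of this example. The rare class (quasars here) can show noticeably lower precision than the common classes even when its recall is reasonable, because a small share of misclassified majority-class sources outweighs its true positives.

```python
# Hypothetical confusion matrix: confusion[true_class][predicted_class].
# These counts are illustrative only and are NOT taken from the paper.
confusion = {
    "galaxy": {"galaxy": 900, "quasar": 20, "star": 80},
    "quasar": {"galaxy": 10, "quasar": 80, "star": 10},
    "star": {"galaxy": 40, "quasar": 20, "star": 940},
}


def per_class_metrics(confusion):
    """Return precision, recall, and F1 for each class."""
    classes = list(confusion)
    metrics = {}
    for cls in classes:
        true_pos = confusion[cls][cls]
        # False positives: other true classes predicted as cls.
        false_pos = sum(confusion[t][cls] for t in classes if t != cls)
        # False negatives: true cls predicted as something else.
        false_neg = sum(confusion[cls][p] for p in classes if p != cls)
        precision = true_pos / (true_pos + false_pos)
        recall = true_pos / (true_pos + false_neg)
        f1 = 2 * precision * recall / (precision + recall)
        metrics[cls] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


metrics = per_class_metrics(confusion)
for cls, m in metrics.items():
    print(f"{cls}: P={m['precision']:.3f} R={m['recall']:.3f} F1={m['f1']:.3f}")
```

In this toy matrix the quasar class has ten times fewer sources than the other two, and its precision drops to about 0.67 while galaxy and star precision stay above 0.9.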
Abstract
We used 3.1 million spectroscopically labelled sources from the Sloan Digital Sky Survey (SDSS) to train an optimised random forest classifier using photometry from the SDSS and the Widefield Infrared Survey Explorer (WISE). We applied this machine learning model to 111 million previously unlabelled sources from the SDSS photometric catalogue which did not have existing spectroscopic observations. Our new catalogue contains 50.4 million galaxies, 2.1 million quasars, and 58.8 million stars. We provide individual classification probabilities for each source, with 6.7 million galaxies (13%), 0.33 million quasars (15%), and 41.3 million stars (70%) having classification probabilities greater than 0.99; and 35.1 million galaxies (70%), 0.72 million quasars (34%), and 54.7 million stars (93%) having classification probabilities greater than 0.9. Precision, Recall, and F1 score were…
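The percentages quoted in the abstract can be checked against its counts with a few lines of arithmetic. The sketch below uses only the rounded figures given in the abstract (in millions); the variable and function names are assumptions of this example. Because the inputs are rounded, agreement to within about one percentage point is the most that can be expected.

```python
# Counts in millions, copied from the abstract (rounded values).
totals = {"galaxy": 50.4, "quasar": 2.1, "star": 58.8}
above_099 = {"galaxy": 6.7, "quasar": 0.33, "star": 41.3}
above_090 = {"galaxy": 35.1, "quasar": 0.72, "star": 54.7}


def percentages(above, totals):
    """Percentage of each class whose probability exceeds a threshold."""
    return {cls: 100.0 * above[cls] / totals[cls] for cls in totals}


total_sources = sum(totals.values())  # should be close to 111 million
pct_099 = percentages(above_099, totals)
pct_090 = percentages(above_090, totals)

print(f"total: {total_sources:.1f} million")
for cls in totals:
    print(f"{cls}: {pct_099[cls]:.1f}% > 0.99, {pct_090[cls]:.1f}% > 0.9")
```

The three class counts sum to 111.3 million, consistent with the 111 million sources stated, and the recomputed fractions match the quoted percentages to within a point.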
