Semi-supervised classification of stars, galaxies and quasars using K-means and random-forest approaches
Vahid Asadi, Hosein Haghi, Akram Hasani Zonoozi

TL;DR
This paper introduces a semi-supervised learning framework combining K-means clustering and random forests to classify stars, galaxies, and quasars efficiently, achieving high accuracy with limited labeled data in large astronomical surveys.
Contribution
The paper presents a semi-supervised learning (SSL) method that uses K-means clustering to propagate labels from a small set of spectroscopically labeled objects to unlabeled data, improving classification performance in astronomy.
Findings
Achieves F1 scores over 98% for stars and galaxies, and 92% for quasars.
Outperforms traditional color-cut classification methods.
Demonstrates robustness in high-dimensional feature spaces.
Abstract
Classifying stars, galaxies, and quasars is essential for understanding cosmic structure and evolution; however, the vast data from modern surveys make manual classification impractical, while supervised learning methods remain constrained by the scarcity of labeled spectroscopic data. We aim to develop a scalable, label-efficient method for astronomical classification by leveraging semi-supervised learning (SSL) to overcome the limitations of fully supervised approaches. We propose a novel SSL framework combining K-means clustering with random forest classification. Our method partitions unlabeled data into 50 clusters, propagates labels from spectroscopically confirmed centroids to 95% of cluster members, and trains a random forest on the expanded pseudo-labeled dataset. We applied this to the CPz catalog, containing multi-survey photometric and spectroscopic data, and compared…
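The cluster-then-propagate idea described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses scikit-learn, synthetic Gaussian blobs in place of the CPz catalog, and a hypothetical `query_label` callback standing in for a spectroscopic lookup. It also assumes one reading of the abstract, namely that each cluster takes the label of its most central member and that this label is passed to the 95% of members nearest the centroid. The helper name `propagate_labels` and all parameter names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score


def propagate_labels(X, query_label, n_clusters=50, keep_frac=0.95, seed=0):
    """Cluster X, label each cluster from its most central member, and
    propagate that label to the keep_frac of members nearest the centroid.

    query_label(i) is a hypothetical stand-in for looking up the
    spectroscopic class of object i. Returns (indices, pseudo_labels).
    """
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X)
    # Distance of every object to the centroid of its own cluster.
    dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    idx_out, lab_out = [], []
    for c in range(n_clusters):
        members = np.flatnonzero(km.labels_ == c)
        if members.size == 0:
            continue
        order = members[np.argsort(dist[members])]  # nearest to centroid first
        label = query_label(order[0])               # one label query per cluster
        n_keep = max(1, int(np.floor(keep_frac * order.size)))
        idx_out.append(order[:n_keep])
        lab_out.append(np.full(n_keep, label))
    return np.concatenate(idx_out), np.concatenate(lab_out)


# Synthetic stand-in data (assumption): three well-separated classes in a
# 4-dimensional feature space, playing the role of star / galaxy / quasar.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0, 0.0, 0.0],
                    [4.0, 4.0, 0.0, 0.0],
                    [0.0, 4.0, 4.0, 0.0]])
y_true = rng.integers(0, 3, size=3000)
X = centers[y_true] + rng.normal(scale=1.0, size=(3000, 4))
y_test = rng.integers(0, 3, size=1000)
X_test = centers[y_test] + rng.normal(scale=1.0, size=(1000, 4))

# Only the object nearest each centroid has its true label looked up.
idx, pseudo = propagate_labels(X, query_label=lambda i: y_true[i])

# Train a random forest on the expanded pseudo-labeled set.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X[idx], pseudo)

# Per-class F1 on held-out data with known labels.
f1 = f1_score(y_test, rf.predict(X_test), average=None)
print("pseudo-labeled objects:", idx.size)
print("per-class F1:", np.round(f1, 3))
```

With real survey data, `X` would hold the photometric features and `query_label` would return the spectroscopic class of the selected object. Dropping the most distant 5% of each cluster is intended to keep likely boundary objects, where the cluster label is least reliable, out of the training set.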
