Scalable Solutions for Automated Single Pulse Identification and Classification in Radio Astronomy
Thomas Devine, Katerina Goseva-Popstojanova, Di Pang

TL;DR
This paper presents a scalable, parallel approach to detecting and classifying radio pulsar signals in large datasets using Apache Spark and machine learning, reporting up to a 5X speedup in candidate identification and a 54% average speed improvement in classification.
Contribution
It introduces a scalable Spark-based detection algorithm and a novel multiclass machine learning technique with feature selection for faster pulsar classification.
Findings
Speedup of up to 5X in candidate identification
54% average speed improvement in machine learning classification
Less than 2% reduction in classification accuracy
Abstract
Data collection for scientific applications is increasing exponentially and is forecasted to soon reach peta- and exabyte scales. Applications which process and analyze scientific data must be scalable and focus on execution performance to keep pace. In the field of radio astronomy, in addition to increasingly large datasets, tasks such as the identification of transient radio signals from extrasolar sources are computationally expensive. We present a scalable approach to radio pulsar detection written in Scala that parallelizes candidate identification to take advantage of in-memory task processing using Apache Spark on a YARN distributed system. Furthermore, we introduce a novel automated multiclass supervised machine learning technique that we combine with feature selection to reduce the time required for candidate classification. Experimental testing on a Beowulf cluster with 15…
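As a rough illustration of what parallelizing candidate identification can look like, the sketch below maps a simple threshold detector over independent trials and merges the results. It is not the authors' Scala/Spark implementation: the function names (`find_candidates`, `identify_parallel`), the (DM, time series) input layout, the signal-to-noise threshold of 5.0, and the use of a Python thread pool in place of Spark executors are all assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean, pstdev


def find_candidates(trial, threshold=5.0):
    """Return (dm, sample_index, snr) for samples at or above threshold.

    `trial` is a hypothetical (dm, series) pair: a dispersion-measure value
    and its time series. The detector is a plain z-score cut, chosen only
    to keep the example short.
    """
    dm, series = trial
    mu = mean(series)
    sigma = pstdev(series)
    if sigma == 0:
        # A perfectly flat series has no outliers to report.
        return []
    candidates = []
    for i, x in enumerate(series):
        snr = (x - mu) / sigma
        if snr >= threshold:
            candidates.append((dm, i, snr))
    return candidates


def identify_parallel(trials, threshold=5.0, workers=4):
    """Run the detector over all trials concurrently and flatten the results.

    Each trial is independent, so the work is a map followed by a flatten,
    the same shape a map over distributed partitions would have.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_trial = pool.map(lambda t: find_candidates(t, threshold), trials)
        return [c for cands in per_trial for c in cands]


if __name__ == "__main__":
    trials = [
        (10.0, [0.0] * 100),                          # flat, no candidates
        (20.0, [0.0] * 42 + [100.0] + [0.0] * 57),    # one strong pulse
    ]
    for dm, idx, snr in identify_parallel(trials):
        print(f"DM={dm} sample={idx} S/N={snr:.2f}")
```

Because every trial is processed independently, the same map-then-flatten structure could in principle be distributed across cluster nodes; the thread pool here only stands in for that distribution on a single machine.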
