Pre-feasibility Study of Astronomical Data Archive Systems Powered by Public Cloud Computing and Hadoop Hive
Satoshi Eguchi

TL;DR
This study explores the feasibility of using public cloud computing and Hadoop Hive for managing and analyzing large astronomical data archives, comparing performance and stability of different cloud setups.
Contribution
It demonstrates the potential of cloud-based Hadoop and Hive clusters for astronomical data processing and evaluates various partitioning algorithms for optimized performance.
Findings
EMR clusters show more stable performance than the VPS cluster, whose behavior varies from day to day.
Partitioning algorithms significantly impact Hive query efficiency.
Cloud solutions can reduce costs for large-scale astronomical data analysis.
Abstract
The volume of astronomical observational data is increasing every year. For example, the Atacama Large Millimeter/submillimeter Array (ALMA) is expected to generate 200 TB of raw data every year, while the Large Synoptic Survey Telescope (LSST) is estimated to produce 15 TB of raw data every night. Because computing power grows much more slowly than astronomical data volume, providing high performance computing (HPC) resources together with scientific data will become common in the next decade. However, the installation and maintenance costs of an HPC system can be burdensome for the provider. I consider public cloud computing as an alternative way to obtain sufficient computing resources inexpensively. I build Hadoop and Hive clusters using a virtual private server (VPS) service and Amazon Elastic MapReduce (EMR), and measure their performance. The VPS cluster behaves differently day by day, while the EMR clusters…
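To put the two data rates quoted in the abstract on a common footing, the following minimal sketch converts the per-night figure for the Large Synoptic Survey Telescope into an annual volume and compares it with the annual figure for the Atacama Large Millimeter/submillimeter Array. The number of observing nights per year is not given in the abstract, so the values used here (200, 300, 365) are illustrative assumptions, and the function names are hypothetical.

```python
# Raw data rates quoted in the abstract.
ALMA_TB_PER_YEAR = 200.0
LSST_TB_PER_NIGHT = 15.0


def lsst_tb_per_year(observing_nights: int) -> float:
    """Annual LSST raw data volume in TB for an ASSUMED number of observing nights."""
    if not 0 <= observing_nights <= 366:
        raise ValueError("observing_nights must be between 0 and 366")
    return LSST_TB_PER_NIGHT * observing_nights


def nights_to_match_alma() -> float:
    """Number of LSST nights whose raw data equals one year of ALMA raw data."""
    return ALMA_TB_PER_YEAR / LSST_TB_PER_NIGHT


if __name__ == "__main__":
    # The night counts below are assumptions, not values from the paper.
    for nights in (200, 300, 365):
        volume = lsst_tb_per_year(nights)
        ratio = volume / ALMA_TB_PER_YEAR
        print(f"{nights} nights: {volume:.0f} TB/year, {ratio:.1f}x ALMA")
    print(f"Nights to match one ALMA year: {nights_to_match_alma():.1f}")
```

Under any of these assumed night counts, the annual volume is in the petabyte range, which is the scale the abstract has in mind when it argues that computing resources will need to be provided alongside the data.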
Taxonomy
Topics: Cloud Computing and Resource Management · Caching and Content Delivery · Peer-to-Peer Network Technologies
