Leveraging Data Preparation, HBase NoSQL Storage, and HiveQL Querying for COVID-19 Big Data Analytics Projects

Karim Baïna

arXiv:2004.00253·cs.DB·April 2, 2020·1 cites

Open Access

TL;DR

This paper presents a detailed approach for preparing, storing, and querying COVID-19 data using HBase NoSQL and HiveQL to streamline data analysis workflows for researchers.

Contribution

It introduces specific schemas and scripts for efficient COVID-19 data formatting, storage, and querying, reducing data preparation efforts in analytics projects.

Findings

01. Significant reduction in data preparation time for COVID-19 analysis

02. Effective use of HBase and HiveQL for large-scale COVID-19 data management

03. Enhanced accessibility of COVID-19 data for researchers

Abstract

Epidemiologists, scientists, statisticians, historians, data engineers, and data scientists are working to find descriptive models and theories that explain the spread of COVID-19, or to build predictive analytics models that estimate the apex of the evolution curves for COVID-19 confirmed cases, recovered cases, and deaths. In the CRISP-DM life cycle, 75% of the time is consumed by the data preparation phase alone, which puts considerable pressure and stress on the scientists and data scientists building machine learning models. This paper aims to help reduce data preparation effort by presenting detailed schema designs and technical data preparation scripts for formatting and storing Johns Hopkins University COVID-19 daily data in the HBase NoSQL data store, and for enabling COVID-19 data to be queried with HiveQL in a relational, SQL-like style.
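As a rough illustration of the kind of preparation step the abstract describes, the sketch below turns daily-report CSV records into HBase-style row keys and column-family-qualified cells. It is not the paper's schema, which is not reproduced on this page: the column families `loc` and `cases`, the `country|province|date` row key, the helper names, and the CSV header names are all assumptions made for illustration, and the sample values are made up.

```python
import csv
import io

# Hypothetical column families; the paper's actual schema is not shown here.
LOCATION_CF = "loc"
CASES_CF = "cases"


def to_hbase_row(record, report_date):
    """Map one daily-report CSV record to a (row_key, cells) pair."""
    country = record["Country_Region"].strip()
    province = (record.get("Province_State") or "").strip()
    # Assumed row key layout: country|province|date, so that one region's
    # daily rows sort next to each other in a scan.
    row_key = "|".join([country, province, report_date])
    cells = {
        f"{LOCATION_CF}:country": country,
        f"{LOCATION_CF}:province": province,
        # Empty counts are stored as "0" so later numeric casts do not fail.
        f"{CASES_CF}:confirmed": record["Confirmed"] or "0",
        f"{CASES_CF}:deaths": record["Deaths"] or "0",
        f"{CASES_CF}:recovered": record["Recovered"] or "0",
    }
    return row_key, cells


def prepare(csv_text, report_date):
    """Parse CSV text and return a list of (row_key, cells) pairs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [to_hbase_row(rec, report_date) for rec in reader]


# Made-up sample values, for illustration only (not real case counts).
SAMPLE = """Province_State,Country_Region,Confirmed,Deaths,Recovered
RegionX,CountryA,100,5,20
,CountryB,7,,1
"""

rows = prepare(SAMPLE, "2020-04-01")
for key, cells in rows:
    print(key, cells)
```

Each `(row_key, cells)` pair could then be written with whichever HBase client a project uses. On the query side, Hive's HBase storage handler can expose such column families as relational columns through its `hbase.columns.mapping` table property, which matches the SQL-like querying style the abstract mentions.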

Taxonomy

Topics: Cloud Computing and Resource Management · Distributed systems and fault tolerance · Software System Performance and Reliability