A Big Data Lake for Multilevel Streaming Analytics

Ruoran Liu; Haruna Isah; Farhana Zulkernine

arXiv:2009.12415 · cs.DC · September 29, 2020

TL;DR

This paper describes the design and implementation of a scalable, Hadoop-based data lake architecture for multilevel streaming analytics, addressing the challenges posed by high-volume, high-velocity data from diverse sources.

Contribution

It presents a comprehensive approach to building a data lake with Hadoop, including a real-world use case for streaming data ingestion and analytics.

Findings

01 Identified limitations of traditional data warehouses for modern data paradigms.

02 Compared various open source and commercial data lake platforms.

03 Demonstrated a practical implementation with a real-world use case.

Abstract

Large organizations are seeking new architectures and scalable platforms to handle the data management challenges created by data growth on a scale rarely seen in the past. These challenges stem largely from streaming data that arrives at high velocity from various sources in multiple formats. These changes in the data paradigm have led to the emergence of new data analytics and management architectures. This paper focuses on storing high-volume, high-velocity, and high-variety data in its raw formats in a data storage architecture called a data lake. First, we present our study of the limitations of traditional data warehouses in handling recent changes in data paradigms. We discuss and compare different open source and commercial platforms that can be used to develop a data lake. We then describe our end-to-end data lake design and implementation approach…

Taxonomy

Topics: Data Quality and Management · Cloud Computing and Resource Management · Big Data and Business Intelligence