A Big Data Lake for Multilevel Streaming Analytics
Ruoran Liu, Haruna Isah, Farhana Zulkernine

TL;DR
This paper discusses designing and implementing a scalable data lake architecture using Hadoop for multilevel streaming analytics, addressing challenges posed by high-volume, high-velocity, and diverse data sources.
Contribution
It presents a comprehensive approach to building a data lake with Hadoop, including a real-world use case for streaming data ingestion and analytics.
Findings
Identified limitations of traditional data warehouses for modern data paradigms.
Compared various open-source and commercial data lake platforms.
Demonstrated a practical implementation with a real-world use case.
Abstract
Large organizations are seeking new architectures and scalable platforms to handle data management challenges arising from data growth on a scale rarely seen in the past. These challenges are largely posed by streaming data arriving at high velocity from various sources in multiple formats. This change in data paradigm has led to the emergence of new data analytics and management architectures. This paper focuses on storing high-volume, high-velocity, and high-variety data in its raw format in a data storage architecture called a data lake. First, we present our study of the limitations of traditional data warehouses in handling recent changes in data paradigms. We discuss and compare different open-source and commercial platforms that can be used to develop a data lake. We then describe our end-to-end data lake design and implementation approach…
Taxonomy
Topics: Data Quality and Management · Cloud Computing and Resource Management · Big Data and Business Intelligence
