On data lake architectures and metadata management

Pegdwendé Sawadogo (ERIC); Jérôme Darmont (ERIC)

arXiv:2107.11152·cs.DB·July 26, 2021

TL;DR

This paper surveys approaches to data lake design, focusing on architectures and metadata management, and clarifies what a data lake is and how it addresses the challenges of large, heterogeneous big data.

Contribution

It provides a comprehensive state of the art of data lake design approaches, focusing on architectures and metadata management, and addresses the ambiguity of the data lake concept, including its frequent confusion with the Hadoop technology.

Findings

01

Analyzes various data lake architectures.

02

Highlights metadata management challenges.

03

Discusses pros and cons of data lakes.

Abstract

Over the past two decades, we have witnessed an exponential increase in data production worldwide. So-called big data generally come from transactional systems, and even more so from the Internet of Things and social media. They are mainly characterized by volume, velocity, variety and veracity issues. Big data-related issues strongly challenge traditional data management and analysis systems. The concept of data lake was introduced to address them. A data lake is a large, raw data repository that stores and manages all of a company's data, whatever their format. However, the data lake concept remains ambiguous or fuzzy for many researchers and practitioners, who often confuse it with the Hadoop technology. Thus, we provide in this paper a comprehensive state of the art of the different approaches to data lake design. We particularly focus on data lake architectures and metadata management,…
