Modeling Data Lake Metadata with a Data Vault
Iuri Nogueira (UL2), Maram Romdhane (UL2), Jérôme Darmont (ERIC)

TL;DR
This paper proposes using a data vault, an ensemble modeling technique, to manage metadata in data lakes, addressing schema evolution issues and improving access efficiency.
Contribution
It introduces a novel approach of applying data vault modeling to data lake metadata, with practical instantiations in relational and document-oriented models.
Findings
Relational and document models successfully instantiate the metadata model.
The models improve metadata management and access efficiency.
Comparison shows trade-offs in storage and query response times.
Abstract
With the rise of big data, business intelligence had to find solutions for managing even greater data volumes and variety than data warehouses, which proved ill-adapted, could handle. Data lakes answer these needs from a storage point of view, but require adequate metadata management to guarantee efficient access to data. Starting from a multidimensional metadata model designed for an industrial heritage data lake, which lacks schema evolvability, we propose in this paper to use ensemble modeling, and more precisely a data vault, to address this issue. To illustrate the feasibility of this approach, we instantiate our conceptual metadata model into logical and physical models, both relational and document-oriented. We also compare the physical models in terms of metadata storage and query response time.
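To make the schema-evolution argument concrete, here is a minimal sketch of a relational data vault in SQLite. It relies only on the standard data vault building blocks: hubs hold business keys, links hold relationships between hubs, and satellites hold descriptive attributes. Every table and column name (`hub_document`, `hub_author`, `sat_document_rights`, and so on) is a hypothetical illustration and not the paper's actual metadata model. The point is that a new group of metadata attributes arrives as a new satellite, so the existing tables are left unaltered.

```python
import sqlite3


def build_vault(conn):
    """Create a tiny, hypothetical data vault: two hubs, one link, one satellite."""
    conn.executescript("""
    CREATE TABLE hub_document (
        doc_key   TEXT PRIMARY KEY,      -- business key of a lake object
        load_date TEXT NOT NULL
    );
    CREATE TABLE hub_author (
        author_key TEXT PRIMARY KEY,
        load_date  TEXT NOT NULL
    );
    CREATE TABLE link_document_author (  -- relationship between the two hubs
        doc_key    TEXT NOT NULL REFERENCES hub_document(doc_key),
        author_key TEXT NOT NULL REFERENCES hub_author(author_key),
        load_date  TEXT NOT NULL,
        PRIMARY KEY (doc_key, author_key)
    );
    CREATE TABLE sat_document_descr (    -- descriptive attributes, versioned by load_date
        doc_key   TEXT NOT NULL REFERENCES hub_document(doc_key),
        load_date TEXT NOT NULL,
        title     TEXT,
        format    TEXT,
        PRIMARY KEY (doc_key, load_date)
    );
    """)


def add_satellite(conn, name, hub, key, columns):
    """Evolve the schema by adding a satellite; no existing table is touched.

    Identifiers are interpolated directly, so pass trusted names only.
    """
    cols = ", ".join(f"{c} TEXT" for c in columns)
    conn.execute(
        f"CREATE TABLE {name} ("
        f"{key} TEXT NOT NULL REFERENCES {hub}({key}), "
        f"load_date TEXT NOT NULL, {cols}, "
        f"PRIMARY KEY ({key}, load_date))"
    )


def table_names(conn):
    """Return the names of all tables, sorted alphabetically."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [r[0] for r in rows]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_vault(conn)
    # New metadata (hypothetical usage rights) is added as a separate satellite.
    add_satellite(conn, "sat_document_rights", "hub_document", "doc_key",
                  ["license", "holder"])
    print(table_names(conn))
```

A document-oriented instantiation would presumably nest or reference the same hub, link, and satellite structures in collections; that mapping is not sketched here because the abstract does not detail it.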
