Joint Management and Analysis of Textual Documents and Tabular Data within the AUDAL Data Lake
Pegdwendé Sawadogo (ERIC), Jérôme Darmont (ERIC), Camille Noûs

TL;DR
This paper presents a novel approach to designing a data lake that integrates textual and tabular data, utilizing an extensive metadata system to enable advanced querying and analysis, demonstrated through real-world and benchmark evaluations.
Contribution
The paper introduces a new data lake design with an extensive metadata system that supports joint management of textual documents and tabular data, enabling richer analysis features.
Findings
Successful implementation in the AUDAL data lake
Enhanced data retrieval and content analysis capabilities
Validated effectiveness through real-world and benchmark tests
Abstract
In 2010, the concept of data lake emerged as an alternative to data warehouses for big data management. Data lakes follow a schema-on-read approach to provide rich and flexible analyses. However, although trendy in both the industry and academia, the concept of data lake is still maturing, and there are still few methodological approaches to data lake design. Thus, we introduce a new approach to design a data lake and propose an extensive metadata system to activate richer features than those usually supported in data lake approaches. We implement our approach in the AUDAL data lake, where we jointly exploit both textual documents and tabular data, in contrast with structured and/or semi-structured data typically processed in data lakes from the literature. Furthermore, we also innovate by leveraging metadata to activate both data retrieval and content analysis, including Text-OLAP and…
Taxonomy
Topics: Data Quality and Management · Advanced Database Systems and Queries · Semantic Web and Ontologies
