Metadata Management for Textual Documents in Data Lakes

Pegdwendé Sawadogo (ERIC); Tokio Kibata; Jérôme Darmont (ERIC)

arXiv:1905.04037 · cs.DB · May 13, 2019 · 26 cites

Open Access

TL;DR

This paper presents a metadata management approach designed for textual documents in data lakes. It addresses the lack of support for unstructured data in existing metadata systems, which helps keep a data lake from degrading into a data swamp.

Contribution

It introduces a methodological framework for extracting, storing, and reusing metadata specific to textual documents in data lakes, validated through the COREL project.

Findings

01. Identified key metadata types for textual documents
02. Developed techniques for metadata extraction from text
03. Validated approach within the COREL project

Abstract

Data lakes have emerged as an alternative to data warehouses for the storage, exploration and analysis of big data. In a data lake, data are stored in a raw state and bear no explicit schema. Hence, an efficient metadata system is essential to prevent the data lake from turning into a so-called data swamp. Existing work on managing data lake metadata mostly focuses on structured and semi-structured data, with little research on unstructured data. Thus, we propose in this paper a methodological approach to build and manage a metadata system that is specific to textual documents in data lakes. First, we make an inventory of usual and meaningful metadata to extract. Then, we apply some specific techniques from the text mining and information retrieval domains to extract, store and reuse these metadata within the COREL research project, in order to validate our proposals.
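To make the idea of extracting and reusing metadata from raw text more concrete, here is a minimal Python sketch. It is not the authors' method: the abstract does not say which metadata or techniques are used, so the record fields, the tiny stop-word list, and the function names (`tokenize`, `extract_metadata`, `cosine_similarity`) are all assumptions chosen for illustration. It builds a small per-document record from term frequencies, then reuses those records to compare two documents with bag-of-words cosine similarity, a standard information retrieval measure.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "and", "are", "as", "in", "is", "of", "the", "to"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def extract_metadata(doc_id, text, top_k=3):
    """Build a small metadata record for one raw document (illustrative fields)."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    return {
        "id": doc_id,
        "n_chars": len(text),          # size of the raw text
        "n_tokens": len(tokens),       # tokens left after stop-word removal
        "top_terms": [term for term, _ in counts.most_common(top_k)],
        "term_counts": counts,         # bag-of-words representation
    }


def cosine_similarity(a, b):
    """Cosine similarity between two term-count mappings; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Made-up documents standing in for files stored in the lake.
    docs = {
        "d1": "Data lakes store raw data. Data lakes need metadata.",
        "d2": "Metadata describes raw data in data lakes.",
    }
    catalog = {doc_id: extract_metadata(doc_id, text) for doc_id, text in docs.items()}
    similarity = cosine_similarity(
        catalog["d1"]["term_counts"], catalog["d2"]["term_counts"]
    )
    print(catalog["d1"]["top_terms"], round(similarity, 3))
```

In this sketch the catalog is an in-memory dictionary; a real data lake would persist such records in a dedicated metadata store, which the abstract leaves unspecified.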

Taxonomy

Topics: Data Quality and Management · Semantic Web and Ontologies · Advanced Database Systems and Queries