Exploiting Formal Concept Analysis for Data Modeling in Data Lakes
Anes Bendimerad, Romain Mathonat, Youcef Remil, Mehdi Kaytoue

TL;DR
This paper presents a Formal Concept Analysis-based approach to unify and simplify data structures in data lakes, significantly reducing complexity and enhancing data accessibility for analytics.
Contribution
It introduces a novel FCA-driven methodology for consolidating heterogeneous data structures into a unified schema within data lakes.
Findings
54% reduction in distinct data structure field names
80% data structure coverage with 34 field names
Effective identification of common concepts across data structures
Abstract
Data lakes are widely used to store extensive and heterogeneous datasets for advanced analytics. However, the unstructured nature of data in these repositories introduces complexities in exploiting them and extracting meaningful insights. This motivates the need to explore efficient approaches for consolidating data lakes and deriving a common and unified schema. This paper introduces a practical data visualization and analysis approach rooted in Formal Concept Analysis (FCA) to systematically clean, organize, and design data structures within a data lake. We explore diverse data structures stored in our data lake at Infologic, including InfluxDB measurements and Elasticsearch indexes, aiming to derive conventions for a more accessible data model. Leveraging FCA, we represent data structures as objects, analyze the concept lattice, and present two strategies (top-down and bottom-up) to…
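To make the FCA framing concrete, here is a minimal sketch of how formal concepts can be derived once data structures are treated as objects. It assumes field names play the role of attributes, which the summary suggests but does not spell out. The structure names, field names, and helper functions are all hypothetical and chosen for illustration. The enumeration is a brute-force closure over object subsets, not the authors' implementation, and it only suits very small contexts.

```python
from itertools import combinations

# Hypothetical formal context: data structure (object) -> field names (attributes).
# None of these names come from the paper.
context = {
    "influx_cpu": {"host", "timestamp", "cpu_usage"},
    "influx_mem": {"host", "timestamp", "mem_usage"},
    "es_logs": {"host", "timestamp", "message"},
    "es_audit": {"timestamp", "user"},
}


def extent(attrs, ctx):
    """Return the objects that contain every attribute in attrs."""
    return frozenset(o for o, fields in ctx.items() if attrs <= fields)


def intent(objs, ctx):
    """Return the attributes shared by every object in objs.

    For an empty set of objects this is the full attribute set.
    """
    shared = set().union(*ctx.values())
    for o in objs:
        shared &= ctx[o]
    return frozenset(shared)


def formal_concepts(ctx):
    """Enumerate all (extent, intent) pairs by closing every object subset.

    Brute force: exponential in the number of objects.
    """
    found = set()
    objs = list(ctx)
    for r in range(len(objs) + 1):
        for subset in combinations(objs, r):
            b = intent(subset, ctx)   # attributes common to the subset
            a = extent(b, ctx)        # all objects sharing those attributes
            found.add((a, b))
    return found


concepts = formal_concepts(context)

# List concepts from the most general (largest extent) to the most specific.
for a, b in sorted(concepts, key=lambda c: -len(c[0])):
    print(sorted(a), "share", sorted(b))
```

Each resulting pair groups the structures that share a given set of field names. Reading the pairs from the largest extent downward roughly corresponds to a top-down walk of the lattice, and reading them from the smallest extent upward to a bottom-up one.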
