FREYJA: Efficient Join Discovery in Data Lakes

Marc Maynou; Sergi Nadal; Raquel Panadero; Javier Flores; Oscar Romero; Anna Queralt

arXiv:2412.06637·cs.DB·January 23, 2026

Open Access

TL;DR

FREYJA is a data discovery system for data lakes that efficiently identifies relevant join candidates using a novel, scalable join quality metric based on data profiles, achieving high accuracy with significantly reduced computation.

Contribution

FREYJA introduces a scalable join quality measure and a predictive model leveraging data profiles, outperforming existing methods in efficiency while maintaining accuracy.
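As a rough illustration of ranking join candidates from data profiles, here is a minimal sketch. It is not FREYJA's metric or model: the three profile features, the function names (`profile`, `profile_distance`, `rank_candidates`), and the use of plain Euclidean distance in place of a learned predictive model are all assumptions made for this example.

```python
from math import sqrt


def profile(values):
    """Summarize a column with a few cheap statistics (hypothetical feature set)."""
    strs = [str(v) for v in values]
    n = len(strs)
    if n == 0:
        return {"uniqueness": 0.0, "avg_length": 0.0, "digit_ratio": 0.0}
    return {
        # Fraction of values that are distinct.
        "uniqueness": len(set(strs)) / n,
        # Mean number of characters per value (unnormalized in this sketch).
        "avg_length": sum(len(s) for s in strs) / n,
        # Fraction of values made up only of digits.
        "digit_ratio": sum(s.isdigit() for s in strs) / n,
    }


def profile_distance(p, q):
    """Euclidean distance between two profiles; a stand-in for a learned model."""
    return sqrt(sum((p[k] - q[k]) ** 2 for k in p))


def rank_candidates(query_values, candidates):
    """Return (column name, distance) pairs, most similar profile first."""
    query_profile = profile(query_values)
    scored = [
        (name, profile_distance(query_profile, profile(values)))
        for name, values in candidates.items()
    ]
    return sorted(scored, key=lambda item: item[1])


query = ["ES", "FR", "DE", "IT"]
candidates = {
    "country_code": ["FR", "DE", "ES", "PT"],
    "population": [47000000, 68000000, 83000000, 59000000],
}
ranked = rank_candidates(query, candidates)
```

The point of the sketch is that each column is summarized once and candidates are compared through those summaries, so no pairwise value comparison is needed at ranking time. A real system would normalize the features and would most likely replace the distance with a trained predictor.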

Findings

01

FREYJA matches state-of-the-art accuracy in join discovery.

02

It reduces execution times by several orders of magnitude.

03

The system effectively explores large data lakes for downstream tasks.

Abstract

Data lakes are massive repositories of raw and heterogeneous data, designed to meet the requirements of modern data storage. Nonetheless, this same philosophy increases the complexity of performing discovery tasks to find relevant data for subsequent processing. In response to these growing challenges, we present FREYJA, a modern data discovery system capable of effectively exploring data lakes, aimed at finding join candidates that increase the number of attributes available for downstream tasks. More precisely, we want to compute rankings that sort potential joins by their relevance. Modern mechanisms apply advanced table representation learning (TRL) techniques to yield accurate joins. Yet, this incurs high computational costs when dealing with large volumes of data. In contrast to the state of the art, we adopt a novel notion of join quality tailored to data lakes, which…

Taxonomy

Topics: Data Stream Mining Techniques · Data Mining Algorithms and Applications · Data Quality and Management