Evaluating Joinable Column Discovery Approaches for Context-Aware Search

Harsha Kokel; Aamod Khatiwada; Tejaswini Pedapati; Haritha Ananthakrishnan; Oktie Hassanzadeh; Horst Samulowitz; Kavitha Srinivas

arXiv:2510.24599 · cs.DB · October 29, 2025

TL;DR

This paper evaluates joinable column discovery methods on seven benchmarks spanning relational databases and data lakes, analyzing how six criteria and the data context affect their effectiveness in automating enterprise data analysis.

Contribution

The paper provides an extensive experimental comparison of syntactic and semantic approaches, showing how individual criteria and their ensemble combinations behave across relational databases and data lakes.

Findings

1. Metadata and value semantics are vital for data lakes.
2. Size-based criteria are more effective in relational databases.
3. Ensemble ranking methods outperform single-criterion approaches.

Abstract

Joinable Column Discovery is a critical challenge in automating enterprise data analysis. While existing approaches focus on syntactic overlap and semantic similarity, there remains limited understanding of which methods perform best for different data characteristics and how multiple criteria influence discovery effectiveness. We present a comprehensive experimental evaluation of joinable column discovery methods across diverse scenarios. Our study compares syntactic and semantic techniques on seven benchmarks covering relational databases and data lakes. We analyze six key criteria -- unique values, intersection size, join size, reverse join size, value semantics, and metadata semantics -- and examine how combining them through ensemble ranking affects performance. Our analysis reveals differences in method behavior across data contexts and highlights the benefits of integrating…
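To make the size-based criteria and the idea of ensemble ranking concrete, here is a minimal Python sketch. It is not the authors' implementation. The abstract does not define the criteria precisely, so the definitions below are assumptions: unique values is taken as the number of distinct values in a candidate column, intersection size as the number of distinct values shared with the query column, and join size as the row count of an equi-join. Reverse join size and the two semantic criteria are left out because the abstract gives no definition to base them on. The mean-rank combination and the column names are likewise hypothetical.

```python
from collections import Counter


def unique_values(column):
    # Assumed definition: number of distinct values in the column.
    return len(set(column))


def intersection_size(query, candidate):
    # Assumed definition: distinct values shared by both columns.
    return len(set(query) & set(candidate))


def join_size(query, candidate):
    # Assumed definition: rows produced by an equi-join, i.e. for every
    # shared value, (occurrences in query) * (occurrences in candidate).
    q_counts, c_counts = Counter(query), Counter(candidate)
    return sum(n * c_counts[v] for v, n in q_counts.items() if v in c_counts)


def rank_by(scores):
    # Rank candidates 1..n, higher score first; ties broken by name.
    ordered = sorted(scores, key=lambda name: (-scores[name], name))
    return {name: position + 1 for position, name in enumerate(ordered)}


def ensemble_rank(query, candidates):
    # Hypothetical ensemble: average each candidate's rank over all
    # criteria and sort by that mean rank (lower is better).
    criteria = [
        {name: unique_values(col) for name, col in candidates.items()},
        {name: intersection_size(query, col) for name, col in candidates.items()},
        {name: join_size(query, col) for name, col in candidates.items()},
    ]
    ranks = [rank_by(scores) for scores in criteria]
    mean_rank = {
        name: sum(r[name] for r in ranks) / len(ranks) for name in candidates
    }
    return sorted(candidates, key=lambda name: (mean_rank[name], name))


# Illustrative data only; the table and column names are made up.
query = ["us", "uk", "de", "fr"]
candidates = {
    "orders.country": ["us", "us", "uk", "jp"],
    "geo.code": ["us", "uk", "de", "fr", "jp"],
    "misc.tag": ["a", "b"],
}
print(ensemble_rank(query, candidates))
```

On this toy input all three criteria agree, so the ensemble order matches each single-criterion order. The combination only matters when criteria disagree, for example when a candidate has a large join size driven by a few heavily repeated values but a small intersection.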
