Snoopy: Effective and Efficient Semantic Join Discovery via Proxy Columns
Yuxiang Guo, Yuren Mao, Zhonghao Hu, Lu Chen, Yunjun Gao

TL;DR
Snoopy introduces a novel proxy-column-based embedding framework for semantic join discovery that significantly improves effectiveness and efficiency over existing methods, enabling scalable dataset integration.
Contribution
The paper proposes a new column embedding approach using proxy columns and a rank-aware contrastive learning paradigm to bridge effectiveness and efficiency in semantic join discovery.
Findings
Outperforms state-of-the-art column-level methods by 16% in Recall@25 and 10% in NDCG@25.
Runs at least 5 orders of magnitude faster than cell-level methods.
Runs 3.5 times faster than existing column-level methods.
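For readers unfamiliar with the two ranking metrics cited above, here is a minimal sketch of Recall@k and NDCG@k. It uses the textbook definitions, which is an assumption: the paper's exact variant (for example, how the ground-truth set and the relevance gains are derived from joinability scores) may differ. The function names and the toy column identifiers are hypothetical.

```python
import math


def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant columns that appear in the top-k retrieved list."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)


def ndcg_at_k(retrieved, gains, k):
    """Normalized discounted cumulative gain over the top-k retrieved list.

    `gains` maps a column id to a graded relevance value (assumed here to be
    its ground-truth joinability); ids missing from `gains` count as 0.
    """
    dcg = sum(
        gains.get(col, 0.0) / math.log2(rank + 2)
        for rank, col in enumerate(retrieved[:k])
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


# Toy data (hypothetical): three retrieved columns, three truly joinable ones.
retrieved = ["a", "b", "c"]
relevant = ["a", "c", "d"]
gains = {"a": 1.0, "c": 0.5, "d": 0.8}

print(recall_at_k(retrieved, relevant, k=3))
print(ndcg_at_k(retrieved, gains, k=3))
```

Recall@k only checks whether the right columns were retrieved, while NDCG@k also rewards placing the most joinable columns near the top of the list.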
Abstract
Semantic join discovery, which aims to find columns in a table repository with high semantic joinabilities to a query column, is crucial for dataset discovery. Existing methods can be divided into two categories: cell-level methods and column-level methods. However, neither of them ensures both effectiveness and efficiency simultaneously. Cell-level methods, which compute the joinability by counting cell matches between columns, enjoy ideal effectiveness but suffer poor efficiency. In contrast, column-level methods, which determine joinability only by computing the similarity of column embeddings, enjoy proper efficiency but suffer poor effectiveness due to the issues occurring in their column embeddings: (i) semantics-joinability-gap, (ii) size limit, and (iii) permutation sensitivity. To address these issues, this paper proposes to compute column embeddings via proxy columns;…
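The contrast between the two families of methods can be illustrated with a small sketch. Everything below is an assumption made for illustration: the toy 2-d cell vectors, the similarity threshold `tau`, and the mean-pooled column embedding are stand-ins, and the mean pooling in particular is a naive baseline, not Snoopy's proxy-column embedding.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cell_level_joinability(query_cells, candidate_cells, tau=0.9):
    """Cell-level style: count query cells that match some candidate cell.

    A query cell "matches" if its similarity to at least one candidate cell
    is >= tau (assumed threshold). Cost is O(|Q| * |C|) per column pair.
    """
    if not query_cells:
        return 0.0
    matched = sum(
        1
        for q in query_cells
        if any(cosine(q, c) >= tau for c in candidate_cells)
    )
    return matched / len(query_cells)


def mean_pool(cells):
    """Naive column embedding: element-wise mean of the cell vectors."""
    dim = len(cells[0])
    return [sum(cell[i] for cell in cells) / len(cells) for i in range(dim)]


def column_level_score(query_cells, candidate_cells):
    """Column-level style: a single similarity between two column embeddings."""
    return cosine(mean_pool(query_cells), mean_pool(candidate_cells))


# Toy cell embeddings (hypothetical values).
query = [[1.0, 0.0], [0.0, 1.0]]
candidate = [[1.0, 0.0], [0.9, 0.1]]

print(cell_level_joinability(query, candidate))  # one of two query cells matches
print(column_level_score(query, candidate))      # one vector comparison
```

The cell-level function compares every pair of cells, which is why that family is accurate but slow on large repositories. The column-level function needs only one comparison per column pair, but its score is not guaranteed to track the match count, which is the semantics-joinability gap the abstract refers to.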
Taxonomy
Topics: Web Data Mining and Analysis · Video Analysis and Summarization · Spam and Phishing Detection
Methods: Contrastive Learning
