Shape-Agnostic Table Overlap Discovery: A Maximum Common Subhypergraph Approach
Ge Lee, Shixun Huang, Zhifeng Bao, Felix Naumann, Shazia Sadiq, Yanchang Zhao

TL;DR
This paper introduces SALTO, a shape-agnostic notion of table overlap computed with a hypergraph-based method. It captures arbitrary-shaped overlaps between tables, addressing the limitations of rectangular overlap definitions and enabling more flexible data matching.
Contribution
We propose SALTO, a generalized notion of table overlap, along with a hypergraph model and the HyperSplit algorithm for efficiently computing arbitrary-shaped, non-contiguous overlaps, improving over prior rectangular overlap methods.
Findings
HyperSplit discovers larger overlaps in up to 78.8% of cases.
It outperforms state-of-the-art methods in effectiveness and efficiency.
Case studies demonstrate practical benefits in data deduplication and version comparison.
Abstract
Understanding how two tables overlap is useful for many data management tasks, but challenging because tables often differ in row and column orders and lack reliable metadata in practice. Prior work defines the largest rectangular overlap, which identifies the maximal contiguous region of matching cells under row and column permutations. However, real overlaps are rarely rectangular: many valid matches may lie outside any single contiguous block. In this paper, we introduce the Shape-Agnostic Largest Table Overlap (SALTO), a novel generalized notion of overlap that captures arbitrary-shaped, non-contiguous overlaps between tables. To tackle the combinatorial complexity of row and column permutations, we propose to model each table as a hypergraph, casting SALTO computation into a maximum common subhypergraph problem. We prove their equivalence and show the problem is NP-hard to…
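The gap between the rectangular and shape-agnostic notions can be illustrated with a brute-force sketch on a toy pair of tables. This is not HyperSplit and does not use the paper's hypergraph model. It simply enumerates every row and column alignment, which is only feasible for tiny inputs. The function names, the example tables, and the reading of "rectangular overlap" as the largest fully matching block of aligned rows and columns are assumptions made for illustration.

```python
from itertools import combinations, permutations


def _alignments(a, b):
    """Yield the cell-match matrix for every injective row/column alignment of a into b.

    Illustrative assumption: a has no more rows or columns than b.
    """
    ra, ca, rb, cb = len(a), len(a[0]), len(b), len(b[0])
    assert ra <= rb and ca <= cb, "sketch assumes table a fits inside table b"
    for rows in permutations(range(rb), ra):
        for cols in permutations(range(cb), ca):
            yield [[a[i][j] == b[rows[i]][cols[j]] for j in range(ca)]
                   for i in range(ra)]


def shape_agnostic_overlap(a, b):
    """Largest number of matching cells over all alignments, regardless of shape."""
    return max(sum(map(sum, m)) for m in _alignments(a, b))


def rectangular_overlap(a, b):
    """Largest block (row subset x column subset) in which every aligned cell matches."""
    best = 0
    for m in _alignments(a, b):
        n_cols = len(m[0])
        for k in range(1, n_cols + 1):
            for subset in combinations(range(n_cols), k):
                # Rows that match on every column of the chosen subset.
                full_rows = sum(all(row[j] for j in subset) for row in m)
                best = max(best, k * full_rows)
    return best


# Hypothetical example: TABLE_B is TABLE_A with rows and columns shuffled
# and three cells overwritten (the negative values).
TABLE_A = [[1, 2, 3],
           [4, 5, 6],
           [7, 8, 9]]
TABLE_B = [[-3, 9, 7],
           [2, -1, 1],
           [5, 6, -2]]

if __name__ == "__main__":
    print(shape_agnostic_overlap(TABLE_A, TABLE_B))  # 6
    print(rectangular_overlap(TABLE_A, TABLE_B))     # 2
```

In this example six cells still match after realignment, but the overwritten cells sit on a diagonal, so no fully matching block covers more than two cells. The rectangular definition therefore reports 2 while the shape-agnostic count is 6.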
Taxonomy
Topics: Data Quality and Management · Time Series Analysis and Forecasting · Data Visualization and Analytics
