Oh That Looks Familiar: A Novel Similarity Measure for Spreadsheet Template Discovery
Anand Krishnakumar, Vengadesh Ravikumaran

TL;DR
This paper introduces a hybrid similarity measure for spreadsheets that combines semantic, data type, and spatial information, enabling more accurate template discovery and clustering.
Contribution
A novel hybrid distance metric for spreadsheets that improves template clustering by integrating semantic embeddings, data types, and spatial layouts.
Findings
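Achieved perfect template reconstruction on the FUSTE dataset, with an Adjusted Rand Index of 1.00.
Outperformed the graph-based Mondrian baseline in unsupervised clustering (Adjusted Rand Index of 1.00 versus 0.90).
Enables large-scale automated template discovery, supporting downstream applications such as retrieval-augmented generation.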
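For readers unfamiliar with the Adjusted Rand Index reported above, the following is a minimal pure-Python sketch of the standard pair-counting formula. It is not the authors' evaluation code; the function name and the toy labels in the usage lines are illustrative assumptions.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two flat clusterings of the same items."""
    n = len(labels_true)
    if n != len(labels_pred):
        raise ValueError("label lists must have the same length")
    if n < 2:
        return 1.0  # no pairs to compare; treat as trivially identical

    # Pairs of items that share a cluster in both partitions.
    both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs that share a cluster within each partition separately.
    same_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    same_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = same_true * same_pred / comb(n, 2)  # agreement expected by chance
    max_index = (same_true + same_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are one big cluster
    return (both - expected) / (max_index - expected)


# Hypothetical template labels: cluster IDs may be permuted without penalty.
perfect = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
mismatch = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

A score of 1.00 means the predicted clusters match the ground-truth template families exactly, up to relabeling; values near 0 indicate agreement no better than chance.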
Abstract
Traditional methods for identifying structurally similar spreadsheets fail to capture the spatial layouts and type patterns that define templates. To quantify spreadsheet similarity, we introduce a hybrid distance metric that combines semantic embeddings, data type information, and spatial positioning. Our method converts spreadsheets into cell-level embeddings and then aggregates them using techniques such as Chamfer and Hausdorff distances. Experiments across template families demonstrate superior unsupervised clustering performance compared to the graph-based Mondrian baseline, achieving perfect template reconstruction (Adjusted Rand Index of 1.00 versus 0.90) on the FUSTE dataset. Our approach facilitates large-scale automated template discovery, which in turn enables downstream applications such as retrieval-augmented generation over tabular…
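The pipeline described in the abstract (cell-level vectors aggregated by a set distance) can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: the `cell_vector` helper, the `DTYPES` vocabulary, the weights, and the toy semantic vectors are all hypothetical, and the Chamfer variant shown (sum of the two mean nearest-neighbor distances) is only one of several common definitions.

```python
import math

# Hypothetical data-type vocabulary; the paper's actual type set is not given here.
DTYPES = ["text", "number", "date", "empty"]


def cell_vector(row, col, dtype, semantic, n_rows, n_cols,
                w_sem=1.0, w_type=1.0, w_pos=1.0):
    """Concatenate weighted semantic, data-type, and position features (assumed scheme)."""
    one_hot = [1.0 if dtype == t else 0.0 for t in DTYPES]
    # Normalize coordinates to [0, 1] so sheets of different sizes are comparable.
    position = [row / max(n_rows - 1, 1), col / max(n_cols - 1, 1)]
    return ([w_sem * s for s in semantic]
            + [w_type * t for t in one_hot]
            + [w_pos * p for p in position])


def _nearest(point, others):
    """Euclidean distance from `point` to its closest neighbor in `others`."""
    return min(math.dist(point, other) for other in others)


def chamfer(a, b):
    """Symmetric Chamfer distance: mean nearest-neighbor distance in both directions."""
    a_to_b = sum(_nearest(p, b) for p in a) / len(a)
    b_to_a = sum(_nearest(q, a) for q in b) / len(b)
    return a_to_b + b_to_a


def hausdorff(a, b):
    """Symmetric Hausdorff distance: worst-case nearest-neighbor distance."""
    return max(max(_nearest(p, b) for p in a),
               max(_nearest(q, a) for q in b))


# Toy 2x2 sheets: sheet_b copies sheet_a's layout; sheet_c changes one cell's type.
sheet_a = [cell_vector(0, 0, "text", [1.0, 0.0], 2, 2),
           cell_vector(1, 1, "number", [0.0, 1.0], 2, 2)]
sheet_b = list(sheet_a)
sheet_c = [cell_vector(0, 0, "text", [1.0, 0.0], 2, 2),
           cell_vector(1, 1, "date", [0.0, 1.0], 2, 2)]
```

Chamfer averages over all cells and so tolerates a few outliers, whereas Hausdorff is driven by the single worst-matched cell; which behavior suits template discovery better likely depends on how noisy the sheets are.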
Taxonomy
Topics: Data Visualization and Analytics · Spreadsheets and End-User Computing · Data Quality and Management
