SAVeD: Semantic Aware Version Discovery
Artem Frenk, Roee Shraga

TL;DR
SAVeD is a contrastive learning framework that accurately detects dataset versions without relying on metadata, using semantic embeddings and data augmentations to distinguish similar and different datasets.
Contribution
The paper introduces SAVeD, a novel contrastive learning approach employing a custom transformer encoder and data augmentations for semantic-aware dataset version detection without metadata.
Findings
Achieves higher accuracy on unseen tables
Significantly improves separation scores
Outperforms prior state-of-the-art methods
Abstract
Our work introduces SAVeD (Semantically Aware Version Detection), a contrastive learning-based framework for identifying versions of structured datasets without relying on metadata, labels, or integration-based assumptions. SAVeD addresses a common challenge in data science: repeated labor caused by the difficulty of identifying similar prior work or transformations on datasets. SAVeD employs a modified SimCLR pipeline, generating augmented table views through random transformations (e.g., row deletion, encoding perturbations). These views are embedded via a custom transformer encoder and contrasted in latent space to optimize semantic similarity. The model learns to minimize distances between augmented views of the same dataset and to maximize distances between unrelated tables. We evaluate performance using validation accuracy and separation, defined respectively as the proportion of correctly classified…
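The contrastive objective described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the row-deletion helper `drop_rows`, the stand-in encoder `toy_encode` (column means and standard deviations in place of the custom transformer encoder), and the temperature value are all assumptions made for illustration. The loss shown is the standard NT-Xent loss from SimCLR; the paper's "modified" pipeline may differ.

```python
import numpy as np

def drop_rows(table, frac, rng):
    """Hypothetical augmentation: delete roughly `frac` of the rows, keeping at least one."""
    n = table.shape[0]
    keep = max(1, int(round(n * (1.0 - frac))))
    idx = np.sort(rng.choice(n, size=keep, replace=False))
    return table[idx]

def toy_encode(table):
    """Stand-in encoder (assumption): per-column mean and std as a fixed-size embedding."""
    return np.concatenate([table.mean(axis=0), table.std(axis=0)])

def nt_xent(z1, z2, temperature=0.5):
    """Standard NT-Xent loss; z1[i] and z2[i] are embeddings of two views of table i."""
    n = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity via unit vectors
    sim = (z @ z.T) / temperature
    np.fill_diagonal(sim, -np.inf)                     # a view is never its own negative
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])  # index of each row's positive
    m = sim.max(axis=1, keepdims=True)                 # stabilised log-sum-exp over each row
    logsumexp = m[:, 0] + np.log(np.exp(sim - m).sum(axis=1))
    return float(np.mean(logsumexp - sim[np.arange(2 * n), pos]))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Four unrelated numeric tables with the same number of columns (toy data).
    tables = [rng.normal(loc=i, size=(50, 3)) for i in range(4)]
    view_a = np.stack([toy_encode(drop_rows(t, 0.2, rng)) for t in tables])
    view_b = np.stack([toy_encode(drop_rows(t, 0.2, rng)) for t in tables])
    print("NT-Xent loss:", nt_xent(view_a, view_b))
```

Minimizing this loss pulls the two views of each table together and pushes views of different tables apart, which is the behavior the abstract describes. In a real system the encoder would be trained by gradient descent on this loss; the sketch only evaluates it.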
Taxonomy
Topics: Topic Modeling · Digital Humanities and Scholarship · Authorship Attribution and Profiling
