SAVeD: Semantic Aware Version Discovery

Artem Frenk; Roee Shraga

arXiv:2511.17298·cs.LG·January 13, 2026

Open Access

TL;DR

SAVeD is a contrastive learning framework that detects versions of structured datasets without relying on metadata. It embeds augmented views of each table and learns to place views of the same dataset close together and unrelated tables far apart.

Contribution

The paper introduces SAVeD, a novel contrastive learning approach employing a custom transformer encoder and data augmentations for semantic-aware dataset version detection without metadata.

Findings

01. Achieves higher accuracy on unseen tables

02. Significantly improves separation scores

03. Outperforms prior state-of-the-art methods

Abstract

Our work introduces SAVeD (Semantically Aware Version Detection), a contrastive learning-based framework for identifying versions of structured datasets without relying on metadata, labels, or integration-based assumptions. SAVeD addresses a common challenge in data science: repeated labor caused by the difficulty of identifying similar prior work or transformations on datasets. SAVeD employs a modified SimCLR pipeline, generating augmented table views through random transformations (e.g., row deletion, encoding perturbations). These views are embedded via a custom transformer encoder and contrasted in latent space to optimize semantic similarity. Our model learns to minimize distances between augmented views of the same dataset and maximize those between unrelated tables. We evaluate performance using validation accuracy and separation, defined respectively as the proportion of correctly classified…
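
The SimCLR-style objective described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes row deletion as the only augmentation, uses a column-mean placeholder where the paper uses a custom transformer encoder, and applies the standard NT-Xent loss from SimCLR rather than the paper's modified pipeline. The function names, the drop fraction, and the temperature are all hypothetical choices made for illustration.

```python
import numpy as np


def drop_rows(table, drop_frac, rng):
    """Augmentation (assumed): return a copy of `table` with about `drop_frac` of rows removed."""
    n_rows = table.shape[0]
    n_keep = max(1, int(round(n_rows * (1.0 - drop_frac))))
    keep = np.sort(rng.choice(n_rows, size=n_keep, replace=False))
    return table[keep]


def mean_pool_encoder(table):
    """Placeholder encoder: column means. Stands in for the paper's transformer encoder."""
    return table.mean(axis=0)


def nt_xent_loss(z_a, z_b, temperature=0.5):
    """Standard NT-Xent loss. z_a[i] and z_b[i] embed two views of the same table i."""
    n = z_a.shape[0]
    z = np.concatenate([z_a, z_b], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarity via unit vectors
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)  # a view is never its own negative or positive
    # Row i's positive is the other view of the same table.
    positives = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    # Numerically stable log-sum-exp over each row.
    row_max = sim.max(axis=1, keepdims=True)
    log_denominator = row_max[:, 0] + np.log(np.exp(sim - row_max).sum(axis=1))
    log_prob = sim[np.arange(2 * n), positives] - log_denominator
    return float(-log_prob.mean())


# Toy usage: three numeric tables, two augmented views each.
rng = np.random.default_rng(0)
tables = [rng.normal(loc=i, size=(50, 4)) for i in range(3)]
views_a = np.stack([mean_pool_encoder(drop_rows(t, 0.2, rng)) for t in tables])
views_b = np.stack([mean_pool_encoder(drop_rows(t, 0.2, rng)) for t in tables])
loss = nt_xent_loss(views_a, views_b)
print(f"NT-Xent loss on toy tables: {loss:.4f}")
```

Minimizing this loss pulls the two views of each table together and pushes views of different tables apart, which is the behavior the abstract describes. In a real training loop the placeholder encoder would be a learnable model and the loss would be computed in an automatic-differentiation framework.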

Taxonomy

Topics: Topic Modeling · Digital Humanities and Scholarship · Authorship Attribution and Profiling