SimClone: Detecting Tabular Data Clones using Value Similarity
Xu Yang, Gopi Krishnan Rajbahadur, Dayi Lin, Shaowei Wang, Zhen Ming (Jack) Jiang

TL;DR
SimClone is a novel method that detects data clones in tabular datasets using value similarity alone, without structural information. It outperforms existing methods and provides a visualization for locating clones precisely.
Contribution
This paper introduces SimClone, a new approach for detecting data clones in tabular data based solely on value similarity, with an integrated visualization tool.
Findings
Outperforms the state of the art by at least 20% in F1-score and AUC.
Achieves a Precision@10 of 0.80 in clone localization.
Effective on datasets that lack structural information.
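To make the Precision@10 figure concrete, here is a minimal sketch of the standard Precision@k metric: the fraction of the top k ranked candidates that are true hits. The function name `precision_at_k` and the idea of ranking candidate clone locations by score are assumptions for illustration; the summary does not say how the paper defines a candidate or a hit.

```python
def precision_at_k(ranked_items, relevant, k=10):
    """Standard Precision@k: share of the top-k ranked items that are relevant.

    ranked_items: candidates ordered from most to least likely (hypothetical input).
    relevant:     set of candidates known to be true clones (hypothetical input).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_items[:k]
    hits = sum(1 for item in top_k if item in relevant)
    # Dividing by k (not len(top_k)) penalizes rankings shorter than k.
    return hits / k
```

Under this definition, a Precision@10 of 0.80 means that 8 of the 10 highest-ranked candidates are correct.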
Abstract
Data clones are multiple copies of the same data across datasets. Data clones between datasets can cause problems such as difficulty managing data assets and data license violations when datasets containing clones are used to build AI software. However, detecting data clones is not trivial. Most prior studies in this area rely on structural information (e.g., font size, column headers) to detect data clones, but tabular datasets used to build AI software are typically stored without any structural information. In this paper, we propose SimClone, a novel method for data clone detection in tabular datasets that does not rely on structural information. SimClone uses value similarities to detect data clones. We also propose a visualization approach as part of SimClone to help locate the exact position of the cloned data between a…
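The idea of comparing tables by their values rather than by headers or formatting can be illustrated with a small sketch. This is not the SimClone algorithm, whose details are not given in this summary; it swaps in plain Jaccard overlap between column value sets as a stand-in similarity measure. All function names here are hypothetical.

```python
def column_value_sets(rows):
    """Turn a headerless table (list of rows) into one set of string values per column."""
    if not rows:
        return []
    n_cols = len(rows[0])
    return [{str(row[i]) for row in rows} for i in range(n_cols)]


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def best_column_matches(rows_a, rows_b):
    """For each column of table A, find the most value-similar column of table B.

    Returns a list of (column_index_a, column_index_b, similarity) tuples.
    Column order and headers are never consulted, only the cell values.
    """
    cols_a = column_value_sets(rows_a)
    cols_b = column_value_sets(rows_b)
    if not cols_b:
        return []
    matches = []
    for i, col_a in enumerate(cols_a):
        scores = [(jaccard(col_a, col_b), j) for j, col_b in enumerate(cols_b)]
        score, j = max(scores, key=lambda pair: pair[0])  # first best match wins ties
        matches.append((i, j, score))
    return matches


# Two small tables that share values but store their columns in a different order.
table_a = [[1, "x"], [2, "y"], [3, "z"]]
table_b = [["y", 2, "foo"], ["x", 1, "bar"], ["w", 9, "baz"]]
print(best_column_matches(table_a, table_b))
```

Because only cell values are compared, the reordered columns in `table_b` are still paired with their counterparts in `table_a`. A real detector would also need to handle numeric tolerance, row-level alignment, and large tables, none of which this sketch attempts.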
Taxonomy
Topics: Advanced Database Systems and Queries · Data Stream Mining Techniques · Data Mining Algorithms and Applications
