LSH methods for data deduplication in a Wikipedia artificial dataset
Juan Ciro, Daniel Galvez, Tim Schlippe, David Kanter

TL;DR
This paper evaluates several locality-sensitive hashing (LSH) models for detecting near-duplicate text on an artificial dataset built from English Wikipedia articles. Most models exceed an AUC of 0.9, and the best reaches 0.96.
Contribution
It introduces an artificial Wikipedia dataset for evaluating LSH models and compares their effectiveness in deduplication tasks.
Findings
Most models achieved AUC over 0.9
The best model reached an AUC of 0.96
Deduplication improves model training effectiveness
Abstract
This paper illustrates locality-sensitive hashing (LSH) models for the identification and removal of nearly redundant data in a text dataset. To evaluate the different models, we create an artificial dataset for data deduplication using English Wikipedia articles. Area-under-curve (AUC) values over 0.9 were observed for most models, with the best model reaching 0.96. Deduplication enables more effective model training by preventing the model from learning a distribution that differs from the real one as a result of the repeated data.
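As a rough illustration of the kind of pipeline the abstract describes, the sketch below scores text pairs with MinHash, one common LSH family, and summarizes how well the scores separate duplicates from non-duplicates with AUC. This summary does not say which LSH models the paper compares, so the choice of MinHash, the character 5-gram shingling, the 64 hash seeds, and all function names here are assumptions made for the example rather than the authors' method.

```python
import hashlib


def shingles(text, k=5):
    """Return the set of character k-grams of a lowercased, whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def _hash(seed, shingle):
    # Seeded 64-bit hash; hashlib keeps results stable across Python runs.
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text, num_perm=64, k=5):
    """One minimum hash value per seed (hypothetical defaults: 64 seeds, 5-grams)."""
    sh = shingles(text, k)
    return [min(_hash(seed, s) for s in sh) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def auc(scores, labels):
    """Probability that a duplicate pair outscores a non-duplicate pair (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy usage with made-up sentences: score one near-duplicate pair and one unrelated pair.
original = "Locality sensitive hashing maps similar documents to the same bucket."
near_dup = "Locality sensitive hashing maps similar documents into the same bucket!"
unrelated = "The river flooded the valley after three days of heavy rain."

sig = minhash_signature(original)
pair_scores = [
    estimated_jaccard(sig, minhash_signature(near_dup)),
    estimated_jaccard(sig, minhash_signature(unrelated)),
]
pair_labels = [1, 0]  # 1 = duplicate pair, 0 = non-duplicate pair
print(pair_scores, auc(pair_scores, pair_labels))
```

In a full deduplication run, signatures would usually be split into bands and bucketed so that only documents sharing a bucket are compared. The sketch skips that step and compares signatures directly to keep the scoring and evaluation logic visible.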
Taxonomy
Topics: Data Quality and Management · Big Data Technologies and Applications · Privacy-Preserving Technologies in Data
