LSH methods for data deduplication in a Wikipedia artificial dataset

Juan Ciro; Daniel Galvez; Tim Schlippe; David Kanter

arXiv:2112.11478 · cs.CL · December 23, 2021

TL;DR

This paper evaluates several locality-sensitive hashing (LSH) models for removing near-duplicate text, using an artificial dataset built from English Wikipedia articles. Most models exceed an AUC of 0.9, and the authors argue that deduplication makes model training more effective.

Contribution

It introduces an artificial Wikipedia dataset for evaluating LSH models and compares their effectiveness in deduplication tasks.

Findings

01. Most models achieved AUC over 0.9

02. The best model reached an AUC of 0.96

03. Deduplication improves model training effectiveness

Abstract

This paper illustrates locality-sensitive hashing (LSH) models for the identification and removal of nearly redundant data in a text dataset. To evaluate the different models, we create an artificial dataset for data deduplication using English Wikipedia articles. Area-Under-Curve (AUC) values over 0.9 were observed for most models, with the best model reaching 0.96. Deduplication enables more effective model training by preventing the model from learning a distribution that differs from the real one as a result of the repeated data.
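The abstract does not say which LSH families were compared, so the following is only a minimal sketch of one common choice, MinHash over character shingles, together with a pairwise AUC of the kind the abstract reports. The function names, the shingle length `k`, and the signature size `num_perm` are all assumptions made for illustration and are not taken from the paper.

```python
import hashlib


def shingles(text, k=5):
    """Return the set of character k-grams of a lowercased, whitespace-normalized text."""
    s = " ".join(text.lower().split())
    return {s[i:i + k] for i in range(max(1, len(s) - k + 1))}


def _hash64(token, seed):
    """Deterministic 64-bit hash of a token; each seed acts as a different hash function."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text, num_perm=64, k=5):
    """MinHash signature: for each seeded hash function, keep the minimum over all shingles."""
    grams = shingles(text, k)
    return [min(_hash64(g, seed) for g in grams) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def auc(scores, labels):
    """Probability that a duplicate pair (label 1) outscores a non-duplicate pair (label 0).

    Ties count as half a win. This is the quadratic pairwise definition, which is
    fine for small illustrative inputs but slow for large evaluation sets.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Under these assumptions, each candidate pair of articles would be scored with `estimated_jaccard`, and `auc` would summarize how well that score separates labeled duplicate pairs from non-duplicates. A full LSH pipeline would also band the signatures into buckets so that only likely duplicates are compared; that step is omitted here.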

Taxonomy

Topics: Data Quality and Management · Big Data Technologies and Applications · Privacy-Preserving Technologies in Data