Noise-Robust De-Duplication at Scale

Emily Silcock; Luca D'Amico-Wong; Jinglin Yang; Melissa Dell

arXiv:2210.04261 · cs.CL · April 25, 2024 · 1 citation

TL;DR

This paper introduces a noise-robust de-duplication method using neural models, demonstrating significant improvements over traditional N-gram approaches on a large, newly created dataset of historical news articles.

Contribution

The paper develops and evaluates neural bi-encoder and re-rank models for large-scale de-duplication, shows that they outperform traditional hashing and N-gram methods, and releases a new dataset and tools for the community.

Findings

01 Neural approaches outperform hashing and N-gram overlap methods.

02 Bi-encoder scales efficiently to 10 million articles on a single GPU.

03 Pre-trained models identify duplicates missed by traditional methods.
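
To make the embedding-based idea behind these findings concrete, here is a minimal sketch, not the paper's implementation. It assumes each article has already been mapped to a vector by some encoder; the toy vectors, the `threshold` value, and the union-find grouping step are all hypothetical choices for illustration. The brute-force pairwise loop is for clarity only and would not scale to millions of articles.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def cluster_duplicates(embeddings, threshold=0.9):
    """Group indices whose embeddings are linked by cosine >= threshold.

    The threshold and the single-linkage grouping are assumptions made
    for this sketch, not values or steps taken from the paper.
    """
    n = len(embeddings)
    parent = list(range(n))

    def find(i):
        # Walk up to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Brute-force comparison of every pair (quadratic, illustration only).
    for i in range(n):
        for j in range(i + 1, n):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy vectors standing in for encoder outputs (hypothetical data).
toy_embeddings = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0]]
clusters = cluster_duplicates(toy_embeddings)
```

In this toy input, articles 0 and 1 point in nearly the same direction, as do articles 2 and 3, so they end up in two groups.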

Abstract

Identifying near duplicates within large, noisy text corpora has a myriad of applications that range from de-duplicating training datasets, reducing privacy risk, and evaluating test set leakage, to identifying reproduced news articles and literature within large corpora. Across these diverse applications, the overwhelming majority of work relies on N-grams. Limited efforts have been made to evaluate how well N-gram methods perform, in part because it is unclear how one could create an unbiased evaluation dataset for a massive corpus. This study uses the unique timeliness of historical news wires to create a 27,210-document dataset, with 122,876 positive duplicate pairs, for studying noise-robust de-duplication. The time-sensitivity of news makes comprehensive hand labelling feasible, despite the massive overall size of the corpus, as duplicates occur within a narrow date range. The…
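
For readers unfamiliar with the N-gram baselines the abstract refers to, the following is a minimal sketch of one common variant: Jaccard similarity over character N-gram sets. The function names, the choice of character 5-grams, and the example sentences are assumptions for illustration and are not taken from the paper.

```python
def char_ngrams(text, n=5):
    """Return the set of character n-grams of a lowercased,
    whitespace-normalised text."""
    cleaned = " ".join(text.lower().split())
    if len(cleaned) < n:
        # Texts shorter than n contribute themselves as a single gram.
        return {cleaned} if cleaned else set()
    return {cleaned[i:i + n] for i in range(len(cleaned) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def ngram_similarity(doc_a, doc_b, n=5):
    """N-gram overlap score between two documents, in [0, 1]."""
    return jaccard(char_ngrams(doc_a, n), char_ngrams(doc_b, n))


# Hypothetical example texts: a clean article, an OCR-style noisy copy,
# and an unrelated article.
clean = "The senate passed the bill on Tuesday after a long debate."
noisy = "The senate passcd the bi11 on Tuesday aftcr a long debate."
other = "Heavy rain flooded the river valley overnight."

sim_noisy = ngram_similarity(clean, noisy)
sim_other = ngram_similarity(clean, other)
```

Each character-level error removes several N-grams from the overlap, so the score for the noisy copy drops below 1.0 even though a human reader would call the two texts duplicates. That sensitivity to noise is the kind of weakness the paper sets out to measure.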

Code & Models

Repositories

dell-research-harvard/NEWS-COPY (official)

Videos

Noise-Robust De-Duplication at Scale · SlidesLive

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Malware Detection Techniques

Methods: Test