Noise-Robust De-Duplication at Scale
Emily Silcock, Luca D'Amico-Wong, Jinglin Yang, Melissa Dell

TL;DR
This paper introduces a noise-robust de-duplication method using neural models, demonstrating significant improvements over traditional N-gram approaches on a large, newly created dataset of historical news articles.
Contribution
The paper develops and evaluates neural bi-encoder and re-ranking models for large-scale de-duplication, shows that they outperform traditional hashing and N-gram methods, and provides a new dataset and tools for the community.
Findings
Neural approaches outperform hashing and N-gram overlap methods.
Bi-encoder scales efficiently to 10 million articles on a single GPU.
Pre-trained models identify duplicates missed by traditional methods.
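To make the N-gram overlap baseline in these findings concrete, here is a minimal sketch of one common variant: Jaccard similarity over character N-grams. The function names, the choice of 5-character N-grams, and the example strings are assumptions for illustration only; they are not taken from the paper, and the paper's own baselines may be configured differently.

```python
def char_ngrams(text, n=5):
    """Return the set of character n-grams of a lowercased, whitespace-normalised string."""
    normalised = " ".join(text.lower().split())
    if len(normalised) < n:
        # Text shorter than n: treat the whole string as a single gram (or nothing if empty).
        return {normalised} if normalised else set()
    return {normalised[i:i + n] for i in range(len(normalised) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets are treated as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def ngram_similarity(text_a, text_b, n=5):
    """N-gram overlap score in [0, 1] between two texts."""
    return jaccard(char_ngrams(text_a, n), char_ngrams(text_b, n))


# Invented example strings (not from the paper's dataset).
clean = "The senate passed the farm bill late on Tuesday after a long debate."
noisy = "Tlie senate pased the farm bi1l late on Tuesdav after a 1ong debate."  # simulated OCR errors
unrelated = "Local bakery wins the county fair ribbon for its apple pie."

if __name__ == "__main__":
    print("clean vs clean:    ", ngram_similarity(clean, clean))
    print("clean vs noisy:    ", ngram_similarity(clean, noisy))
    print("clean vs unrelated:", ngram_similarity(clean, unrelated))
```

Each character-level error removes every N-gram that spans it, so the score for a noisy duplicate falls below that of an exact copy. This sensitivity to noise is what the neural approaches are intended to address.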
Abstract
Identifying near duplicates within large, noisy text corpora has a myriad of applications that range from de-duplicating training datasets, reducing privacy risk, and evaluating test set leakage, to identifying reproduced news articles and literature within large corpora. Across these diverse applications, the overwhelming majority of work relies on N-grams. Limited efforts have been made to evaluate how well N-gram methods perform, in part because it is unclear how one could create an unbiased evaluation dataset for a massive corpus. This study uses the unique timeliness of historical news wires to create a 27,210-document dataset, with 122,876 positive duplicate pairs, for studying noise-robust de-duplication. The time-sensitivity of news makes comprehensive hand labelling feasible, despite the massive overall size of the corpus, as duplicates occur within a narrow date range. The…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Malware Detection Techniques
Methods: Test
