LSHBloom: Memory-efficient, Extreme-scale Document Deduplication
Arham Khan, Robert Underwood, Carlo Siebenschuh, Yadu Babuji, Aswathy Ajith, Kyle Hippe, Ozan Gokdemir, Alexander Brace, Kyle Chard, Ian Foster

TL;DR
LSHBloom is a memory-efficient, scalable document deduplication method that maintains high accuracy while significantly reducing runtime and storage costs, enabling large-scale dataset curation for training large language models.
Contribution
The paper introduces LSHBloom, an extension of MinhashLSH that replaces the LSH index with Bloom filters, achieving comparable deduplication performance with much lower memory use and faster processing at extreme scale.
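The idea of pairing MinHash banding with Bloom filters can be illustrated with a small sketch. This is not the authors' implementation: the class names (`BloomFilter`, `BandedBloomDeduplicator`), the word-trigram shingling, the hash choices, and every parameter value (128 permutations, 16 bands, filter size) are assumptions made for illustration. The sketch keeps one Bloom filter per band and flags a document as a duplicate when any band of its MinHash signature is already present in the corresponding filter, so no index of document IDs is stored.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter using double hashing (illustrative, not tuned)."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: bytes):
        # Derive k bit positions from two 64-bit halves of a single digest.
        digest = hashlib.sha256(key).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key: bytes) -> None:
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key: bytes) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))


def shingles(text: str, n: int = 3) -> set:
    """Word n-grams of the text (assumed shingling scheme)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def minhash_signature(text: str, num_perm: int) -> list:
    """MinHash signature: for each salted hash function, the minimum over shingles."""
    items = [s.encode() for s in shingles(text)]
    signature = []
    for i in range(num_perm):
        salt = i.to_bytes(8, "big")
        signature.append(min(
            int.from_bytes(hashlib.blake2b(s, digest_size=8, salt=salt).digest(), "big")
            for s in items
        ))
    return signature


class BandedBloomDeduplicator:
    """One Bloom filter per LSH band, standing in for a per-band hash table."""

    def __init__(self, num_perm=128, bands=16, bits_per_band=1 << 20, num_hashes=5):
        if num_perm % bands != 0:
            raise ValueError("num_perm must be divisible by bands")
        self.num_perm = num_perm
        self.bands = bands
        self.rows = num_perm // bands
        self.filters = [BloomFilter(bits_per_band, num_hashes) for _ in range(bands)]

    def check_and_insert(self, text: str) -> bool:
        """Return True if the text collides with an earlier one in any band."""
        sig = minhash_signature(text, self.num_perm)
        keys = [
            repr(sig[b * self.rows:(b + 1) * self.rows]).encode()
            for b in range(self.bands)
        ]
        is_duplicate = any(k in f for k, f in zip(keys, self.filters))
        if not is_duplicate:
            # Assumption: only documents that are kept are recorded.
            for k, f in zip(keys, self.filters):
                f.add(k)
        return is_duplicate


if __name__ == "__main__":
    dedup = BandedBloomDeduplicator()
    doc = " ".join(f"token{i}" for i in range(60))
    print(dedup.check_and_insert(doc))  # first sighting of this document
    print(dedup.check_and_insert(doc))  # exact repeat of the same document
```

In this sketch the memory footprint is fixed by `bands * bits_per_band` regardless of how many documents are processed, which is the kind of trade-off the summary describes: Bloom filters can report false positives but never false negatives for inserted keys, so filter size and hash count would need to be chosen against the expected corpus size.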
Findings
Achieves state-of-the-art deduplication with minimal false positives.
Uses 18 times less disk space than MinhashLSH.
Operates 12 times faster than MinhashLSH on large datasets.
Abstract
Contemporary large language model (LLM) training pipelines require the assembly of internet-scale databases full of text data from a variety of sources (e.g., web, academic, and publishers). Preprocessing these datasets via deduplication -- detecting and eliminating additional instances of the same content -- is a major focus for assembling and curating training datasets for LLMs. Left unchecked, duplicates in the training dataset increase training costs and lead to undesirable properties such as memorization in trained models or cheating on evaluation. Unfortunately, contemporary approaches to document-level deduplication are either unreliable at accurately identifying duplicate documents or extremely expensive in terms of both runtime and memory. We propose LSHBloom, an extension to MinhashLSH, which replaces the expensive LSHIndex with lightweight Bloom filters. LSHBloom demonstrates…
Taxonomy
Topics: Cloud Data Security Solutions · Advanced Data Storage Technologies
Methods: BLOOM · Focus
