A Fast Randomized Algorithm for Massive Text Normalization
Nan Jiang, Chen Luo, Vihan Lakshman, Yesh Dattatreya, Yexiang Xue

TL;DR
FLAN is a scalable randomized algorithm that efficiently cleans and normalizes massive text datasets by leveraging Locality Sensitive Hashing and a novel stabilization process, improving over existing methods without needing annotated data.
Contribution
The paper introduces FLAN, a novel scalable randomized text normalization algorithm that uses LSH and a stabilization process, eliminating the need for supervised learning or additional features.
Findings
FLAN is more efficient than existing approaches.
FLAN effectively handles massive datasets with high accuracy.
Theoretical bounds demonstrate the robustness of FLAN.
Abstract
Many popular machine learning techniques in natural language processing and data mining rely heavily on high-quality text sources. However, real-world text datasets contain a significant number of spelling errors and improperly punctuated variants, on which the performance of these models quickly deteriorates. Moreover, real-world, web-scale datasets contain hundreds of millions or even billions of lines of text, over which existing text-cleaning tools are prohibitively expensive to run and may require overhead to learn the corrections. In this paper, we present FLAN, a scalable randomized algorithm to clean and canonicalize massive text data. Our algorithm relies on the Jaccard similarity between words to suggest correction results. We efficiently handle the pairwise word-to-word comparisons via Locality Sensitive Hashing (LSH). We also propose a novel stabilization process…
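The idea of pairing Jaccard similarity with LSH can be illustrated with a minimal sketch. This is not the authors' implementation: representing each word as a set of padded character bigrams, using MinHash with banding as the LSH scheme, and every function name and parameter value below are assumptions made for illustration. The stabilization process is not covered, since the abstract above is cut off before describing it.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def char_ngrams(word, n=2):
    # Assumption: a word is represented as its set of character bigrams,
    # lowercased and padded with boundary markers.
    padded = f"^{word.lower()}$"
    return frozenset(padded[i:i + n] for i in range(len(padded) - n + 1))


def jaccard(a, b):
    # |A ∩ B| / |A ∪ B|; two empty sets are treated as identical.
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def minhash_signature(grams, num_hashes=32):
    # One seeded hash per signature position; keep the minimum over the set.
    def h(seed, gram):
        digest = hashlib.sha1(f"{seed}:{gram}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")
    return tuple(min(h(seed, g) for g in grams) for seed in range(num_hashes))


def candidate_pairs(words, num_hashes=32, bands=8):
    # Banding: words whose signatures agree on every row of at least one band
    # land in the same bucket and become candidates, avoiding all-pairs work.
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for word in set(words):
        sig = minhash_signature(char_ngrams(word), num_hashes)
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(word)
    pairs = set()
    for bucket in buckets.values():
        pairs.update(combinations(sorted(bucket), 2))
    return pairs


def similar_pairs(words, threshold=0.5):
    # Verify each candidate with the exact Jaccard similarity.
    return {
        (u, v) for u, v in candidate_pairs(words)
        if jaccard(char_ngrams(u), char_ngrams(v)) >= threshold
    }
```

For example, under the bigram assumption, "color" and "colour" share 5 of 8 distinct bigrams, so `jaccard(char_ngrams("color"), char_ngrams("colour"))` is 0.625. Whether such a pair ends up in the same LSH bucket is probabilistic and depends on the number of hashes and bands chosen.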
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Topic Modeling · Advanced Image and Video Retrieval Techniques
