Replace or Retrieve Keywords In Documents at Scale
Vikash Singh

TL;DR
The paper introduces FlashText, an algorithm for keyword search and replacement that makes a single pass over a document, runs in time linear in the document length, and is much faster than regex for large keyword dictionaries. Unlike Aho-Corasick, it matches only complete words rather than substrings.
Contribution
The paper presents FlashText, a keyword search-and-replace algorithm whose running time does not depend on the number of keywords. It matches only complete words and, when keywords overlap, prefers the longest match.
Findings
FlashText operates in O(N) time, independent of dictionary size.
It is much faster than regex for keyword search and replacement, and differs from Aho-Corasick in that it does not match substrings.
An open-source implementation is available on GitHub.
Abstract
In this paper we introduce the FlashText algorithm for replacing or finding keywords in a given text. FlashText can search for or replace keywords in one pass over a document. The time complexity of this algorithm does not depend on the number of terms being searched or replaced. For a document of size N (characters) and a dictionary of M keywords, the time complexity is O(N). This algorithm is much faster than regex, because regex time complexity is O(M × N). It also differs from the Aho-Corasick algorithm in that it does not match substrings. FlashText is designed to match only complete words (words with boundary characters on both sides). For an input dictionary of {Apple}, this algorithm won't match it to 'I like Pineapple'. This algorithm is also designed to go for the longest match first. For an input dictionary {Machine, Learning, Machine learning} on a string 'I like…
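The two behaviours described in the abstract, whole-word matching and longest-match preference, can be illustrated with a small character trie. The following is a minimal sketch, not the author's implementation: the names `build_trie`, `extract_keywords`, `WORD_CHARS`, and `_END` are hypothetical, the definition of a word character (ASCII letters, digits, underscore) is an assumption, and the example sentences in the usage note are illustrative rather than taken from the paper.

```python
import string

_END = "_end_"  # sentinel key marking "a keyword ends here" (hypothetical name)
# Assumption: anything outside this set counts as a word boundary.
WORD_CHARS = set(string.ascii_letters + string.digits + "_")


def build_trie(keywords):
    """Build a nested-dict trie from {keyword: canonical_name}, case-insensitively."""
    root = {}
    for keyword, clean_name in keywords.items():
        node = root
        for ch in keyword.lower():
            node = node.setdefault(ch, {})
        node[_END] = clean_name
    return root


def extract_keywords(text, trie):
    """Return canonical names of keywords found as whole words, longest match first."""
    found = []
    i, n = 0, len(text)
    while i < n:
        # Only try to match at the start of a word.
        if text[i] not in WORD_CHARS or (i > 0 and text[i - 1] in WORD_CHARS):
            i += 1
            continue
        node, j = trie, i
        best_name, best_end = None, i
        # Walk the trie as far as the text allows, remembering the longest
        # keyword that ends on a word boundary.
        while j < n and text[j].lower() in node:
            node = node[text[j].lower()]
            j += 1
            if _END in node and (j == n or text[j] not in WORD_CHARS):
                best_name, best_end = node[_END], j
        if best_name is not None:
            found.append(best_name)
            i = best_end
        else:
            # No keyword starts here: skip the rest of the current word.
            while i < n and text[i] in WORD_CHARS:
                i += 1
    return found
```

With the dictionary {Machine, Learning, Machine learning}, `extract_keywords("Machine learning is fun", trie)` returns only the longest match, and a dictionary of {Apple} finds nothing in 'I like Pineapple' because the candidate match does not start on a word boundary. In this sketch the work done at each starting position is bounded by the length of the longest keyword, not by the number of keywords, which is the property the abstract emphasises.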
Taxonomy
Topics: Algorithms and Data Compression · Semigroups and Automata Theory · DNA and Biological Computing
