TL;DR
This paper introduces lightweight, fixed-size fingerprints for strings that enable fast approximate keyword matching using bitwise operations, significantly speeding up similarity checks for small edit distances.
Contribution
The authors propose a fingerprinting method for error-tolerant string matching. Fingerprints are compared with a constant number of bitwise operations, so many string pairs can be decided without an explicit distance verification.
Findings
Over 2.5x speedup for Hamming distance at k=1
Over 10x speedup for Levenshtein distance at k=1
Effective on synthetic and real-world data
Abstract
We aim to speed up approximate keyword matching by storing a lightweight, fixed-size block of data for each string, called a fingerprint. These work in a similar way to hash values; however, they can also be used for matching with errors. They store information regarding symbol occurrences using individual bits, and they can be compared against each other with a constant number of bitwise operations. In this way, certain strings can be deduced to be at least within the distance k from each other (using Hamming or Levenshtein distance) without performing an explicit verification. We show experimentally that for a preprocessed collection of strings, fingerprints can provide substantial speedups for k = 1, namely over 2.5 times for the Hamming distance and over 10 times for the Levenshtein distance. Tests were conducted on synthetic and real-world English and URL data.
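The idea of an occurrence-based fingerprint can be illustrated with a minimal Python sketch. This is not the authors' implementation: the tracked letter set `TRACKED`, the 16-bit width, and the helper names `fingerprint`, `may_match`, `hamming`, and `search` are all assumptions made for illustration. The sketch relies on one observation: a single substitution, insertion, or deletion can change at most two occurrence bits, so two strings whose fingerprints differ in more than 2k bits cannot be within distance k and can be skipped without verification.

```python
# Hypothetical choice: track occurrences of 16 common English letters,
# one bit per letter. Letters outside this set are ignored.
TRACKED = "etaoinshrdlcumwf"
BIT = {ch: 1 << i for i, ch in enumerate(TRACKED)}


def fingerprint(s: str) -> int:
    """Return a 16-bit occurrence fingerprint: bit i is set if TRACKED[i] occurs in s."""
    fp = 0
    for ch in s:
        fp |= BIT.get(ch, 0)
    return fp


def may_match(fp_a: int, fp_b: int, k: int) -> bool:
    """Filter step: False means the strings are certainly more than k edits apart.

    Each edit operation flips at most 2 occurrence bits, so more than
    2 * k differing bits rules out a match within distance k.
    """
    differing_bits = bin(fp_a ^ fp_b).count("1")
    return differing_bits <= 2 * k


def hamming(a: str, b: str) -> int:
    """Explicit verification for equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def search(words, fps, pattern, k):
    """Return words within Hamming distance k of pattern, verifying only unfiltered candidates."""
    pattern_fp = fingerprint(pattern)
    return [
        w
        for w, fp in zip(words, fps)
        if len(w) == len(pattern)
        and may_match(fp, pattern_fp, k)
        and hamming(w, pattern) <= k
    ]


# Preprocessing: fingerprints are computed once for the whole collection.
words = ["test", "tent", "dogs", "best"]
fps = [fingerprint(w) for w in words]
matches = search(words, fps, "text", 1)
```

In this example "dogs" is rejected by the bitwise filter alone, while "best" passes the filter and is rejected only by the explicit Hamming check. The filter is one-sided: it never discards a true match, but it can let non-matches through to verification.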
