Lightweight Fingerprints for Fast Approximate Keyword Matching Using Bitwise Operations

Aleksander Cisłak; Szymon Grabowski

arXiv:1711.08475·cs.DS·November 27, 2017

TL;DR

This paper introduces lightweight, fixed-size fingerprints for strings that enable fast approximate keyword matching using bitwise operations, speeding up matching under the Hamming and Levenshtein distances, with results reported for k = 1.

Contribution

The authors propose a fingerprinting method for error-tolerant keyword matching. Fingerprints record symbol occurrences in individual bits and are compared with a constant number of bitwise operations, so some string pairs can be decided without explicit verification.

Findings

01 Over 2.5x speedup for the Hamming distance at k = 1

02 Over 10x speedup for the Levenshtein distance at k = 1

03 Tested on synthetic data and on real-world English and URL data

Abstract

We aim to speed up approximate keyword matching by storing a lightweight, fixed-size block of data for each string, called a fingerprint. These work in a similar way to hash values; however, they can also be used for matching with errors. They store information regarding symbol occurrences using individual bits, and they can be compared against each other with a constant number of bitwise operations. In this way, certain strings can be deduced to be at least within the distance $k$ from each other (using Hamming or Levenshtein distance) without performing an explicit verification. We show experimentally that for a preprocessed collection of strings, fingerprints can provide substantial speedups for $k = 1$, namely over $2.5$ times for the Hamming distance and over $10$ times for the Levenshtein distance. Tests were conducted on synthetic and real-world English and URL data.
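
The occurrence-bit idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the tracked letter set `TRACKED`, the 16-bit width, the function names, and the rejection threshold of `2 * k` differing bits are all assumptions made for illustration. The threshold follows from a simple observation: one substitution can change at most two occurrence bits, and one insertion or deletion at most one, so two strings within distance k cannot differ in more than 2k bits.

```python
# Illustrative sketch only; names and the letter set are hypothetical.
# Each tracked letter gets one bit that is set if the letter occurs in the word.
TRACKED = "etaoinshrdlucmfw"  # assumed choice of 16 letters -> 16-bit fingerprint
BIT = {ch: 1 << i for i, ch in enumerate(TRACKED)}


def fingerprint(word: str) -> int:
    """Build the occurrence fingerprint; untracked symbols are ignored."""
    fp = 0
    for ch in word:
        fp |= BIT.get(ch, 0)
    return fp


def may_match(fp_a: int, fp_b: int, k: int) -> bool:
    """Filter: more than 2*k differing bits means the distance exceeds k."""
    return bin(fp_a ^ fp_b).count("1") <= 2 * k


def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance, used for verification."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def search(words, fingerprints, query, k):
    """Return words within Levenshtein distance k, verifying only survivors."""
    qfp = fingerprint(query)
    return [w for w, fp in zip(words, fingerprints)
            if may_match(fp, qfp, k) and levenshtein(w, query) <= k]


words = ["hello", "help", "world", "hallo"]
fingerprints = [fingerprint(w) for w in words]  # preprocessing step
matches = search(words, fingerprints, "hello", 1)
```

In this sketch the fingerprints are computed once for the collection, and each query pays one XOR and one population count per word before any full distance computation. The filter never discards a true match; it only skips verification for pairs whose fingerprints differ too much.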

Code & Models

Repositories

MrAlexSee/Fingerprints