Improved Fast Similarity Search in Dictionaries

Daniel Karch; Dennis Luxen; Peter Sanders

arXiv:1008.1191·cs.IR·August 19, 2010

TL;DR

This paper engineers an algorithm and index data structures for approximate dictionary matching: retrieving, within microseconds, all words in a large word list that lie within a fixed edit distance $d$ of a query word.

Contribution

The authors present index-based data structures for fault-tolerant dictionary queries, plus a generalization of the method that significantly reduces memory consumption and preprocessing time while leaving query times virtually unaffected.

Findings

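01 · Supports fault-tolerant (approximate) queries through a precomputed index

02 · A generalization of the method significantly reduces memory consumption and preprocessing time, with query times virtually unaffected

03 · Achieves microsecond query times on lists of hundreds of thousands of words for reasonable distances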

Abstract

We engineer an algorithm to solve the approximate dictionary matching problem. Given a list of words $W$, a maximum distance $d$ fixed at preprocessing time, and a query word $q$, we would like to retrieve all words from $W$ that can be transformed into $q$ with $d$ or fewer edit operations. We present data structures that support fault-tolerant queries by generating an index. On top of that, we present a generalization of the method that reduces memory consumption and preprocessing time significantly. At the same time, running times of queries are virtually unaffected. We are able to match in lists of hundreds of thousands of words and beyond within microseconds for reasonable distances.
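
To make the problem statement concrete, the following is a minimal sketch of one well-known way to answer such queries from a precomputed index: a deletion-neighborhood index. The abstract does not spell out the data structures, so this is an illustrative assumption and not the authors' implementation. All names (`edit_distance`, `deletion_neighborhood`, `build_index`, `query`) are hypothetical. The idea is that two words within edit distance $d$ always share a string reachable by deleting at most $d$ characters from each, so the index lookup yields a candidate superset that is then verified with an exact edit-distance computation.

```python
from collections import defaultdict
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance (insert, delete, substitute), row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def deletion_neighborhood(word, d):
    """All strings obtained by deleting at most d characters from word."""
    result = set()
    n = len(word)
    for k in range(min(d, n) + 1):
        for positions in combinations(range(n), k):
            drop = set(positions)
            result.add("".join(c for i, c in enumerate(word) if i not in drop))
    return result


def build_index(words, d):
    """Preprocessing: map every deletion variant to the words producing it.

    d is fixed here, mirroring the 'fixed at preprocessing time' condition.
    """
    index = defaultdict(set)
    for w in words:
        for key in deletion_neighborhood(w, d):
            index[key].add(w)
    return index


def query(index, q, d):
    """Return all indexed words within edit distance d of q, sorted."""
    candidates = set()
    for key in deletion_neighborhood(q, d):
        candidates |= index.get(key, set())
    # The shared-variant test over-approximates, so verify each candidate.
    return sorted(w for w in candidates if edit_distance(w, q) <= d)


words = ["house", "mouse", "horse", "hose", "moose"]
index = build_index(words, 1)
print(query(index, "house", 1))  # ['horse', 'hose', 'house', 'mouse']
```

In this naive form the index stores every deletion variant of every word, so its size grows quickly with $d$ and word length; that cost is the kind of memory and preprocessing overhead the abstract's generalization is said to reduce.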

Taxonomy

Topics: Data Management and Algorithms · Algorithms and Data Compression · Video Analysis and Summarization