A Fast Heuristic for Exact String Matching

Srikrishnan Divakaran

arXiv:1512.03512·cs.DS·December 14, 2015

TL;DR

This paper introduces a randomized heuristic for exact string matching that preprocesses the pattern to identify a rarely occurring ("sparse") substring and uses it to speed up the search. The expected search time improves as that substring gets longer, up to the alphabet size.

Contribution

The paper proposes a randomized heuristic that preprocesses the pattern $P$ to identify $\mathit{sparse}(P)$, a rarely occurring substring of $P$, and then uses that substring to find all occurrences of $P$ in the text efficiently.

Findings

01

Preprocessing time is $O(n\delta)$.

02

Expected search time is $O\left(\frac{m}{\min(|\mathit{sparse}(P)|, \Delta)}\right)$.

03

Expected sparse substring length is $\Omega\left(\Delta \log\left(\frac{2\Delta}{2\Delta - \delta}\right)\right)$ for random patterns.
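To get a feel for how these expressions scale, the following Python sketch evaluates the quantities inside the asymptotic bounds, with $m$ the text length, $\Delta$ the alphabet size, and $\delta$ the number of distinct characters in the pattern, as defined in the abstract. The function names are hypothetical. The hidden constants are dropped and the natural logarithm is an assumption, so the outputs show growth trends only, not predicted running times.

```python
import math


def sparse_length_scale(Delta: int, delta: int) -> float:
    """Evaluate Delta * ln(2*Delta / (2*Delta - delta)).

    This is the expression inside the Omega(.) bound on E[|sparse(P)|].
    Assumption: natural log; the bound hides the constant factor and base.
    """
    if not 1 <= delta <= Delta:
        raise ValueError("need 1 <= delta <= Delta")
    return Delta * math.log(2 * Delta / (2 * Delta - delta))


def search_cost_scale(m: int, sparse_len: float, Delta: int) -> float:
    """Evaluate m / min(|sparse(P)|, Delta), the expression inside the O(.) bound."""
    if sparse_len <= 0 or Delta <= 0:
        raise ValueError("sparse_len and Delta must be positive")
    return m / min(sparse_len, Delta)
```

For example, when the pattern uses every character of the alphabet ($\delta = \Delta$), `sparse_length_scale` reduces to $\Delta \ln 2$, and `search_cost_scale(1000, 10, 4)` is capped by the alphabet size and evaluates to `250.0`.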

Abstract

Given a pattern string $P$ of length $n$ consisting of $\delta$ distinct characters and a query string $T$ of length $m$, where the characters of $P$ and $T$ are drawn from an alphabet $\Sigma$ of size $\Delta$, the exact string matching problem consists of finding all occurrences of $P$ in $T$. For this problem, we present a randomized heuristic that in $O(n\delta)$ time preprocesses $P$ to identify $\mathit{sparse}(P)$, a rarely occurring substring of $P$, and then uses it to find all occurrences of $P$ in $T$ efficiently. This heuristic has an expected search time of $O\left(\frac{m}{\min(|\mathit{sparse}(P)|, \Delta)}\right)$, where $|\mathit{sparse}(P)|$ is at least $\delta$. We also show that for a pattern string $P$ whose characters are chosen uniformly at random from an alphabet of size $\Delta$, $E[|\mathit{sparse}(P)|]$ is $\Omega\left(\Delta \log\left(\frac{2\Delta}{2\Delta - \delta}\right)\right)$.
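The abstract describes the sparse substring only as "a rarely occurring substring of $P$", so the Python sketch below rests on an assumed reading and should not be taken as the paper's algorithm. It takes the sparse substring to be the longest substring of $P$ whose first character does not recur inside it. Under this reading the length is always at least the number of distinct characters in $P$, which agrees with the abstract's lower bound of $\delta$. The search samples the text once per window length, scans backward for the anchor character, and verifies each candidate. The names `sparse_window` and `find_all` are hypothetical, and the sketch makes no claim to match the stated expected search time.

```python
def sparse_window(pattern: str) -> tuple[int, int]:
    """Return (start, length) of the longest substring of `pattern` whose
    first character does not recur inside it.

    ASSUMPTION: this is one plausible reading of sparse(P); the abstract
    does not give the exact definition.
    """
    n = len(pattern)
    next_pos: dict[str, int] = {}  # char -> nearest index to the right
    best_start, best_len = 0, 0
    for s in range(n - 1, -1, -1):
        # The window starting at s ends just before pattern[s] recurs.
        length = next_pos.get(pattern[s], n) - s
        if length >= best_len:  # >= keeps the leftmost start on ties
            best_start, best_len = s, length
        next_pos[pattern[s]] = s
    return best_start, best_len


def find_all(pattern: str, text: str) -> list[int]:
    """Return all start indices of `pattern` in `text` (filter, then verify)."""
    n, m = len(pattern), len(text)
    if n == 0 or n > m:
        return []
    start, width = sparse_window(pattern)
    anchor = pattern[start]
    hits = []
    # Any occurrence aligns the window with `width` consecutive text
    # positions, and exactly one of them is a sampled position j.
    for j in range(width - 1, m, width):
        # Scan back at most `width` characters for the nearest anchor.
        for q in range(j, j - width, -1):
            if text[q] == anchor:
                i = q - start  # candidate alignment of the whole pattern
                if i >= 0 and text.startswith(pattern, i):
                    hits.append(i)
                break  # the anchor occurs only once inside a true window
    return hits
```

With the pattern `"abcab"` the assumed window is `"abc"` (start 0, length 3), and `find_all("abcab", "xxabcabcabxx")` returns `[2, 5]`. The backward scan and the verification step keep the worst case linear in the text length or worse, so this illustrates the filtering idea only.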
