Optimizing Exact String Matching via Statistical Anchoring

Omar Garraoui

arXiv:2601.03271·cs.DS·January 13, 2026



TL;DR

This paper introduces statistical anchoring for exact string matching: the pattern is preprocessed to find its statistically least frequent character (the "anchor"), and each candidate window is checked at that position first, so non-matching windows in natural language text are rejected with fewer comparisons.

Contribution

The paper enhances the Boyer-Moore-Horspool algorithm by using linguistic character-frequency statistics to choose an anchor character in the pattern, which is verified first in each candidate window.

Findings

01 Significant reduction in comparison counts during search

02 Improved matching speed without added algorithm complexity

03 Effective for natural language text processing

Abstract

In this work, we propose an enhancement to the Boyer-Moore-Horspool algorithm tailored for natural language text. The approach involves preprocessing the search pattern to identify its statistically least frequent character, referred to as the "anchor." During the search, verification is first performed at this high-entropy position, allowing the algorithm to quickly discard non-matching windows. This fail-fast strategy reduces unnecessary comparisons, improving overall efficiency. Our implementation shows that incorporating basic linguistic statistics into classical pattern-matching techniques can boost performance without adding complexity to the shift heuristics.
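The idea in the abstract can be illustrated with a minimal Python sketch. This is not the author's implementation. The frequency table `ENGLISH_FREQ` holds approximate, illustrative values, and the names `pick_anchor` and `anchored_horspool` are hypothetical. The sketch assumes that characters missing from the table count as rare, and it leaves the standard Horspool shift table unchanged, as the abstract describes.

```python
from typing import Dict, List

# ASSUMPTION: approximate relative frequencies (percent) of characters in
# English text. Illustrative values only; the paper's statistics may differ.
ENGLISH_FREQ: Dict[str, float] = {
    " ": 18.0, "e": 12.7, "t": 9.1, "a": 8.2, "o": 7.5, "i": 7.0,
    "n": 6.7, "s": 6.3, "h": 6.1, "r": 6.0, "d": 4.3, "l": 4.0,
    "c": 2.8, "u": 2.8, "m": 2.4, "w": 2.4, "f": 2.2, "g": 2.0,
    "y": 2.0, "p": 1.9, "b": 1.5, "v": 1.0, "k": 0.8, "j": 0.15,
    "x": 0.15, "q": 0.10, "z": 0.07,
}


def pick_anchor(pattern: str) -> int:
    """Return the index of the pattern character with the lowest assumed frequency."""
    # Characters absent from the table are treated as rare (frequency 0.0).
    return min(
        range(len(pattern)),
        key=lambda i: ENGLISH_FREQ.get(pattern[i].lower(), 0.0),
    )


def anchored_horspool(text: str, pattern: str) -> List[int]:
    """Return all start indices of pattern in text, checking the anchor first."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []

    # Standard Horspool bad-character table: distance from the rightmost
    # occurrence of each character (excluding the last position) to the end.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}

    anchor = pick_anchor(pattern)
    anchor_char = pattern[anchor]

    matches: List[int] = []
    pos = 0
    while pos <= n - m:
        # Fail fast: compare the rare anchor character before anything else.
        if text[pos + anchor] == anchor_char and text[pos:pos + m] == pattern:
            matches.append(pos)
        # The shift depends only on the last character of the current window.
        pos += shift.get(text[pos + m - 1], m)
    return matches
```

For example, `pick_anchor("lazy")` selects the index of `z`, so a window is compared in full only when it has a `z` in the third position. Because the shift table is the ordinary Horspool table, the anchor affects only the order of verification, not which windows are visited.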


Taxonomy

Topics: Algorithms and Data Compression · Natural Language Processing Techniques · Network Packet Processing and Optimization