On Tuning the Bad-Character Rule: the Worst-Character Rule

Domenico Cantone; Simone Faro

arXiv:1012.1338·cs.DS·December 8, 2010·1 cites

Open Access

TL;DR

This paper introduces the worst-character rule, a variation of the bad-character heuristic from the Boyer-Moore string matching algorithm. The rule picks the position, relative to the current shift, that yields the largest average advancement given the character distribution of the text.

Contribution

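It proposes the worst-character rule, which selects the position relative to the current shift that maximizes the average advancement. The reported experiments show very good results, especially for long patterns or small alphabets in random texts and for natural-language texts.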

Findings

01 Achieves better average shift in random texts with long patterns

02 Performs well on natural language texts

03 Effective for small alphabet sizes

Abstract

In this note we present the worst-character rule, an efficient variation of the bad-character heuristic for the exact string matching problem, first introduced in the well-known Boyer-Moore algorithm. Our proposed rule selects a position relative to the current shift which yields the largest average advancement, according to the character distribution in the text. Experimental results show that the worst-character rule achieves very good results, especially in the case of long patterns or small alphabets in random texts and in the case of texts in natural languages.
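The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a Horspool-style search in which the shift is read from the text character aligned with one fixed pattern position q, and it picks q to maximize the expected shift under the character frequencies. The function names (`shift_table`, `expected_shift`, `worst_position`, `search`), the restriction of q to positions inside the window, and the choice to estimate frequencies by counting characters in the text itself are all assumptions made for illustration.

```python
from collections import Counter


def shift_table(pattern, q):
    """Safe shifts when the text character aligned with pattern[q] is inspected.

    For a character c, the shift is the distance from q to the rightmost
    occurrence of c in pattern[0:q]. Characters missing from the table
    get the default shift q + 1 (handled by the callers).
    """
    table = {}
    for i in range(q):
        # Later (more rightward) occurrences overwrite with smaller shifts.
        table[pattern[i]] = q - i
    return table


def expected_shift(pattern, q, freq):
    """Average shift for position q under character frequencies `freq`."""
    table = shift_table(pattern, q)
    return sum(f * table.get(c, q + 1) for c, f in freq.items())


def worst_position(pattern, freq):
    """Position whose inspected character gives the largest expected shift."""
    return max(range(len(pattern)), key=lambda q: expected_shift(pattern, q, freq))


def search(pattern, text):
    """Return all start indices of `pattern` in `text`."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Assumption: frequencies are estimated from the text being searched.
    freq = {c: k / n for c, k in Counter(text).items()}
    q = worst_position(pattern, freq)
    table = shift_table(pattern, q)
    hits = []
    s = 0
    while s <= n - m:
        if text[s:s + m] == pattern:
            hits.append(s)
        # Shift according to the character at the chosen position.
        s += table.get(text[s + q], q + 1)
    return hits
```

With q fixed to the last pattern position, `shift_table` reduces to the familiar Horspool table. The sketch differs only in choosing q from the frequency estimate, which costs one extra pass over the pattern per candidate position during preprocessing.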

Taxonomy

Topics: Algorithms and Data Compression · semigroups and automata theory · Natural Language Processing Techniques