On Tuning the Bad-Character Rule: the Worst-Character Rule
Domenico Cantone, Simone Faro

TL;DR
This paper introduces the worst-character rule, a variation of the bad-character heuristic from the Boyer-Moore string matching algorithm that uses the character distribution of the text to decide which text position drives the shift.
Contribution
It proposes the worst-character rule, which selects the position, relative to the current shift, that yields the largest average advancement. The rule improves string matching performance especially for long patterns and small alphabets.
Findings
Achieves very good results on random texts when patterns are long
Achieves very good results on random texts when the alphabet is small
Performs well on natural language texts
Abstract
In this note we present the worst-character rule, an efficient variation of the bad-character heuristic for the exact string matching problem, first introduced in the well-known Boyer-Moore algorithm. Our proposed rule selects a position relative to the current shift which yields the largest average advancement, according to the character distribution in the text. Experimental results show that the worst-character rule achieves very good results, especially in the case of long patterns or small alphabets in random texts and in the case of texts in natural languages.
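The idea in the abstract can be illustrated with a minimal Python sketch. This is one possible reading of the rule, not the authors' implementation: it assumes the candidate positions are the offsets 0..m from the start of the current window (offset m-1 corresponds to Horspool's variant, offset m to Sunday's Quick-Search), it estimates the character distribution by counting the text itself, and it verifies each window with a plain substring comparison. The names shift_table, expected_shift, choose_offset, and search are hypothetical.

```python
from collections import Counter


def shift_table(pattern, q):
    """Safe shifts when the inspected text character is text[s + q].

    For a character c, the shift is the smallest k >= 1 with pattern[q - k] == c.
    Characters missing from the table get the default shift q + 1.
    """
    table = {}
    for k in range(1, q + 1):
        # Walk leftwards from index q - 1; the nearest occurrence wins.
        table.setdefault(pattern[q - k], k)
    return table


def expected_shift(pattern, q, freq):
    """Average advancement for offset q under the character counts in freq."""
    table = shift_table(pattern, q)
    total = sum(freq.values())
    return sum(n * table.get(c, q + 1) for c, n in freq.items()) / total


def choose_offset(pattern, freq):
    """Pick the offset in 0..m with the largest expected shift (assumed range)."""
    return max(range(len(pattern) + 1),
               key=lambda q: expected_shift(pattern, q, freq))


def search(pattern, text, freq=None):
    """Return all start positions of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Assumption: with no distribution supplied, count characters in the text.
    freq = freq or Counter(text)
    q = choose_offset(pattern, freq)
    table = shift_table(pattern, q)
    hits, s = [], 0
    while s <= n - m:
        if text[s:s + m] == pattern:
            hits.append(s)
        if s + q >= n:  # only possible for q == m at the last window
            break
        s += table.get(text[s + q], q + 1)
    return hits
```

For example, search("abc", "xxabcxxabc") returns [2, 7]. Every shift in the table is safe for any offset, so the choice of offset affects only how far the window advances on average, not which occurrences are reported.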
Taxonomy
Topics: Algorithms and Data Compression · Semigroups and Automata Theory · Natural Language Processing Techniques
