String Sampling with Bidirectional String Anchors
Grigorios Loukides, Solon P. Pissis, and Michelle Sweering

TL;DR
This paper introduces bidirectional string anchors (bd-anchors), a new string sampling method that improves sample size efficiency and supports fast online pattern searches, addressing limitations of traditional minimizers sampling.
Contribution
The paper proposes bd-anchors, a novel string sampling mechanism with theoretical analysis, practical efficiency, and applications to indexing and pattern search.
Findings
bd-anchors sample sizes decrease proportionally to the parameter ℓ
Sample sizes are competitive with or smaller than minimizers sample sizes
Index construction over bd-anchors is near-linear and space-efficient
Abstract
The minimizers sampling mechanism is a popular mechanism for string sampling introduced independently by Schleimer et al. [SIGMOD 2003] and by Roberts et al. [Bioinf. 2004]. Given two positive integers w and k, it selects the lexicographically smallest length-k substring in every fragment of w consecutive length-k substrings (in every sliding window of length w + k − 1). Minimizers samples are approximately uniform, locally consistent, and computable in linear time. Two main disadvantages of minimizers sampling mechanisms are: first, they do not have good guarantees on the expected size of their samples for every combination of w and k; and, second, indexes that are constructed over their samples do not have good worst-case guarantees for on-line pattern searches. We introduce bidirectional string anchors (bd-anchors), a new string sampling mechanism. Given a positive…
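The minimizers selection rule described in the abstract can be illustrated with a short sketch. It calls the two integers `w` (number of consecutive substrings per window) and `k` (substring length); the function name `minimizers` and the leftmost tie-breaking rule are assumptions made for this illustration, not details taken from the paper. This naive version rescans every window and is therefore not the linear-time computation the abstract mentions.

```python
def minimizers(s: str, w: int, k: int) -> list[int]:
    """Return the sorted start positions selected by minimizers sampling.

    For every window of w consecutive length-k substrings of s, the
    lexicographically smallest substring is selected. Ties are broken by
    taking the leftmost position (an assumption of this sketch).
    """
    if w <= 0 or k <= 0:
        raise ValueError("w and k must be positive")

    n_kmers = len(s) - k + 1          # number of length-k substrings in s
    selected = set()

    # Each window covers k-mer start positions start .. start + w - 1,
    # i.e. a stretch of w + k - 1 letters of s.
    for start in range(n_kmers - w + 1):
        best = min(range(start, start + w), key=lambda i: (s[i:i + k], i))
        selected.add(best)

    return sorted(selected)


if __name__ == "__main__":
    print(minimizers("banana", w=2, k=3))
```

Because neighbouring windows usually agree on their smallest substring, the set of selected positions is typically much smaller than the number of windows, which is what makes the sample useful for indexing.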
Taxonomy
Topics: Algorithms and Data Compression · Genomics and Phylogenetic Studies · Natural Language Processing Techniques
