Optimum Search Schemes for Approximate String Matching Using Bidirectional FM-Index
Kiavash Kianfar, Christopher Pockrandt, Bahman Torkamandi, Haochen Luo, Knut Reinert

TL;DR
This paper introduces a mixed integer programming approach to optimize search schemes for approximate string matching with bidirectional FM-indexes, significantly improving search speed and efficiency in bioinformatics applications.
Contribution
It presents the first MIP-based method to find optimal search schemes for approximate matching, enhancing performance over previous ad-hoc solutions.
Findings
Optimal search schemes significantly speed up approximate matching.
Approximate matching of 101-bp reads with two errors is 35 times faster than standard backtracking.
Search time with optimal schemes is comparable to state-of-the-art aligners.
Abstract
Finding approximate occurrences of a pattern in a text using a full-text index is a central problem in bioinformatics and has been extensively researched. Bidirectional indices have opened new possibilities in this regard, allowing the search to start from anywhere within the pattern and extend in both directions. In particular, the use of search schemes (partitioning the pattern and searching the pieces in certain orders with given bounds on errors) can yield significant speed-ups. However, finding optimal search schemes is a difficult combinatorial optimization problem. Here, for the first time, we propose a mixed integer program (MIP) capable of solving this optimization problem for Hamming distance with a given number of pieces. Our experiments show that the optimal search schemes found by our MIP significantly improve the performance of search in bidirectional FM-index upon previous ad-hoc…
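To make the notion of a search scheme concrete, here is a minimal Python sketch. It assumes a common representation that this summary does not spell out: each search is a triple (piece order, cumulative lower error bounds, cumulative upper error bounds), and a scheme is valid for K mismatches if every way of distributing at most K errors over the pieces is accepted by at least one search. The function names and the three-search example scheme for K = 2 are illustrative choices, not taken from the text above, and the sketch only checks validity; it does not solve the MIP.

```python
from itertools import product

def is_connected(order):
    """True if every prefix of the piece order is a contiguous block of pieces.

    A bidirectional index can only extend the current match left or right,
    so the pieces searched so far must always be adjacent in the pattern.
    """
    for i in range(1, len(order) + 1):
        prefix = order[:i]
        if max(prefix) - min(prefix) + 1 != i:
            return False
    return True

def covers(search, errors):
    """True if `search` accepts the error distribution `errors`.

    `search` is (order, lower, upper); pieces in `order` are 1-indexed.
    `errors[p - 1]` is the number of mismatches falling in piece p.
    The bounds apply to the cumulative error count after each piece.
    """
    order, lower, upper = search
    total = 0
    for piece, lo, hi in zip(order, lower, upper):
        total += errors[piece - 1]
        if not lo <= total <= hi:
            return False
    return True

def uncovered(scheme, num_pieces, max_errors):
    """List every error distribution with at most `max_errors` errors
    that no search in `scheme` accepts (empty list means the scheme is valid)."""
    return [
        e
        for e in product(range(max_errors + 1), repeat=num_pieces)
        if sum(e) <= max_errors and not any(covers(s, e) for s in scheme)
    ]

# Hypothetical example: three pieces, up to two mismatches.
forward = ((1, 2, 3), (0, 0, 0), (0, 2, 2))        # piece 1 must be error-free
backward = ((3, 2, 1), (0, 0, 0), (0, 1, 2))       # piece 3 must be error-free
bidirectional = ((2, 1, 3), (0, 0, 1), (0, 1, 2))  # piece 2 must be error-free
scheme = [forward, backward, bidirectional]

if __name__ == "__main__":
    print(all(is_connected(s[0]) for s in scheme))
    print(uncovered(scheme, num_pieces=3, max_errors=2))
```

Tight upper bounds on the first pieces are what prune the backtracking early; an optimization such as the MIP described in the abstract would choose the orders and bounds so that all distributions stay covered while the expected search effort is minimized.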
Taxonomy
Topics: Algorithms and Data Compression · Natural Language Processing Techniques · Genomics and Phylogenetic Studies
