R-enum: Enumeration of Characteristic Substrings in BWT-runs Bounded Space
Takaaki Nishimoto, Yasuo Tabei

TL;DR
This paper introduces r-enum, a space-efficient algorithm that enumerates the characteristic substrings of a string using the run-length encoded Burrows-Wheeler transform (RLBWT). It targets highly repetitive strings and large datasets.
Contribution
The paper presents the first RLBWT-based enumeration algorithm for characteristic substrings, achieving improved space efficiency for highly repetitive strings.
Findings
Runs in $O(n \log \log (n/r))$ time for an input string of length $n$
Uses $O(r \log n)$ bits of space, with $r$ being the number of RLBWT runs
More space-efficient than previous methods on benchmark datasets
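To make the quantity $r$ in the bounds above concrete, here is a minimal sketch that builds a BWT by naively sorting rotations and then run-length encodes it. This is an illustration only, not the construction used by r-enum: the function names `bwt` and `run_length_encode` and the `$` sentinel are assumptions made for this example, and sorting rotations takes far more time and space than a practical RLBWT builder would.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Naive BWT: sort all rotations of text + sentinel, keep the last column.

    Illustrative only; assumes the sentinel is smaller than every
    character of text and does not occur in it.
    """
    assert sentinel not in text
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def run_length_encode(s: str) -> list[tuple[str, int]]:
    """Collapse consecutive equal characters into (character, length) runs."""
    runs: list[tuple[str, int]] = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs


if __name__ == "__main__":
    for text in ["banana", "ab" * 8]:
        transformed = bwt(text)
        runs = run_length_encode(transformed)
        # n counts the sentinel; r is the number of runs in the BWT.
        print(f"{text!r}: n = {len(transformed)}, r = {len(runs)}, runs = {runs}")
```

On the repetitive input `"ab" * 8`, the 17-character BWT collapses into 3 runs, while `"banana"` (7 characters with the sentinel) needs 5. That gap between $r$ and $n$ is what a space bound stated in terms of $r$ exploits.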
Abstract
Enumerating characteristic substrings (e.g., maximal repeats, minimal unique substrings, and minimal absent words) in a given string has been an important research topic because it has a wide variety of applications in areas such as string processing and computational biology. Although several enumeration algorithms for characteristic substrings have been proposed, they are not space-efficient in that their space usage is proportional to the length of the input string. Recently, the run-length encoded Burrows-Wheeler transform (RLBWT) has attracted increased attention in string processing, and various algorithms for the RLBWT have been developed. Developing enumeration algorithms for characteristic substrings with the RLBWT, however, remains a challenge. In this paper, we present r-enum (RLBWT-based enumeration), the first enumeration algorithm for characteristic substrings…
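For readers unfamiliar with the terms, the following brute-force sketch spells out two of the characteristic substrings named in the abstract, using their standard definitions: a maximal repeat occurs at least twice and loses occurrences when extended by any character on either side, and a minimal unique substring occurs exactly once while every proper substring of it occurs at least twice. The function names are hypothetical, and this is not how r-enum works: it scans every substring directly, so it is only suitable for tiny inputs.

```python
def count_occurrences(text: str, w: str) -> int:
    """Count possibly overlapping occurrences of w in text."""
    return sum(1 for i in range(len(text) - len(w) + 1) if text.startswith(w, i))


def all_substrings(text: str) -> set[str]:
    """Every distinct non-empty substring of text."""
    return {text[i:j] for i in range(len(text)) for j in range(i + 1, len(text) + 1)}


def maximal_repeats(text: str) -> set[str]:
    """Substrings occurring >= 2 times whose every one-character
    left or right extension occurs strictly fewer times."""
    alphabet = set(text)
    result = set()
    for w in all_substrings(text):
        occ = count_occurrences(text, w)
        if occ < 2:
            continue
        left_maximal = all(count_occurrences(text, c + w) < occ for c in alphabet)
        right_maximal = all(count_occurrences(text, w + c) < occ for c in alphabet)
        if left_maximal and right_maximal:
            result.add(w)
    return result


def minimal_unique_substrings(text: str) -> set[str]:
    """Substrings occurring exactly once whose longest proper prefix
    and longest proper suffix both occur at least twice."""
    result = set()
    for w in all_substrings(text):
        if count_occurrences(text, w) != 1:
            continue
        # A single character has only the empty string as a proper substring.
        if len(w) == 1 or (
            count_occurrences(text, w[1:]) >= 2
            and count_occurrences(text, w[:-1]) >= 2
        ):
            result.add(w)
    return result


if __name__ == "__main__":
    print(sorted(maximal_repeats("banana")))
    print(sorted(minimal_unique_substrings("banana")))
```

For `"banana"` this yields the maximal repeats `a` and `ana` and the minimal unique substrings `b` and `nan`. Checking only the longest proper prefix and suffix is enough for minimality, because any shorter substring is contained in one of them and therefore occurs at least as often.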
Taxonomy
Topics: Algorithms and Data Compression · Genomics and Phylogenetic Studies · RNA and protein synthesis mechanisms
