Substring Complexities on Run-length Compressed Strings

Akiyoshi Kawamoto; Tomohiro I

arXiv:2205.12421·cs.DS·May 26, 2022

TL;DR

This paper introduces an efficient method to compute the substring complexity measure $δ$ of a string given in run-length compressed form, running in $C_{sort} (r, n)$ time and $O (r)$ space, which supports the analysis of string repetitiveness.

Contribution

It presents an algorithm that computes $δ$ directly from a run-length compressed string of size $r$ in $C_{sort} (r, n)$ time and $O (r)$ space, where $n$ is the length of the uncompressed string.

Findings

01

$δ$ can be computed in $C_{sort} (r, n)$ time

02

The algorithm uses $O (r)$ space

03

Efficient analysis of highly-repetitive strings is enabled
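
To make the parameters $n$ (string length) and $r$ (run-length compressed size) concrete, here is a minimal Python sketch of run-length encoding. The function names `run_length_encode` and `run_length_decode` are hypothetical and chosen for illustration; they are not taken from the paper.

```python
from itertools import groupby


def run_length_encode(text):
    """Return text as a list of (character, run_length) pairs."""
    return [(ch, sum(1 for _ in group)) for ch, group in groupby(text)]


def run_length_decode(runs):
    """Expand a list of (character, run_length) pairs back into a string."""
    return "".join(ch * length for ch, length in runs)


text = "aaaabbbaaaaaac"
runs = run_length_encode(text)
n = len(text)  # length of the uncompressed string
r = len(runs)  # number of runs, i.e. the compressed size
```

For this input the string has $n = 14$ characters but only $r = 4$ runs, and the gap between the two grows with the length of the runs.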

Abstract

Let $S_{T} (k)$ denote the set of distinct substrings of length $k$ in a string $T$, then the $k$-th substring complexity is defined by its cardinality $|S_{T} (k)|$. Recently, $δ = \max \{ |S_{T} (k)| / k : k \geq 1 \}$ is shown to be a good compressibility measure of highly-repetitive strings. In this paper, given $T$ of length $n$ in the run-length compressed form of size $r$, we show that $δ$ can be computed in $C_{sort} (r, n)$ time and $O (r)$ space, where $C_{sort} (r, n) = O (\min (r \lg \lg r, r \lg_{r} n))$ is the time complexity for sorting $r$ $O (\lg n)$-bit integers in $O (r)$ space in the Word-RAM model with word size $Ω (\lg n)$.
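
The definition of $δ$ can be checked on small inputs with a brute-force Python sketch. This is not the paper's algorithm: it works on the uncompressed string and takes time polynomial in $n$ rather than $C_{sort} (r, n)$. The names `substring_complexity` and `delta_bruteforce` are hypothetical.

```python
from fractions import Fraction


def substring_complexity(text, k):
    """Return |S_T(k)|, the number of distinct substrings of length k."""
    return len({text[i:i + k] for i in range(len(text) - k + 1)})


def delta_bruteforce(text):
    """Return max{|S_T(k)| / k : k >= 1} as an exact fraction.

    For k > len(text) there are no substrings, so only k = 1..len(text)
    can attain the maximum.
    """
    if not text:
        return Fraction(0)
    return max(
        Fraction(substring_complexity(text, k), k)
        for k in range(1, len(text) + 1)
    )


# "0001011100" contains all 8 binary strings of length 3,
# so the maximum is attained at k = 3 with value 8/3.
example_delta = delta_bruteforce("0001011100")
```

Exact fractions avoid floating-point ties when comparing the ratios $|S_{T} (k)| / k$ across different $k$.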

Taxonomy

Topics: Algorithms and Data Compression · Semigroups and Automata Theory · DNA and Biological Computing