Computing q-gram Non-overlapping Frequencies on SLP Compressed Texts
Keisuke Goto, Hideo Bannai, Shunsuke Inenaga, Masayuki Takeda

TL;DR
This paper presents an efficient method for computing non-overlapping q-gram frequencies directly from SLP-compressed texts. It works for arbitrary q, whereas the previous solution handled only q=2.
Contribution
It introduces an algorithm that computes the non-overlapping frequencies of all q-grams in O(q^2 n) time and O(q n) space, where n is the size of the SLP representing the text, generalizing prior work limited to q=2.
Findings
Algorithm runs in O(q^2 n) time
Uses O(q n) space
Generalizes previous q=2 solution
Abstract
Length-q substrings, or q-grams, can represent important characteristics of text data, and determining the frequencies of all q-grams contained in the data is an important problem with many applications in the field of data mining and machine learning. In this paper, we consider the problem of calculating the non-overlapping frequencies of all q-grams in a text given in compressed form, namely, as a straight line program (SLP). We show that the problem can be solved in O(q^2 n) time and O(q n) space, where n is the size of the SLP. This generalizes and greatly improves previous work (Inenaga & Bannai, 2009), which solved the problem only for q=2 in O(n^4 log n) time and O(n^3) space.
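To make the problem statement concrete, here is a minimal Python sketch of a naive baseline: it decompresses the SLP and then counts non-overlapping occurrences with a greedy left-to-right scan. This is not the paper's algorithm, which works on the compressed form without expanding it. The rule encoding (a dict mapping each variable to a single character or a pair of variables) and the function names `expand_slp` and `nonoverlapping_qgram_freqs` are assumptions made for illustration.

```python
from collections import defaultdict


def expand_slp(rules, start):
    """Decompress an SLP given as {var: char} or {var: (left, right)}.

    Hypothetical encoding for illustration only. The expanded text can be
    exponentially longer than the SLP, which is why this is a baseline.
    """
    memo = {}

    def go(var):
        if var not in memo:
            rhs = rules[var]
            # A terminal rule holds one character; otherwise concatenate
            # the expansions of the two child variables.
            memo[var] = rhs if isinstance(rhs, str) else go(rhs[0]) + go(rhs[1])
        return memo[var]

    return go(start)


def nonoverlapping_qgram_freqs(text, q):
    """Return {q-gram: maximum number of pairwise non-overlapping occurrences}.

    Greedy rule: scanning left to right, accept an occurrence only if it
    starts at or after the end of the last accepted occurrence of the same
    q-gram. For equal-length occurrences this yields the maximum count.
    """
    last_end = {}            # q-gram -> end position of last accepted occurrence
    freq = defaultdict(int)
    for i in range(len(text) - q + 1):
        gram = text[i:i + q]
        if last_end.get(gram, 0) <= i:
            freq[gram] += 1
            last_end[gram] = i + q
    return dict(freq)


# Example SLP deriving "aaaaaaaa" with four rules.
rules = {"X1": "a", "X2": ("X1", "X1"), "X3": ("X2", "X2"), "X4": ("X3", "X3")}
text = expand_slp(rules, "X4")
freqs = nonoverlapping_qgram_freqs(text, 3)
```

In this example the 3-gram "aaa" occurs six times in the expanded text if overlaps are allowed, but at most two of those occurrences are pairwise non-overlapping, which is the quantity the paper computes.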
Taxonomy
Topics: Algorithms and Data Compression · DNA and Biological Computing · Semigroups and Automata Theory
