Computing q-gram Non-overlapping Frequencies on SLP Compressed Texts

Keisuke Goto; Hideo Bannai; Shunsuke Inenaga; Masayuki Takeda

arXiv:1107.3022·cs.DS·July 18, 2011

TL;DR

This paper presents an efficient method for computing the non-overlapping frequencies of all q-grams directly from an SLP-compressed text. It works for any q, whereas the previous solution handled only q = 2 and was far slower.

Contribution

It introduces an algorithm that computes the non-overlapping frequencies of all q-grams in O(q^2 n) time and O(q n) space, where n is the size of the SLP, generalizing prior work that was limited to q = 2.

Findings

01

Algorithm runs in O(q^2 n) time

02

Uses O(q n) space

03

Generalizes the previous q = 2 solution, which needed O(n^4 log n) time and O(n^3) space

Abstract

Length-$q$ substrings, or $q$-grams, can represent important characteristics of text data, and determining the frequencies of all $q$-grams contained in the data is an important problem with many applications in the field of data mining and machine learning. In this paper, we consider the problem of calculating the non-overlapping frequencies of all $q$-grams in a text given in compressed form, namely, as a straight line program (SLP). We show that the problem can be solved in $O(q^{2} n)$ time and $O(q n)$ space, where $n$ is the size of the SLP. This generalizes and greatly improves previous work (Inenaga & Bannai, 2009), which solved the problem only for $q = 2$ in $O(n^{4} \log n)$ time and $O(n^{3})$ space.
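To make the problem statement concrete, the following is a minimal sketch of a naive baseline, not the algorithm from the paper: it fully decompresses the SLP and then counts non-overlapping occurrences by greedy left-to-right selection. It assumes the usual definitions, namely that an SLP is a grammar whose rules are either a single character or a concatenation of two variables, and that the non-overlapping frequency of a q-gram is the maximum number of its occurrences that pairwise do not overlap. The names `expand_slp`, `nonoverlapping_qgram_frequencies`, `example_rules`, and the dictionary encoding of rules are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def expand_slp(rules, start):
    """Decompress an SLP into its text.

    `rules` maps a variable name to either a single character (str)
    or a pair (left_variable, right_variable). This encoding is an
    assumption of this sketch. The expanded text can be exponentially
    longer than the SLP, which is why the paper avoids decompression.
    """
    memo = {}

    def derive(var):
        if var not in memo:
            rhs = rules[var]
            if isinstance(rhs, str):
                memo[var] = rhs
            else:
                left, right = rhs
                memo[var] = derive(left) + derive(right)
        return memo[var]

    return derive(start)


def nonoverlapping_qgram_frequencies(text, q):
    """Count, for each q-gram, a maximum set of non-overlapping occurrences.

    Scanning left to right and taking every occurrence that starts at or
    after the end of the last one taken is optimal, because all
    occurrences of a given q-gram have the same length.
    """
    freq = defaultdict(int)
    next_free = {}  # q-gram -> first position not covered by a taken occurrence
    for i in range(len(text) - q + 1):
        gram = text[i:i + q]
        if next_free.get(gram, 0) <= i:
            freq[gram] += 1
            next_free[gram] = i + q
    return dict(freq)


# Hypothetical example SLP deriving the string "abaababa".
example_rules = {
    "X1": "b",
    "X2": "a",
    "X3": ("X2", "X1"),  # ab
    "X4": ("X3", "X2"),  # aba
    "X5": ("X4", "X3"),  # abaab
    "X6": ("X5", "X4"),  # abaababa
}

example_text = expand_slp(example_rules, "X6")
example_freq = nonoverlapping_qgram_frequencies(example_text, 3)
```

In this example the 3-gram `aba` occurs at positions 0, 3, and 5 of `abaababa`, but the occurrences at 3 and 5 overlap, so its non-overlapping frequency is 2 rather than 3. The baseline takes time proportional to the decompressed length, which is what the SLP-based bounds in the abstract are designed to avoid.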

Taxonomy

Topics: Algorithms and Data Compression · DNA and Biological Computing · semigroups and automata theory