Speeding-up $q$-gram mining on grammar-based compressed texts

Keisuke Goto; Hideo Bannai; Shunsuke Inenaga; Masayuki Takeda

arXiv:1202.3311·cs.DS·May 27, 2013

TL;DR

This paper introduces an algorithm that computes $q$-gram frequencies directly on grammar-compressed texts. It speeds up earlier methods by exploiting the redundancy that the grammar captures and by reducing the problem to one on a trie-like structure.

Contribution

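The authors reduce $q$-gram frequency computation on a straight line program (SLP) to a weighted $q$-gram frequencies problem on a trie-like structure, which can be solved in time linear in the size of that structure. This improves on their previous $O(qn)$ algorithm by exploiting the redundancy that the SLP captures with respect to $q$-grams.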

Findings

01

The algorithm runs in $O(\min\{|T| - \mathit{dup}(q, \mathcal{T}),\, qn\})$ time, where $\mathcal{T}$ is the SLP of size $n$ that represents the string $T$.

02

Improves on the authors' previous $O(qn)$ algorithm when $q = \Omega(|T|/n)$.

03

The trie-like structure has size $m = |T| - \mathit{dup}(q, \mathcal{T})$, so the more $q$-gram redundancy the SLP captures, the smaller the reduced problem becomes.
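
The condition in the second finding can be read off by comparing the two terms of the bound. The step below is only a reading of the stated bounds, writing $\mathcal{T}$ for the SLP of size $n$ and $T$ for the string it represents. It is not a derivation taken from the paper.

$$
m = |T| - \mathit{dup}(q, \mathcal{T}) \le |T|,
\qquad
q = \Omega(|T|/n) \;\Longrightarrow\; qn = \Omega(|T|) \;\Longrightarrow\; m \le |T| = O(qn).
$$

In that regime the $O(m)$ term is never asymptotically worse than $O(qn)$ and can be smaller. For smaller $q$, the $\min$ falls back to the $O(qn)$ term.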

Abstract

We present an efficient algorithm for calculating $q$-gram frequencies on strings represented in compressed form, namely, as a straight line program (SLP). Given an SLP $\mathcal{T}$ of size $n$ that represents string $T$, the algorithm computes the occurrence frequencies of all $q$-grams in $T$, by reducing the problem to the weighted $q$-gram frequencies problem on a trie-like structure of size $m = |T| - \mathit{dup}(q, \mathcal{T})$, where $\mathit{dup}(q, \mathcal{T})$ is a quantity that represents the amount of redundancy that the SLP captures with respect to $q$-grams. The reduced problem can be solved in linear time. Since $m = O(qn)$, the running time of our algorithm is $O(\min\{|T| - \mathit{dup}(q, \mathcal{T}),\, qn\})$, improving our previous $O(qn)$ algorithm when $q = \Omega(|T|/n)$.
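
To make the setting concrete, the following is a minimal Python sketch of counting $q$-grams on an SLP without decompressing it. It illustrates the weighted-substring idea that underlies the $O(qn)$ bound: every $q$-gram occurrence crosses the boundary of exactly one rule $X_i \to X_l X_r$, so it is enough to scan a window of at most $2(q-1)$ characters per rule and weight it by how often $X_i$ occurs in the derivation. The rule encoding and the names `qgram_freqs_slp`, `expand`, `pre`, `suf`, and `vocc` are assumptions made for this sketch. The code uses a hash-based counter, so it does not attain the stated time bounds, and it does not reproduce the paper's trie-like structure.

```python
from collections import Counter


def qgram_freqs_slp(rules, q):
    """Count q-grams of the string derived by the last rule of an SLP.

    Assumed encoding (hypothetical, for illustration only): rules[i] is
    either a single character (terminal rule) or a pair (l, r) with
    l, r < i, meaning X_i -> X_l X_r. The list must be non-empty.
    """
    n, k = len(rules), q - 1
    pre = [""] * n  # first min(k, |X_i|) characters of X_i
    suf = [""] * n  # last  min(k, |X_i|) characters of X_i
    for i, rule in enumerate(rules):
        if isinstance(rule, str):
            pre[i] = suf[i] = rule if k > 0 else ""
        else:
            l, r = rule
            pre[i] = (pre[l] + pre[r])[:k]
            suf[i] = (suf[l] + suf[r])[-k:] if k > 0 else ""

    # vocc[i]: number of times X_i appears in the derivation tree of the
    # start symbol (the last rule). Children always have smaller indices.
    vocc = [0] * n
    vocc[-1] = 1
    for i in range(n - 1, -1, -1):
        if not isinstance(rules[i], str):
            l, r = rules[i]
            vocc[l] += vocc[i]
            vocc[r] += vocc[i]

    freq = Counter()
    for i, rule in enumerate(rules):
        if vocc[i] == 0:
            continue  # variable is unreachable from the start symbol
        if isinstance(rule, str):
            if q == 1:
                freq[rule] += vocc[i]
        else:
            l, r = rule
            # Every q-gram in this window crosses the X_l / X_r boundary.
            window = suf[l] + pre[r]
            for j in range(len(window) - q + 1):
                freq[window[j:j + q]] += vocc[i]
    return freq


def expand(rules):
    """Decompress the SLP; used only to check the sketch on small inputs."""
    text = []
    for rule in rules:
        if isinstance(rule, str):
            text.append(rule)
        else:
            text.append(text[rule[0]] + text[rule[1]])
    return text[-1]
```

For example, `rules = ["a", "b", (0, 1), (2, 2), (3, 3)]` derives `abababab`, and `qgram_freqs_slp(rules, 2)` counts `ab` four times and `ba` three times, matching a direct scan of the expanded string.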
