Speeding-up $q$-gram mining on grammar-based compressed texts
Keisuke Goto, Hideo Bannai, Shunsuke Inenaga, Masayuki Takeda

TL;DR
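This paper introduces an efficient algorithm for computing $q$-gram frequencies directly on grammar-compressed texts given as straight line programs (SLPs). It reduces the task to a weighted $q$-gram frequencies problem on a trie-like structure whose size shrinks with the redundancy the grammar captures, which speeds up the computation over the authors' previous method.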
Contribution
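The authors reduce $q$-gram frequency computation on an SLP to a weighted $q$-gram frequencies problem on a trie-like structure and solve the reduced problem in time linear in the size of that structure. This improves on their previous algorithm by exploiting the redundancy that the SLP captures with respect to $q$-grams.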
Findings
Algorithm runs in $O(\min\{|T| - \text{dup}(q, \text{SLP}), qn\})$ time.
Reduces complexity compared to the previous $O(qn)$ algorithm when $q = \Omega(|T|/n)$.
Effectively leverages redundancy in SLPs to optimize $q$-gram frequency calculations.
Abstract
We present an efficient algorithm for calculating $q$-gram frequencies on strings represented in compressed form, namely, as a straight line program (SLP). Given an SLP of size $n$ that represents string $T$, the algorithm computes the occurrence frequencies of all $q$-grams in $T$, by reducing the problem to the weighted $q$-gram frequencies problem on a trie-like structure of size $m = |T| - \text{dup}(q, \text{SLP})$, where $\text{dup}(q, \text{SLP})$ is a quantity that represents the amount of redundancy that the SLP captures with respect to $q$-grams. The reduced problem can be solved in linear time. Since $m = O(qn)$, the running time of our algorithm is $O(\min\{|T| - \text{dup}(q, \text{SLP}), qn\})$, improving our previous $O(qn)$ algorithm when $q = \Omega(|T|/n)$.
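To make the setting concrete, the following is a minimal Python sketch of counting $q$-grams on an SLP without decompressing it. It illustrates the general weighted-counting idea behind an $O(qn)$-style baseline and is not the paper's trie-based algorithm. The rule encoding (a list where each entry is either a single character or a pair of earlier rule indices, with the last rule as the start symbol) and the function name `qgram_counts_slp` are assumptions made for illustration. Each variable $X \to X_l X_r$ contributes the $q$-grams that cross the boundary between its two children, weighted by how many times $X$ occurs in the derivation tree.

```python
from collections import Counter

def qgram_counts_slp(rules, q):
    """Count q-grams of the string derived by an SLP (hypothetical encoding).

    rules[i] is either a one-character string (terminal rule) or a tuple
    (l, r) with l, r < i (binary rule). The last rule is the start symbol.
    """
    n = len(rules)
    k = q - 1  # length of the prefix/suffix kept per variable

    # pre[i] / suf[i]: first / last min(k, |val(X_i)|) characters of val(X_i).
    pre, suf = [""] * n, [""] * n
    for i, rule in enumerate(rules):
        if isinstance(rule, str):
            pre[i] = suf[i] = rule[:k]
        else:
            l, r = rule
            p = pre[l] + pre[r]
            s = suf[l] + suf[r]
            pre[i] = p[:k]
            suf[i] = s[len(s) - min(k, len(s)):]

    # vocc[i]: number of nodes labelled X_i in the derivation tree.
    vocc = [0] * n
    vocc[n - 1] = 1
    for i in range(n - 1, -1, -1):
        if not isinstance(rules[i], str):
            l, r = rules[i]
            vocc[l] += vocc[i]
            vocc[r] += vocc[i]

    counts = Counter()
    for i, rule in enumerate(rules):
        if isinstance(rule, str):
            if q == 1:  # 1-grams are exactly the terminal occurrences
                counts[rule] += vocc[i]
            continue
        l, r = rule
        # Every length-q window of t crosses the boundary between X_l and X_r.
        t = suf[l] + pre[r]
        for j in range(len(t) - q + 1):
            counts[t[j:j + q]] += vocc[i]
    return counts

# Example SLP deriving "abaab": X0=a, X1=b, X2=X0 X1, X3=X2 X0, X4=X3 X2.
example_rules = ["a", "b", (0, 1), (2, 0), (3, 2)]
example_counts = qgram_counts_slp(example_rules, 2)
```

The sketch inspects a string of length at most $2(q-1)$ per variable, which is where the $qn$ term comes from. The abstract's improvement comes from avoiding the parts of these strings that are counted redundantly across variables.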
