TL;DR
This paper introduces a novel, efficient method for computing the net frequency of strings in large texts. Net frequency helps identify significant, maximal-length strings for applications such as text compression and tokenization.
Contribution
The paper presents a new characteristic of net frequency, together with algorithms that use suffix arrays and the Burrows-Wheeler transform to compute net frequency efficiently. No prior work showed how to compute net frequency efficiently.
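As background for the suffix-array machinery mentioned above, here is a minimal sketch of how a suffix array can be used to count the plain frequency of a pattern. It is not the paper's algorithm and does not compute net frequency. The function names `build_suffix_array` and `frequency` are illustrative assumptions, and the construction is deliberately naive rather than linear-time.

```python
def build_suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes of text in sorted order.

    Naive construction (sorting full suffixes), for illustration only;
    it is far slower than the linear-time constructions used in practice.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def frequency(text: str, sa: list[int], pattern: str) -> int:
    """Count occurrences of pattern in text using the suffix array sa.

    Suffixes that start with pattern occupy a contiguous range of sa,
    so two binary searches find the range boundaries.
    """
    m = len(pattern)

    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo - start
```

For example, with `text = "abracadabra"` and `sa = build_suffix_array(text)`, the call `frequency(text, sa, "abra")` returns 2.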
Findings
The method is approximately 100 times faster than the baselines when computing the net frequency of a single string.
The new algorithms solve the net frequency problems with linear construction cost.
All strings with positive net frequency can be reported in linear time.
Abstract
Knowing which strings in a massive text are significant -- that is, which strings are common and distinct from other strings -- is valuable for several applications, including text compression and tokenization. Frequency in itself is not helpful for significance, because the commonest strings are the shortest strings. A compelling alternative is net frequency, which has the property that strings with positive net frequency are of maximal length. However, net frequency remains relatively unexplored, and there is no prior art showing how to compute it efficiently. We first introduce a characteristic of net frequency that simplifies the original definition. With this, we study strings with positive net frequency in Fibonacci words. We then use our characteristic and solve two key problems related to net frequency. First, SINGLE-NF, how to compute the net frequency of a given…
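The abstract's remark that the commonest strings are the shortest strings can be checked with a small brute-force sketch. It counts every substring of a short text and confirms that removing a character from either end of a string never lowers its frequency. This illustrates only plain frequency, not net frequency, and the helper names are illustrative assumptions rather than anything taken from the paper.

```python
from collections import Counter


def substring_frequencies(text: str) -> Counter:
    """Count the occurrences of every non-empty substring of text.

    Brute force over all (start, end) pairs; suitable only for short texts.
    """
    counts = Counter()
    n = len(text)
    for i in range(n):
        for j in range(i + 1, n + 1):
            counts[text[i:j]] += 1
    return counts


def shorter_is_at_least_as_frequent(counts: Counter) -> bool:
    """Check that dropping the first or last character never lowers frequency."""
    return all(
        counts[s[:-1]] >= c and counts[s[1:]] >= c
        for s, c in counts.items()
        if len(s) > 1
    )


counts = substring_frequencies("abracadabra")
print(counts["a"], counts["abra"])              # prints: 5 2
print(shorter_is_at_least_as_frequent(counts))  # prints: True
```

Because every occurrence of a string contains an occurrence of each of its substrings, ranking by raw frequency always favors short strings, which is the motivation the abstract gives for using net frequency instead.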
