Exploiting New Properties of String Net Frequency for Efficient Computation

Peaker Guo; Patrick Eades; Anthony Wirth; Justin Zobel

arXiv:2404.12701 · cs.DS · April 24, 2024

TL;DR

This paper introduces a novel, efficient method for computing net frequency of strings in large texts, which helps identify significant, maximal-length strings for applications like text compression and tokenization.

Contribution

The paper presents a new characteristic of net frequency that simplifies the original definition, along with algorithms that use suffix arrays and the Burrows-Wheeler transform to compute net frequency efficiently, a problem for which no efficient method was previously known.
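
As a rough illustration of why suffix arrays suit frequency queries, the sketch below counts the occurrences of a pattern by binary search over a naively built suffix array. It is not the paper's algorithm: the function names `build_suffix_array` and `frequency` are hypothetical, and the slow sort-based construction is an assumption made only to keep the example short.

```python
def build_suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes in lexicographic order.

    Naive construction by sorting suffixes directly (an assumption for
    brevity; linear-time constructions exist but are much longer).
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def frequency(text: str, sa: list[int], pattern: str) -> int:
    """Count occurrences of `pattern` in `text` using the suffix array `sa`."""
    m = len(pattern)

    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # Every suffix in [start, lo) begins with the pattern.
    return lo - start


text = "abracadabra"
sa = build_suffix_array(text)
print(frequency(text, sa, "abra"))  # 2
print(frequency(text, sa, "a"))     # 5
```

All occurrences of a pattern form one contiguous range of the suffix array, so a frequency query reduces to finding the two ends of that range.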

Findings

01

Method is approximately 100 times faster than baselines for single net frequency computation.

02

New algorithms solve net frequency problems with linear construction cost.

03

Efficiently reports all strings with positive net frequency in linear time.

Abstract

Knowing which strings in a massive text are significant, that is, which strings are common and distinct from other strings, is valuable for several applications, including text compression and tokenization. Frequency in itself is not helpful for significance, because the commonest strings are the shortest strings. A compelling alternative is net frequency, which has the property that strings with positive net frequency are of maximal length. However, net frequency remains relatively unexplored, and there is no prior art showing how to compute it efficiently. We first introduce a characteristic of net frequency that simplifies the original definition. With this, we study strings with positive net frequency in Fibonacci words. We then use our characteristic and solve two key problems related to net frequency. First, SINGLE-NF, how to compute the net frequency of a given…
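
To make the notion of net frequency concrete, here is a minimal brute-force sketch. It assumes an extension-based reading that this page does not spell out: a repeated string's net frequency is the number of its occurrences whose one-character left extension and one-character right extension are both unique in the text. The helper names are hypothetical, an occurrence touching a text boundary is treated as having a unique extension on that side (also an assumption), and the running time is far from the efficient algorithms the abstract refers to.

```python
def count_occurrences(text: str, pattern: str) -> int:
    """Count (possibly overlapping) occurrences of `pattern` in `text`."""
    m = len(pattern)
    return sum(1 for i in range(len(text) - m + 1) if text.startswith(pattern, i))


def net_frequency(text: str, s: str) -> int:
    """Brute-force net frequency under the assumed extension-based reading."""
    if not s:
        raise ValueError("s must be non-empty")
    m = len(s)
    positions = [i for i in range(len(text) - m + 1) if text.startswith(s, i)]

    # A string that is not repeated has net frequency zero.
    if len(positions) < 2:
        return 0

    net = 0
    for i in positions:
        # Assumption: at a text boundary the extension counts as unique.
        left_unique = i == 0 or count_occurrences(text, text[i - 1:i + m]) == 1
        right_unique = (
            i + m == len(text)
            or count_occurrences(text, text[i:i + m + 1]) == 1
        )
        if left_unique and right_unique:
            net += 1
    return net


text = "abcabd"
print(net_frequency(text, "ab"))  # 2
print(net_frequency(text, "a"))   # 0
print(net_frequency(text, "b"))   # 0
```

In this toy text, "a" and "b" are as frequent as "ab", yet only "ab" has positive net frequency, because each occurrence of "a" or "b" extends to a repeated string.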

Code & Models

Repositories

peakergzf/string-net-frequency
Official
