The Complexity of the Co-Occurrence Problem

Philip Bille; Inge Li Gørtz; Tord Stordalen

arXiv:2206.10383 · cs.DS · November 11, 2022

TL;DR

This paper analyzes the co-occurrence problem in terms of a new parameter d, giving a simple data structure with optimal O(d) space and O(log log n) query time, and simplifies the existing solution using intuitive combinatorial arguments.

Contribution

It presents a simple, optimal data structure for the co-occurrence problem based on a new parameter, improving understanding and efficiency over prior work.

Findings

01

The data structure uses O(d) space with O(log log n) query time.

02

O(d) space is proven to be optimal for the problem.

03

The O(log log n) query time matches the state of the art [CPM 2021], and the O(d) space bound is tight.

Abstract

Let $S$ be a string of length $n$ over an alphabet $\Sigma$ and let $Q$ be a subset of $\Sigma$ of size $q \geq 2$. The 'co-occurrence problem' is to construct a compact data structure that supports the following query: given an integer $w$, return the number of length-$w$ substrings of $S$ that contain each character of $Q$ at least once. This is a natural string problem with applications to, e.g., data mining, natural language processing, and DNA analysis. The state of the art is an $O(nq)$ space data structure that, with some minor additions, supports queries in $O(\log \log n)$ time [CPM 2021]. Our contributions are as follows. Firstly, we analyze the problem in terms of a new, natural parameter $d$, giving a simple data structure that uses $O(d)$ space and supports queries in $O(\log \log n)$ time. The preprocessing algorithm does a single pass over $S$, runs in expected…
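
To make the query in the abstract concrete, here is a minimal Python sketch of a naive baseline that answers a single query by sliding a window of length $w$ over $S$. It is not the paper's $O(d)$-space data structure and does not achieve $O(\log \log n)$ query time: each call scans the whole string in $O(n)$ time. The function name `co_occurrence_count` and its parameters are assumptions made for illustration.

```python
from collections import Counter


def co_occurrence_count(s, q_chars, w):
    """Count the length-w substrings of s that contain every character of q_chars.

    Naive O(n)-per-query baseline for illustration only; this is NOT the
    data structure from the paper.
    """
    q = set(q_chars)
    n = len(s)
    if w <= 0 or w > n:
        return 0

    counts = Counter()   # occurrences of each character of Q in the current window
    missing = len(q)     # number of characters of Q absent from the current window
    total = 0

    for i, ch in enumerate(s):
        # Extend the window to include s[i].
        if ch in q:
            if counts[ch] == 0:
                missing -= 1
            counts[ch] += 1

        # Shrink the window from the left once it exceeds length w.
        if i >= w:
            old = s[i - w]
            if old in q:
                counts[old] -= 1
                if counts[old] == 0:
                    missing += 1

        # The window s[i-w+1 .. i] is complete and contains all of Q.
        if i >= w - 1 and missing == 0:
            total += 1

    return total
```

For example, with `s = "abcab"` and `Q = {"a", "b"}`, the call `co_occurrence_count("abcab", "ab", 2)` returns 2 (the two occurrences of "ab"), and `w = 3` returns 3, since "abc", "bca", and "cab" each contain both characters.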

Taxonomy

Topics: Algorithms and Data Compression · Semigroups and Automata Theory · DNA and Biological Computing