Contextual Pattern Mining and Counting

Ling Li; Daniel Gibney; Sharma V. Thankachan; Solon P. Pissis; Grigorios Loukides

arXiv:2506.17613·cs.DS·June 24, 2025

TL;DR

This paper introduces a linear-work algorithm for Contextual Pattern Mining (CPM) and an index for Contextual Pattern Counting (CPC), two problems about the distinct left and right contexts in which a pattern occurs in a long string, with an emphasis on scaling to large datasets.

Contribution

It presents novel linear-work algorithms for CPM and a space-efficient index for CPC, optimized with LZ77-based techniques for large-scale string datasets.

Findings

01. CPM algorithm handles billion-letter datasets with minimal internal memory.

02. CPC index outperforms state-of-the-art in query time and space.

03. Optimizations enable practical large dataset processing.

Abstract

Given a string $P$ of length $m$, a longer string $T$ of length $n > m$, and two integers $l \geq 0$ and $r \geq 0$, the context of $P$ in $T$ is the set of all string pairs $(L, R)$, with $|L| = l$ and $|R| = r$, such that the string $LPR$ occurs in $T$. We introduce two problems related to the notion of context: (1) the Contextual Pattern Mining (CPM) problem, which given $T$, $(m, l, r)$, and an integer $\tau > 0$, asks for outputting the context of each substring $P$ of length $m$ of $T$, provided that the size of the context of $P$ is at least $\tau$; and (2) the Contextual Pattern Counting (CPC) problem, which asks for preprocessing $T$ so that the size of the context of a given query string $P$ of length $m$ can be found efficiently. For CPM, we propose a linear-work algorithm that either uses only internal memory, or a bounded amount of internal memory and external memory, which allows…
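To make the definitions concrete, here is a minimal brute-force sketch of the context, CPM, and CPC as stated in the abstract. It is only an illustration of the problem statements, not the paper's linear-work algorithm or its index: it uses a plain hash map and takes time and space far beyond what the paper targets. The function names `contexts`, `cpm`, and `cpc_count` are hypothetical and chosen here for illustration.

```python
from collections import defaultdict


def contexts(T, m, l, r):
    """Map each length-m substring P of T to its context, the set of
    pairs (L, R) with |L| = l and |R| = r such that L + P + R occurs in T.
    Naive illustration only (hypothetical helper, not the paper's method)."""
    ctx = defaultdict(set)
    # i is the start of P; L is T[i-l:i] and R is T[i+m:i+m+r],
    # so i must leave room for l letters before and r letters after P.
    for i in range(l, len(T) - m - r + 1):
        P = T[i:i + m]
        ctx[P].add((T[i - l:i], T[i + m:i + m + r]))
    return ctx


def cpm(T, m, l, r, tau):
    """CPM: report the context of every length-m substring whose
    context has size at least tau."""
    return {P: c for P, c in contexts(T, m, l, r).items() if len(c) >= tau}


def cpc_count(ctx, P):
    """CPC: size of the context of query P, answered from a
    precomputed map (a stand-in for a real index)."""
    return len(ctx.get(P, ()))


T = "abcabdabc"
ctx = contexts(T, m=1, l=1, r=1)
print(cpc_count(ctx, "b"))                      # 2: contexts (a, c) and (a, d)
print(sorted(cpm(T, m=1, l=1, r=1, tau=2)))     # ['a', 'b']
```

In this toy input, `b` occurs three times but has only two distinct contexts, which shows that the context size counts distinct $(L, R)$ pairs rather than occurrences.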

Code & Models

Repositories

lingli97/cpm (official)
