TL;DR
This paper introduces efficient algorithms and indexing methods for contextual pattern mining and counting in large datasets, enabling scalable analysis of string patterns with practical performance improvements.
Contribution
It presents novel linear-work algorithms for Contextual Pattern Mining (CPM) and a space-efficient index for Contextual Pattern Counting (CPC), optimized with LZ77-based techniques for large-scale string datasets.
Findings
The CPM algorithm handles billion-letter datasets with minimal internal memory.
The CPC index outperforms the state of the art in both query time and space.
The optimizations make processing large datasets practical.
Abstract
Given a string P of length m, a longer string T of length n > m, and two integers l ≥ 0 and r ≥ 0, the context of P in T is the set of all string pairs (L, R), with |L| = l and |R| = r, such that the string LPR occurs in T. We introduce two problems related to the notion of context: (1) the Contextual Pattern Mining (CPM) problem, which given T, (m, l, r), and an integer τ > 0, asks for outputting the context of each substring P of length m of T, provided that the size of the context of P is at least τ; and (2) the Contextual Pattern Counting (CPC) problem, which asks for preprocessing T so that the size of the context of a given query string P of length m can be found efficiently. For CPM, we propose a linear-work algorithm that either uses only internal memory, or a bounded amount of internal memory and external memory, which allows…
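To make the definitions concrete, here is a minimal brute-force sketch of both problems in Python. It is not the linear-work algorithm or the index proposed in the paper; it only illustrates what CPM and CPC compute. The function names (`contexts`, `cpm_naive`, `cpc_naive`) and the variable names (`T` for the text, `P` for the pattern, `m` for the pattern length, `l` and `r` for the left and right context lengths, `tau` for the threshold) are assumptions chosen for illustration.

```python
from collections import defaultdict


def contexts(T, m, l, r):
    """Map each length-m substring P of T to its context.

    The context of P is the set of pairs (L, R) with len(L) == l and
    len(R) == r such that L + P + R occurs in T. Occurrences of P that
    lack l letters to the left or r letters to the right are skipped.
    """
    ctx = defaultdict(set)
    # P starts at position i, so L + P + R spans T[i - l : i + m + r].
    for i in range(l, len(T) - m - r + 1):
        P = T[i:i + m]
        L = T[i - l:i]
        R = T[i + m:i + m + r]
        ctx[P].add((L, R))
    return ctx


def cpm_naive(T, m, l, r, tau):
    """CPM (illustrative): contexts of all length-m substrings whose
    context has at least tau distinct (L, R) pairs."""
    return {P: c for P, c in contexts(T, m, l, r).items() if len(c) >= tau}


def cpc_naive(T, P, l, r):
    """CPC (illustrative): size of the context of query P in T.
    Rescans T on every query; a real index would preprocess T once."""
    return len(contexts(T, len(P), l, r).get(P, ()))
```

For example, with `T = "abcabdabc"` and `l = r = 1`, the pattern `"ab"` occurs with full context as `"cabd"` and `"dabc"`, so `cpc_naive(T, "ab", 1, 1)` returns 2, and `cpm_naive(T, 2, 1, 1, 2)` reports only `"ab"`. The occurrence of `"ab"` at the very start of `T` has no left neighbor and so contributes nothing.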
