The Compressed Overlap Index
Rodrigo Canovas, Bastien Cazaux, Eric Rivals

TL;DR
The paper introduces COvI, a compressed overlap index that stores the overlaps among large sets of words and supports queries on them, using less memory and answering queries faster than the baseline it is compared against.
Contribution
We designed COvI, a hierarchical, non-redundant data structure for overlap queries. Compared with the baseline, it uses about half the memory and answers complex overlap queries faster.
Findings
COvI handles millions of words efficiently.
COvI uses half the memory of the baseline.
COvI answers complex overlap queries faster.
Abstract
For analysing text algorithms, for computing superstrings, or for testing random number generators, one needs to compute all overlaps between any pair of words in a given set. The positions of overlaps of a word onto itself, or of two words, are needed to compute the absence probability of a word in a random text, or the number of common words shared by two random texts. In all these contexts, one needs to compute or to query overlaps between pairs of words in a given set. To this end, we designed COvI, a compressed overlap index that supports multiple queries on overlaps, such as computing the correlation of two words, or listing pairs of words whose longest overlap is maximal among all possible pairs. COvI stores overlaps in a hierarchical and non-redundant manner. We propose an implementation that can handle datasets of millions of words and still answer queries efficiently.…
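To make the two queries named in the abstract concrete, here is a brute-force sketch in Python. It is not COvI and does not reflect the authors' implementation: it assumes that an overlap from u to v is a suffix of u that is also a prefix of v, and that the correlation of u over v is the bit vector marking the starting positions of such suffixes in u. The function names (correlation, longest_overlap, max_overlap_pairs) are hypothetical and chosen only for illustration.

```python
def correlation(u: str, v: str) -> list[int]:
    """Assumed definition: bit i is 1 when the suffix u[i:] is a prefix of v."""
    return [1 if v.startswith(u[i:]) else 0 for i in range(len(u))]


def longest_overlap(u: str, v: str) -> int:
    """Length of the longest suffix of u that is also a prefix of v (0 if none)."""
    for i in range(len(u)):
        # Suffixes are tried from longest to shortest, so the first hit is maximal.
        if v.startswith(u[i:]):
            return len(u) - i
    return 0


def max_overlap_pairs(words: list[str]) -> tuple[int, list[tuple[str, str]]]:
    """Ordered pairs of distinct entries whose longest overlap is maximal.

    Naive scan over all ordered pairs, for intuition only; an index is
    meant to avoid exactly this quadratic work.
    """
    best, pairs = 0, []
    for a, u in enumerate(words):
        for b, v in enumerate(words):
            if a == b:
                continue  # self-overlaps are skipped in this sketch
            k = longest_overlap(u, v)
            if k > best:
                best, pairs = k, [(u, v)]
            elif k == best and k > 0:
                pairs.append((u, v))
    return best, pairs


if __name__ == "__main__":
    print(correlation("abracadabra", "abracadabra"))
    print(max_overlap_pairs(["abcd", "cdef", "efab"]))
```

The self-correlation of a word always has bit 0 set, since the word overlaps itself completely. The remaining set bits mark its proper self-overlaps, which are the positions the abstract refers to when computing the absence probability of a word in a random text.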
