The Compressed Overlap Index

Rodrigo Canovas; Bastien Cazaux; Eric Rivals

arXiv:1707.05613·cs.DS·July 19, 2017

TL;DR

The paper introduces COvI, a compressed overlap index that efficiently stores and queries overlaps among large sets of words, outperforming traditional methods in memory usage and query speed.

Contribution

We designed COvI, a hierarchical, non-redundant data structure for overlap queries, demonstrating significant improvements over the baseline in memory efficiency and query performance.

Findings

1. COvI handles millions of words efficiently.

2. COvI uses half the memory of the baseline.

3. COvI answers complex overlap queries faster.

Abstract

For analysing text algorithms, for computing superstrings, or for testing random number generators, one needs to compute all overlaps between any pairs of words in a given set. The positions of overlaps of a word onto itself, or of two words, are needed to compute the absence probability of a word in a random text, or the number of common words shared by two random texts. In all these contexts, one needs to compute or to query overlaps between pairs of words in a given set. To this end, we designed COvI, a compressed overlap index that supports multiple queries on overlaps, such as computing the correlation of two words, or listing pairs of words whose longest overlap is maximal among all possible pairs. COvI stores overlaps in a hierarchical and non-redundant manner. We propose an implementation that can handle datasets of millions of words and still answer queries efficiently.…
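
To make the queries named in the abstract concrete, here is a minimal brute-force sketch in Python. It is not COvI and does not reflect the paper's data structure; it only illustrates what an overlap query returns. It assumes that an overlap of u onto v is a non-empty suffix of u that equals a prefix of v, and the function names (`overlap_lengths`, `longest_overlap`, `max_overlap_pairs`) are hypothetical, chosen for this illustration.

```python
def overlap_lengths(u: str, v: str) -> list[int]:
    """Return every k >= 1 such that the last k letters of u equal
    the first k letters of v, in increasing order (assumed definition)."""
    limit = min(len(u), len(v))
    return [k for k in range(1, limit + 1) if u[-k:] == v[:k]]


def longest_overlap(u: str, v: str) -> int:
    """Length of the longest overlap of u onto v, or 0 if there is none."""
    lengths = overlap_lengths(u, v)
    return lengths[-1] if lengths else 0


def max_overlap_pairs(words: list[str]) -> tuple[int, list[tuple[str, str]]]:
    """List the ordered pairs of distinct words whose longest overlap
    is maximal among all pairs. Quadratic in the number of words."""
    best = 0
    pairs: list[tuple[str, str]] = []
    for i, u in enumerate(words):
        for j, v in enumerate(words):
            if i == j:
                continue  # skip a word paired with itself
            k = longest_overlap(u, v)
            if k == 0:
                continue  # no overlap at all, nothing to record
            if k > best:
                best, pairs = k, [(u, v)]
            elif k == best:
                pairs.append((u, v))
    return best, pairs
```

For instance, `overlap_lengths("abab", "abab")` gives the self-overlaps of the word `abab`. The nested loop compares every ordered pair from scratch, which is the kind of redundant work an index over a large word set is meant to avoid.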
