Various improvements to text fingerprinting

Djamal Belazzougui; Roman Kolpakov; Mathieu Raffinot

arXiv:1301.3488·cs.DS·January 16, 2013

Open Access

TL;DR

This paper introduces new algorithms and data structures to efficiently compute and query all fingerprints of substrings within a text, addressing problems of enumeration, membership, and localization of fingerprints.

Contribution

It presents novel exact and approximate methods for computing fingerprints, checking fingerprint existence, and locating all maximal occurrences in the text.

Findings

01. Algorithms for computing all fingerprints are more efficient.

02. Data structures enable fast fingerprint membership queries.

03. Methods improve performance for substring fingerprint analysis.

Abstract

Let s = s_1 .. s_n be a text (or sequence) on a finite alphabet \Sigma of size \sigma. A fingerprint in s is the set of distinct characters appearing in one of its substrings. The problem considered here is to compute the set {\cal F} of all fingerprints of all substrings of s in order to efficiently answer certain questions on this set. A substring s_i .. s_j is a maximal location for a fingerprint f in {\cal F} (denoted by <i,j>) if the alphabet of s_i .. s_j is f and s_{i-1}, s_{j+1}, if defined, are not in f. The set of maximal locations in s is {\cal L} (it is easy to see that |{\cal L}| \leq n \sigma). Two maximal locations <i,j> and <k,l> such that s_i .. s_j = s_k .. s_l are named copies, and the quotient set of {\cal L} according to the copy relation is denoted by {\cal L}_C. We present new exact and approximate efficient algorithms and data structures for the following three…
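To make the definitions in the abstract concrete, the following Python sketch enumerates {\cal F}, {\cal L}, and the copy classes {\cal L}_C by brute force. It is only a naive quadratic illustration of the definitions, not the paper's algorithms; the function names `fingerprints_and_maximal_locations` and `copy_classes` are hypothetical, positions are reported 1-based to match the <i,j> notation, and the empty fingerprint is left out as an assumption.

```python
def fingerprints_and_maximal_locations(s):
    """Brute-force enumeration of fingerprints and maximal locations.

    Returns (F, L): F is a set of frozensets of characters, L is a set of
    1-based (i, j) pairs. Illustrative only; runs in quadratic time.
    """
    n = len(s)
    fingerprints = set()
    locations = set()
    for i in range(n):
        seen = set()  # alphabet of the substring s[i..j], grown as j advances
        for j in range(i, n):
            seen.add(s[j])
            fingerprints.add(frozenset(seen))
            # <i,j> is maximal if neither neighbouring character
            # (when it exists) belongs to the fingerprint.
            left_ok = i == 0 or s[i - 1] not in seen
            right_ok = j == n - 1 or s[j + 1] not in seen
            if left_ok and right_ok:
                locations.add((i + 1, j + 1))
    return fingerprints, locations


def copy_classes(s, locations):
    """Group maximal locations that spell the same substring (copies)."""
    classes = {}
    for i, j in sorted(locations):
        classes.setdefault(s[i - 1:j], []).append((i, j))
    return classes


if __name__ == "__main__":
    text = "abac"
    F, L = fingerprints_and_maximal_locations(text)
    print(sorted("".join(sorted(f)) for f in F))
    print(sorted(L))
    print(copy_classes(text, L))
```

For the text `abac`, the maximal locations <1,1> and <3,3> both spell `a`, so they are copies and fall into the same class of {\cal L}_C. The number of maximal locations found also respects the bound |{\cal L}| \leq n \sigma stated above.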

Taxonomy

Topics: Algorithms and Data Compression · DNA and Biological Computing · Semigroups and Automata Theory