Space-efficient detection of unusual words

Djamal Belazzougui; Fabio Cunial

arXiv:1508.02968 · cs.DS · August 13, 2015 · 1 citation

Open Access

TL;DR

This paper takes a suffix-tree-based algorithm for detecting unusual words in large texts and re-engineers it to use only a small data structure built on the Burrows-Wheeler transform (BWT) plus a small stack, with the aim of scaling to genomes and metagenomes.

Contribution

The authors engineer a space-efficient version of a suffix-tree-based algorithm that needs only a BWT-based data structure and a stack of $O(\sigma^{2} \log^{2} n)$ bits, improving scalability for string mining tasks on large datasets.

Findings

01. Uses a small data structure built on the BWT in place of the suffix tree

02. Reduces space usage enough to handle large genomes

03. Enables detection of under-represented strings without full enumeration

Abstract

Detecting all the strings that occur in a text more frequently or less frequently than expected according to an IID or a Markov model is a basic problem in string mining, yet current algorithms are based on data structures that are either space-inefficient or incur large slowdowns, and current implementations cannot scale to genomes or metagenomes in practice. In this paper we engineer an algorithm based on the suffix tree of a string to use just a small data structure built on the Burrows-Wheeler transform, and a stack of $O(\sigma^{2} \log^{2} n)$ bits, where $n$ is the length of the string and $\sigma$ is the size of the alphabet. The size of the stack is $o(n)$ except for very large values of $\sigma$. We further improve the algorithm by removing its time dependency on $\sigma$, by reporting only a subset of the maximal repeats and of the minimal rare words of the string, and by…
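To make the abstract's two quantitative ideas concrete, here is a minimal brute-force sketch in Python. It is not the paper's algorithm, which avoids scanning the text for every candidate word. It only illustrates what "more or less frequent than expected under an IID model" means, and how small the quoted stack bound is next to $n$. The function names are hypothetical, the IID symbol probabilities are assumed to be the empirical frequencies in the text, and the hidden constant in the $O(\sigma^{2} \log^{2} n)$ bound is assumed to be 1.

```python
import math
from collections import Counter


def count_overlapping(text: str, word: str) -> int:
    """Count occurrences of word in text, allowing overlaps."""
    count = 0
    start = text.find(word)
    while start != -1:
        count += 1
        start = text.find(word, start + 1)
    return count


def iid_expected_count(text: str, word: str) -> float:
    """Expected number of occurrences of word under an IID model.

    Assumption: each symbol's probability is its empirical frequency
    in text. The expectation is (number of starting positions) times
    the product of the probabilities of the symbols of word.
    """
    n = len(text)
    freq = Counter(text)
    prob = 1.0
    for ch in word:
        prob *= freq[ch] / n
    return (n - len(word) + 1) * prob


def stack_bits(sigma: int, n: int) -> float:
    """Evaluate sigma^2 * log2(n)^2, i.e. the stack bound with its
    hidden constant set to 1 (an assumption, for scale only)."""
    return sigma ** 2 * math.log2(n) ** 2


if __name__ == "__main__":
    text = "abababab"
    observed = count_overlapping(text, "ab")
    expected = iid_expected_count(text, "ab")
    # A word is over-represented when observed > expected,
    # under-represented when observed < expected.
    print(observed, expected)
    print(stack_bits(4, 2 ** 32))
```

In the toy text `abababab`, the word `ab` is observed more often than the IID model predicts, so it would count as over-represented. For the stack bound, a DNA alphabet ($\sigma = 4$) and a text of a few billion symbols give a value on the order of $10^{4}$ bits under the constant-1 assumption, which is negligible next to $n$.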

Taxonomy

Topics: Algorithms and Data Compression · Genomics and Phylogenetic Studies · Natural Language Processing Techniques