Average Size of a Suffix Tree for Markov Sources
Philippe Jacquet, Wojciech Szpankowski

TL;DR
This paper analyzes the average size of suffix trees generated from Markov sources, revealing an asymptotic equivalence to tries built from independent sequences, and provides a formula for trie size under Markov models.
Contribution
It extends the understanding of suffix tree size from memoryless to Markov sources and derives a new formula for trie size in this context.
Findings
Average suffix tree size asymptotically matches trie size for Markov sources
Derived a formula for trie size under Markovian models
Applied novel analytic combinatorics techniques to word patterns
Abstract
We study a suffix tree built from a sequence generated by a Markovian source. Such sources are more realistic probabilistic models for text generation, data compression, molecular applications, and so forth. We prove that the average size of such a suffix tree is asymptotically equivalent to the average size of a trie built over independent sequences from the same Markovian source. Previously, this equivalence was known only for memoryless sources. We then derive a formula for the size of a trie under a Markovian model to complete the analysis for suffix trees. We accomplish our goal by applying some novel techniques of analytic combinatorics on words, also known as analytic pattern matching.
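The equivalence stated in the abstract can be explored numerically. The sketch below is not from the paper: it assumes a binary first-order Markov chain with a hypothetical transition matrix, a uniform initial state, and the uncompressed-trie convention in which "size" counts every word that is a prefix of at least two of the stored strings. The function names (`markov_sequence`, `trie_size`, `suffix_tree_size`, `compare`) are illustrative only.

```python
import random


def markov_sequence(length, P, rng):
    """Sample a string over {0, ..., k-1} from transition matrix P.

    Assumption: the initial state is drawn uniformly at random.
    """
    symbols = list(range(len(P)))
    state = rng.choice(symbols)
    out = [state]
    for _ in range(length - 1):
        state = rng.choices(symbols, weights=P[state])[0]
        out.append(state)
    return "".join(str(s) for s in out)


def trie_size(strings):
    """Count internal nodes: words that are a prefix of >= 2 of the strings."""
    size = 0
    stack = [(0, list(strings))]
    while stack:
        depth, group = stack.pop()
        if len(group) < 2:
            continue  # a leaf (or empty branch), not an internal node
        size += 1
        buckets = {}
        for s in group:
            if depth < len(s):  # strings that end here drop out
                buckets.setdefault(s[depth], []).append(s)
        for bucket in buckets.values():
            stack.append((depth + 1, bucket))
    return size


def suffix_tree_size(text, n):
    """Size of the trie built over the first n suffixes of text."""
    return trie_size([text[i:] for i in range(n)])


def compare(n, P, seed=0, pad=200):
    """Return (suffix tree size, trie size) for one random experiment.

    pad is a hypothetical safety margin so that strings are long enough
    to be told apart; it is not a parameter from the paper.
    """
    rng = random.Random(seed)
    text = markov_sequence(n + pad, P, rng)
    independent = [markov_sequence(pad, P, rng) for _ in range(n)]
    return suffix_tree_size(text, n), trie_size(independent)


if __name__ == "__main__":
    P = [[0.8, 0.2], [0.3, 0.7]]  # hypothetical transition probabilities
    for seed in range(3):
        print(compare(1000, P, seed=seed))
```

A single run compares one random suffix tree with one random trie, so the two numbers will differ. The abstract's claim concerns averages as n grows, which would require averaging over many seeds and larger n to observe.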
Taxonomy
Topics: Algorithms and Data Compression · DNA and Biological Computing · Semigroups and Automata Theory
