Symbolic Complexity for Nucleotide Sequences: A Sign of the Genome Structure
R. Salgado-Garcia, E. Ugalde

TL;DR
This paper presents a new method to estimate the complexity of symbolic sequences, applied to genomes, revealing exponential and linear behaviors in complexity functions and similarities among related species.
Contribution
Introduces a novel complexity estimation technique for symbolic sequences and applies it to genomes, uncovering characteristic complexity patterns and phylogenetic similarities.
Findings
The estimated complexity functions of the genomes studied grow exponentially for short word lengths and linearly for longer word lengths.
Phylogenetically related species have similar complexity functions.
The method accurately estimates complexity in known symbolic dynamical systems.
Abstract
We introduce a method to estimate the complexity function of symbolic dynamical systems from a finite sequence of symbols. We test this complexity estimator on several symbolic dynamical systems whose complexity functions are known exactly. We use this technique to estimate the complexity function for genomes of several organisms under the assumption that a genome is a sequence produced by an (unknown) dynamical system. We show that the genomes of several organisms share the property that their complexity functions behave exponentially for words of small length and linearly for longer word lengths. It is also found that species which are phylogenetically close to each other have similar complexity functions calculated from a sample of their corresponding coding regions.
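For readers unfamiliar with the term, the complexity function of a symbolic sequence counts how many distinct words of each length occur in it. The sketch below computes only the naive empirical count on a finite string; it is not the estimator introduced in the paper, which is designed to work from finite samples. The function names `empirical_complexity` and `fibonacci_word` are illustrative assumptions, and the Fibonacci word is used as a check only because its complexity function is known exactly (a word of length ℓ has ℓ + 1 distinct variants).

```python
def empirical_complexity(seq, max_len):
    """Return {ell: number of distinct words of length ell in seq}.

    This is a plain count over a finite sequence (an illustrative
    baseline, not the paper's estimator).
    """
    counts = {}
    for ell in range(1, max_len + 1):
        # Slide a window of width ell over seq and collect distinct words.
        words = {seq[i:i + ell] for i in range(len(seq) - ell + 1)}
        counts[ell] = len(words)
    return counts


def fibonacci_word(n_iter):
    """Build a prefix of the Fibonacci word via the substitution a->ab, b->a."""
    w = "a"
    for _ in range(n_iter):
        w = "".join("ab" if c == "a" else "a" for c in w)
    return w


if __name__ == "__main__":
    # Known case: the Fibonacci word has exactly ell + 1 words of length ell.
    print(empirical_complexity(fibonacci_word(12), 5))
    # Toy nucleotide string (hypothetical data, for illustration only).
    print(empirical_complexity("ACGTACGGTA", 3))
```

On a finite sequence this count can never exceed the number of windows available, so it saturates for long words regardless of the underlying system, which is one reason a dedicated estimator is needed for genome-scale data.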
