Coding Sequence Density Estimation Via Topological Pressure
David Koslicki, Daniel J. Thompson

TL;DR
This paper introduces a novel method using topological pressure from ergodic theory to estimate coding sequence density across genomes, enabling coarse predictions of gene-rich regions and distinguishing introns from exons.
Contribution
It develops a new approach based on topological pressure for genomic analysis, linking ergodic theory with practical predictions of CDS density and sequence classification.
Findings
Gives coarse-scale CDS density estimates (windows of about 66,000 bp) for Mus musculus, rhesus macaque and Drosophila melanogaster after training on the human genome.
Uses triplet weightings to distinguish introns from exons.
Provides theoretical foundation via Thermodynamic Formalism.
Abstract
We give a new approach to coding sequence (CDS) density estimation in genomic analysis based on the topological pressure, which we develop from a well-known concept in ergodic theory. Topological pressure measures the "weighted information content" of a finite word, and incorporates 64 parameters which can be interpreted as a choice of weight for each nucleotide triplet. We train the parameters so that the topological pressure fits the observed coding sequence density on the human genome, and use this to give ab initio predictions of CDS density over windows of size around 66,000 bp on the genomes of Mus musculus, rhesus macaque and Drosophila melanogaster. While the differences between these genomes are too great to expect that training on the human genome could predict, for example, the exact locations of genes, we demonstrate that our method gives reasonable estimates for the…
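To make "weighted information content" concrete, here is a minimal Python sketch of one way such a quantity could be computed for a finite DNA word. The abstract does not state the formula, so the form used here is an assumption: the base-4 logarithm of the sum, over distinct length-n subwords, of the product of triplet weights along each subword, divided by n. The function names, the choice of n, and the normalization are all hypothetical and may differ from the paper's definition.

```python
import math
from itertools import product

ALPHABET = "ACGT"
# The 64 nucleotide triplets, one weight parameter per triplet.
TRIPLETS = ["".join(t) for t in product(ALPHABET, repeat=3)]


def subword_weight(u, weights):
    """Product of the weights of every overlapping triplet in u."""
    total = 1.0
    for i in range(len(u) - 2):
        total *= weights[u[i:i + 3]]
    return total


def topological_pressure(w, weights, n):
    """Assumed form: (1/n) * log_4( sum over distinct length-n subwords u
    of w of the product of triplet weights along u ).
    Requires 3 <= n <= len(w) and strictly positive weights."""
    if not 3 <= n <= len(w):
        raise ValueError("n must satisfy 3 <= n <= len(w)")
    subwords = {w[i:i + n] for i in range(len(w) - n + 1)}
    total = sum(subword_weight(u, weights) for u in subwords)
    return math.log(total, 4) / n


# With every weight equal to 1 the sum just counts distinct subwords,
# so the quantity reduces to an unweighted, entropy-like measure.
uniform = {t: 1.0 for t in TRIPLETS}
entropy_like = topological_pressure("ACGTACGTAC", uniform, 3)

# Up-weighting a single triplet raises the value for words containing it.
biased = dict(uniform, ACG=4.0)
weighted = topological_pressure("ACGTACGTAC", biased, 3)
```

In this sketch, training would mean adjusting the 64 entries of the weight dictionary so that the value computed on each genomic window tracks the observed CDS density of that window; the fitting procedure itself is not shown because the abstract does not describe it.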
Taxonomy
Topics: RNA and protein synthesis mechanisms · Machine Learning in Bioinformatics · Genomics and Phylogenetic Studies
