Markovian language model of the DNA and its information content
Shambhavi Srivastava, Murilo S. Baptista

TL;DR
This paper introduces a Markovian model for DNA sequences that simplifies their complexity, enabling accurate prediction of nucleotide group transitions and analysis of informational content using a network-based approach.
Contribution
The work presents a novel Markovian language model for DNA that captures its grammatical structure and allows for efficient analysis of genetic information and gene similarity detection.
Findings
High accuracy in predicting transitions between groups of words in the DNA
DNA of known bacteria reduced to a network of only tens of nodes for analysis
Identification of sequences responsible for most information content
Abstract
This work proposes a memoryless Markovian model of DNA that greatly reduces its complexity. We encode nucleotide sequences into symbolic sequences, called words, from which we establish a meaningful word length and the groups of words that share symbolic similarities. Interpreting a node as a group of similar words and an edge as their functional connectivity allows us to construct a network of the grammatical rules governing the appearance of groups of words in the DNA. Our model allows us to predict the transition between groups of words in the DNA with unprecedented accuracy, and to easily calculate many informational quantities that better characterize the DNA. In addition, we reduce the DNA of known bacteria to a network of only tens of nodes, show how our model can be used to detect similar (or dissimilar) genes in different organisms, and which sequences…
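The pipeline outlined in the abstract (encode nucleotides into words, estimate transitions, compute informational quantities) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a fixed word length `k`, non-overlapping words, and uses each distinct word as its own state instead of the paper's groups of similar words, whose construction is not given here. The function names `to_words`, `transition_probs`, and `entropy_rate` are hypothetical.

```python
from collections import Counter, defaultdict
import math


def to_words(seq, k):
    """Split a nucleotide string into non-overlapping words of length k.

    Assumption: fixed k and non-overlapping windows; trailing leftovers are dropped.
    """
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]


def transition_probs(words):
    """Estimate first-order Markov transition probabilities P(next | current)."""
    counts = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        counts[current][nxt] += 1
    return {
        current: {nxt: n / sum(row.values()) for nxt, n in row.items()}
        for current, row in counts.items()
    }


def entropy_rate(words, probs):
    """Entropy rate in bits per word.

    Each state's transition entropy is weighted by how often that state
    occurs as the source of a transition in the observed sequence.
    """
    sources = Counter(words[:-1])
    total = sum(sources.values())
    rate = 0.0
    for state, row in probs.items():
        state_entropy = -sum(p * math.log2(p) for p in row.values())
        rate += sources[state] / total * state_entropy
    return rate


if __name__ == "__main__":
    # Toy sequence for illustration only, not real genomic data.
    words = to_words("ACGTACGTACGA", 2)
    probs = transition_probs(words)
    print(words)
    print(probs)
    print(entropy_rate(words, probs))
```

In this sketch the nodes of the network are the keys of `probs` and the weighted edges are the nonzero transition probabilities. Replacing each word with a group label before calling `transition_probs` would shrink the state space, which is the kind of reduction the abstract describes.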
Taxonomy
Topics: Fractal and DNA sequence analysis · Machine Learning in Bioinformatics · RNA and protein synthesis mechanisms
