Inference of Markovian Properties of Molecular Sequences from NGS Data and Applications to Comparative Genomics
Jie Ren, Kai Song, Minghua Deng, Gesine Reinert, Charles H. Cannon, and Fengzhu Sun

TL;DR
This paper develops statistical methods to infer Markov chain properties from short NGS reads, enabling alignment-free genomic comparisons and effective clustering of species based on sequence data.
Contribution
It introduces new inference techniques for Markovian properties from NGS short reads, including normal approximation and gamma distribution models, facilitating alignment-free genome analysis.
Findings
Estimated Markov order significantly impacts clustering accuracy.
Proposed methods effectively infer Markov properties from NGS data.
Clustering results align with known phylogenetic relationships.
Abstract
Next Generation Sequencing (NGS) technologies generate large amounts of short read data for many different organisms. The fact that NGS reads are generally short makes it challenging to assemble the reads and reconstruct the original genome sequence. For clustering genomes using such NGS data, word-count based alignment-free sequence comparison is a promising approach, but for this approach, the underlying expected word counts are essential. A plausible model for this underlying distribution of word counts is given through modelling the DNA sequence as a Markov chain (MC). For single long sequences, efficient statistics are available to estimate the order of MCs and the transition probability matrix for the sequences. As NGS data do not provide a single long sequence, inference methods on Markovian properties of sequences based on single long sequences cannot be directly used for NGS…
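The Markov chain model mentioned in the abstract can be illustrated with a minimal sketch. It pools word counts across short reads and converts them into estimated order-k transition probabilities. This is a naive plug-in estimator written for illustration, not the inference procedure developed in the paper. The function names (`count_words`, `transition_probs`) and the toy reads are hypothetical, and the sketch ignores reverse complements, sequencing errors, and dependence between overlapping reads.

```python
from collections import Counter

ALPHABET = set("ACGT")

def count_words(reads, w):
    """Pool counts of all length-w words over a collection of reads.

    Words containing characters outside A, C, G, T (e.g. N) are skipped.
    """
    counts = Counter()
    for read in reads:
        for i in range(len(read) - w + 1):
            word = read[i:i + w]
            if set(word) <= ALPHABET:
                counts[word] += 1
    return counts

def transition_probs(reads, k):
    """Estimate order-k transition probabilities from pooled counts.

    Returns a dict mapping each observed (k+1)-word to the estimated
    probability of its last letter given its first k letters.
    """
    counts = count_words(reads, k + 1)
    # Total number of observed (k+1)-words starting with each k-letter context.
    context_totals = Counter()
    for word, n in counts.items():
        context_totals[word[:k]] += n
    return {word: n / context_totals[word[:k]] for word, n in counts.items()}

# Toy usage with made-up reads (assumption: reads are plain strings).
reads = ["ACGTAC", "GTACGT", "TACGTA"]
probs = transition_probs(reads, k=1)
```

Because counts are pooled over reads rather than taken from one long sequence, words spanning read boundaries are never observed; that is one reason single-sequence inference does not carry over directly to NGS data.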
Taxonomy
Topics: Genomics and Phylogenetic Studies · Genetic diversity and population structure · Fractal and DNA sequence analysis
