Statistical linguistic study of DNA sequences
K. L. Ng, S. P. Li

TL;DR
This study applies a novel statistical linguistic approach using compound Poisson distributions to analyze DNA sequences, revealing non-random structures and subsequence patterns in nucleotide arrangements.
Contribution
Introduces a new family of compound Poisson distribution functions to model DNA sequence features, demonstrating their effectiveness in capturing sequence structure.
Findings
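Most of the DNA sequences studied follow the general shape of the compound Poisson distribution, based on the relative frequencies of 6-tuple and 7-tuple occurrences.
Some sequences fit the distribution at a reasonable level of goodness of fit under a chi-square test.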
DNA sequences are not purely random and may contain subsequence structures.
Abstract
A new family of compound Poisson distribution functions from statistical linguistics is used to study the n-tuples and nucleotide composition features of DNA sequences. Studies of the relative frequency distributions of 6-tuple and 7-tuple occurrences suggest that most of the DNA sequences follow the general shape of the compound Poisson distribution. The chi-square test also indicates that some of the sequences follow this distribution with a reasonable level of goodness of fit. The compositional segmentation study is also fitted quite well by this new family of distribution functions. Furthermore, the absolute values of the relative frequency come out naturally from the linguistic model without ambiguity. These results suggest that DNA sequences are not random sequences and that they could possibly have subsequence structures.
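As a rough illustration of the kind of n-tuple occurrence statistic the abstract describes, the following Python sketch counts overlapping n-tuples and tabulates the fraction of possible n-tuples that occur exactly k times. The paper's compound Poisson family is not specified in this summary, so the comparison here uses an ordinary Poisson distribution as a stand-in baseline; that substitution, the function names, the overlapping-window counting, and the `min_expected` cutoff are all assumptions made for illustration, not the authors' method.

```python
import math
import random
from collections import Counter


def ntuple_counts(seq, n, alphabet="ACGT"):
    """Count overlapping n-tuples, skipping windows with symbols outside the alphabet."""
    allowed = set(alphabet)
    return Counter(
        seq[i:i + n]
        for i in range(len(seq) - n + 1)
        if set(seq[i:i + n]) <= allowed
    )


def occurrence_spectrum(seq, n, alphabet="ACGT"):
    """Map k -> fraction of all possible n-tuples that occur exactly k times."""
    counts = ntuple_counts(seq, n, alphabet)
    total_types = len(alphabet) ** n
    spectrum = Counter(counts.values())
    spectrum[0] = total_types - len(counts)  # n-tuples that never occur
    return {k: v / total_types for k, v in sorted(spectrum.items())}


def chi_square_vs_poisson(seq, n, alphabet="ACGT", min_expected=5.0):
    """Chi-square statistic against a plain Poisson baseline (assumed stand-in).

    Bins whose expected count is below min_expected are simply skipped,
    which is a crude simplification rather than proper bin pooling.
    The Poisson mean assumes every window of the sequence is valid.
    """
    spectrum = occurrence_spectrum(seq, n, alphabet)
    total_types = len(alphabet) ** n
    lam = (len(seq) - n + 1) / total_types  # mean occurrences per n-tuple
    stat = 0.0
    for k in range(max(spectrum) + 1):
        # Poisson probability computed in log space to avoid overflow at large k.
        log_p = -lam + k * math.log(lam) - math.lgamma(k + 1)
        expected = total_types * math.exp(log_p)
        observed = spectrum.get(k, 0.0) * total_types
        if expected >= min_expected:
            stat += (observed - expected) ** 2 / expected
    return stat


if __name__ == "__main__":
    random.seed(0)
    # Synthetic uniform-random sequence; a real study would load genomic data.
    demo = "".join(random.choice("ACGT") for _ in range(20000))
    print(occurrence_spectrum(demo, 6))
    print(chi_square_vs_poisson(demo, 6))
```

Running the same functions on a real sequence and on a shuffled copy of it would give a simple way to see how far the observed spectrum departs from the random baseline; fitting the paper's own distribution family would require its explicit functional form.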
Taxonomy
Topics: Fractal and DNA sequence analysis · RNA and protein synthesis mechanisms
