Segmenting DNA sequence into `words'
Wang Liang

TL;DR
This paper introduces an unsupervised, n-gram-based method for segmenting DNA sequences into 'words', reports that most such words are 12 to 15 bp long across the genomes of 12 model species, and proposes a benchmark for evaluating segmentation methods.
Contribution
It presents a novel unsupervised segmentation approach for DNA sequences using statistical language models and establishes a benchmark for performance assessment.
Findings
Most DNA words are 12 to 15 base pairs long
The proposed method effectively segments DNA sequences
A benchmark for DNA segmentation methods is introduced
Abstract
This paper presents a novel method to segment/decode DNA sequences based on an n-gram statistical language model. First, by analyzing the genomes of 12 model species, we find that most DNA 'words' are 12 to 15 bp long. We then design an unsupervised, probability-based approach to segment DNA sequences. A benchmark for segmentation methods is also proposed.
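The abstract does not spell out the probability model, so the following is only a minimal sketch of what an unsupervised, probability-based segmentation could look like, not the paper's actual algorithm. It assumes a unigram word model in which each candidate word is scored by its smoothed relative frequency among k-mers of the same length, and it finds the highest-scoring split by dynamic programming. The function names (`kmer_counts`, `segment`), the add-one smoothing, and the toy length range of 1 to 4 are all illustrative assumptions; the word lengths reported in the paper would correspond to `min_len=12, max_len=15` on genome-scale input.

```python
import math
from collections import Counter


def kmer_counts(seq, min_len, max_len):
    """Count every substring of seq with length min_len..max_len."""
    counts = Counter()
    for k in range(min_len, max_len + 1):
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def segment(seq, counts, min_len, max_len):
    """Split seq into words maximizing the summed log-probability.

    Assumption: a word's probability is its add-one-smoothed relative
    frequency among k-mers of the same length over the alphabet ACGT.
    """
    # Total number of observed k-mers for each length k.
    totals = Counter()
    for word, c in counts.items():
        totals[len(word)] += c

    def logp(word):
        k = len(word)
        # 4 ** k is the number of possible k-mers (smoothing vocabulary).
        return math.log((counts.get(word, 0) + 1) / (totals[k] + 4 ** k))

    n = len(seq)
    best = [-math.inf] * (n + 1)  # best[i]: best score for seq[:i]
    back = [0] * (n + 1)          # back[i]: start of the last word in seq[:i]
    best[0] = 0.0
    for end in range(1, n + 1):
        for k in range(min_len, max_len + 1):
            start = end - k
            if start < 0 or best[start] == -math.inf:
                continue
            score = best[start] + logp(seq[start:end])
            if score > best[end]:
                best[end] = score
                back[end] = start

    if best[n] == -math.inf:
        raise ValueError("sequence cannot be tiled with the given word lengths")

    # Walk the back-pointers to recover the words in order.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(seq[start:end])
        end = start
    return words[::-1]


# Toy usage on a made-up sequence (not data from the paper).
seq = "ACGTACGTTTGACGTACGTTTGACGA"
counts = kmer_counts(seq, 1, 4)
words = segment(seq, counts, 1, 4)
print(" ".join(words))
```

Scoring each word against k-mers of its own length keeps the comparison between short and long candidates simple, but it is only one of several reasonable normalizations; a higher-order n-gram model over words would need a different scoring function.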
Taxonomy
Topics: Genomics and Phylogenetic Studies · RNA and protein synthesis mechanisms · Algorithms and Data Compression
